* bcachefs status update (it's done cooking; let's get this sucker merged)
@ 2019-06-10 19:14 Kent Overstreet
  2019-06-10 19:14 ` [PATCH 01/12] Compiler Attributes: add __flatten Kent Overstreet
                   ` (14 more replies)
  0 siblings, 15 replies; 42+ messages in thread
From: Kent Overstreet @ 2019-06-10 19:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcache
  Cc: Kent Overstreet, Dave Chinner, Darrick J. Wong, Zach Brown,
	Peter Zijlstra, Jens Axboe, Josef Bacik, Alexander Viro,
	Andrew Morton, Linus Torvalds, Tejun Heo

Last status update: https://lkml.org/lkml/2018/12/2/46

Current status - I'm pretty much running out of things to polish and excuses to
keep tinkering. The core featureset is _done_ and the list of known outstanding
bugs is getting to be short and unexciting. The next big things on my todo list
are finishing erasure coding and reflink, but there's no reason for merging to
wait on those.

So. Here's my bcachefs-for-review branch - this has the minimal set of patches
outside of fs/bcachefs/. My master branch has some performance optimizations for
the core buffered IO paths, but those are fairly tricky and invasive so I want
to hold off on those for now - this branch is intended to be more or less
suitable for merging as is.

https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-for-review

The list of non-bcachefs patches is:

closures: fix a race on wakeup from closure_sync
closures: closure_wait_event()
bcache: move closures to lib/
bcache: optimize continue_at_nobarrier()
block: Add some exports for bcachefs
Propagate gfp_t when allocating pte entries from __vmalloc
fs: factor out d_mark_tmpfile()
fs: insert_inode_locked2()
mm: export find_get_pages()
mm: pagecache add lock
locking: SIX locks (shared/intent/exclusive)
Compiler Attributes: add __flatten

Most of the patches are pretty small; of the ones that aren't:

 - SIX locks have already been discussed, and seem to be pretty uncontroversial.

 - pagecache add lock: it's kind of ugly, but necessary to rigorously prevent
   page cache inconsistencies with dio and other operations, in particular
   racing vs. page faults - honestly, it's criminal that we still don't have a
   mechanism in the kernel to address this; other filesystems are susceptible
   to these kinds of bugs too.

   My patch is intentionally ugly in the hopes that someone else will come up
   with a magical elegant solution, but in the meantime it's an "it's ugly but
   it works" sort of thing, and I suspect in real world scenarios it's going to
   beat any kind of range locking performance-wise, which is the only
   alternative I've heard discussed. (A caller-side sketch follows this list.)
   
 - Propagate gfp_t from __vmalloc() - bcachefs needs __vmalloc() to respect
   GFP_NOFS; that's all this is.

 - and, moving closures out of drivers/md/bcache to lib/. 
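
Roughly, the intended usage of the pagecache add lock looks like this (a sketch
only, not code from the series - inode/start/end are placeholders): an
operation that must not race with pages being added holds the "block" side,
while page cache insertion takes the "add" side:

	struct address_space *mapping = inode->i_mapping;

	pagecache_block_get(&mapping->add_lock);
	truncate_pagecache_range(inode, start, end);
	/* ...collapse/insert range, punch holes, etc. - no new pages can be
	 * added to the mapping until the lock is dropped... */
	pagecache_block_put(&mapping->add_lock);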

The rest of the tree is 62k lines of code in fs/bcachefs. So, I obviously won't
be mailing out all of that as patches, but if any code reviewers have
suggestions on what would make that go easier, go ahead and speak up. The last
time I was mailing things out for review, the main thing that came up was
ioctls, but the ioctl interface hasn't really changed since then. I'm pretty
confident in the on-disk format stuff, which was the other thing that was
mentioned.

----------

This has been a monumental effort over a lot of years, and I'm _really_ happy
with how it's turned out. I'm excited to finally unleash this upon the world.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 01/12] Compiler Attributes: add __flatten
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
@ 2019-06-10 19:14 ` Kent Overstreet
  2019-06-12 17:16   ` Greg KH
  2019-06-10 19:14 ` [PATCH 02/12] locking: SIX locks (shared/intent/exclusive) Kent Overstreet
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 42+ messages in thread
From: Kent Overstreet @ 2019-06-10 19:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcache; +Cc: Kent Overstreet

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 include/linux/compiler_attributes.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/compiler_attributes.h b/include/linux/compiler_attributes.h
index 6b318efd8a..48b2c6ae6f 100644
--- a/include/linux/compiler_attributes.h
+++ b/include/linux/compiler_attributes.h
@@ -253,4 +253,9 @@
  */
 #define __weak                          __attribute__((__weak__))
 
+/*
+ *   gcc: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-flatten-function-attribute
+ */
+#define __flatten __attribute__((flatten))
+
 #endif /* __LINUX_COMPILER_ATTRIBUTES_H */
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 02/12] locking: SIX locks (shared/intent/exclusive)
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
  2019-06-10 19:14 ` [PATCH 01/12] Compiler Attributes: add __flatten Kent Overstreet
@ 2019-06-10 19:14 ` Kent Overstreet
  2019-06-10 19:14 ` [PATCH 03/12] mm: pagecache add lock Kent Overstreet
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Kent Overstreet @ 2019-06-10 19:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcache; +Cc: Kent Overstreet

New lock for bcachefs, like read/write locks but with a third state,
intent.

Intent locks conflict with each other, but not with read locks; taking a
write lock requires first holding an intent lock.
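
Roughly, the relock pattern the sequence number enables looks like this (a
sketch only; b is a placeholder for whatever structure embeds the lock):

	u32 seq;

	six_lock_read(&b->lock);
	seq = b->lock.state.seq;
	six_unlock_read(&b->lock);

	/* ...block, do work that doesn't need the lock held... */

	if (!six_relock_read(&b->lock, seq)) {
		/* lock was taken for write in the meantime - restart */
	}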

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 include/linux/six.h     | 192 +++++++++++++++
 kernel/Kconfig.locks    |   3 +
 kernel/locking/Makefile |   1 +
 kernel/locking/six.c    | 512 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 708 insertions(+)
 create mode 100644 include/linux/six.h
 create mode 100644 kernel/locking/six.c

diff --git a/include/linux/six.h b/include/linux/six.h
new file mode 100644
index 0000000000..0fb1b2f493
--- /dev/null
+++ b/include/linux/six.h
@@ -0,0 +1,192 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_SIX_H
+#define _LINUX_SIX_H
+
+/*
+ * Shared/intent/exclusive locks: sleepable read/write locks, much like rw
+ * semaphores, except with a third intermediate state, intent. Basic operations
+ * are:
+ *
+ * six_lock_read(&foo->lock);
+ * six_unlock_read(&foo->lock);
+ *
+ * six_lock_intent(&foo->lock);
+ * six_unlock_intent(&foo->lock);
+ *
+ * six_lock_write(&foo->lock);
+ * six_unlock_write(&foo->lock);
+ *
+ * Intent locks block other intent locks, but do not block read locks, and you
+ * must have an intent lock held before taking a write lock, like so:
+ *
+ * six_lock_intent(&foo->lock);
+ * six_lock_write(&foo->lock);
+ * six_unlock_write(&foo->lock);
+ * six_unlock_intent(&foo->lock);
+ *
+ * Other operations:
+ *
+ *   six_trylock_read()
+ *   six_trylock_intent()
+ *   six_trylock_write()
+ *
+ *   six_lock_downgrade():	convert from intent to read
+ *   six_lock_tryupgrade():	attempt to convert from read to intent
+ *
+ * Locks also embed a sequence number, which is incremented when the lock is
+ * locked or unlocked for write. The current sequence number can be grabbed
+ * while a lock is held from lock->state.seq; then, if you drop the lock you can
+ * use six_relock_(read|intent|write)(lock, seq) to attempt to retake the lock
+ * iff it hasn't been locked for write in the meantime.
+ *
+ * There are also operations that take the lock type as a parameter, where the
+ * type is one of SIX_LOCK_read, SIX_LOCK_intent, or SIX_LOCK_write:
+ *
+ *   six_lock_type(lock, type)
+ *   six_unlock_type(lock, type)
+ *   six_relock(lock, type, seq)
+ *   six_trylock_type(lock, type)
+ *   six_trylock_convert(lock, from, to)
+ *
+ * A lock may be held multiple times by the same thread (for read or intent,
+ * not write). However, the six locks code does _not_ implement the actual
+ * recursive checks itself - rather, if your code (e.g. btree iterator
+ * code) knows that the current thread already has a lock held, and for the
+ * correct type, six_lock_increment() may be used to bump up the counter for
+ * that type - the only effect is that one more call to unlock will be required
+ * before the lock is unlocked.
+ */
+
+#include <linux/lockdep.h>
+#include <linux/osq_lock.h>
+#include <linux/sched.h>
+#include <linux/types.h>
+
+#define SIX_LOCK_SEPARATE_LOCKFNS
+
+union six_lock_state {
+	struct {
+		atomic64_t	counter;
+	};
+
+	struct {
+		u64		v;
+	};
+
+	struct {
+		/* for waitlist_bitnr() */
+		unsigned long	l;
+	};
+
+	struct {
+		unsigned	read_lock:28;
+		unsigned	intent_lock:1;
+		unsigned	waiters:3;
+		/*
+		 * seq works much like in seqlocks: it's incremented every time
+		 * we lock and unlock for write.
+		 *
+		 * If it's odd, a write lock is held; if it's even, unlocked.
+		 *
+		 * Thus readers can unlock, and then lock again later iff it
+		 * hasn't been modified in the meantime.
+		 */
+		u32		seq;
+	};
+};
+
+enum six_lock_type {
+	SIX_LOCK_read,
+	SIX_LOCK_intent,
+	SIX_LOCK_write,
+};
+
+struct six_lock {
+	union six_lock_state	state;
+	unsigned		intent_lock_recurse;
+	struct task_struct	*owner;
+	struct optimistic_spin_queue osq;
+
+	raw_spinlock_t		wait_lock;
+	struct list_head	wait_list[2];
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map	dep_map;
+#endif
+};
+
+static __always_inline void __six_lock_init(struct six_lock *lock,
+					    const char *name,
+					    struct lock_class_key *key)
+{
+	atomic64_set(&lock->state.counter, 0);
+	raw_spin_lock_init(&lock->wait_lock);
+	INIT_LIST_HEAD(&lock->wait_list[SIX_LOCK_read]);
+	INIT_LIST_HEAD(&lock->wait_list[SIX_LOCK_intent]);
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	debug_check_no_locks_freed((void *) lock, sizeof(*lock));
+	lockdep_init_map(&lock->dep_map, name, key, 0);
+#endif
+}
+
+#define six_lock_init(lock)						\
+do {									\
+	static struct lock_class_key __key;				\
+									\
+	__six_lock_init((lock), #lock, &__key);				\
+} while (0)
+
+#define __SIX_VAL(field, _v)	(((union six_lock_state) { .field = _v }).v)
+
+#define __SIX_LOCK(type)						\
+bool six_trylock_##type(struct six_lock *);				\
+bool six_relock_##type(struct six_lock *, u32);				\
+void six_lock_##type(struct six_lock *);				\
+void six_unlock_##type(struct six_lock *);
+
+__SIX_LOCK(read)
+__SIX_LOCK(intent)
+__SIX_LOCK(write)
+#undef __SIX_LOCK
+
+#define SIX_LOCK_DISPATCH(type, fn, ...)			\
+	switch (type) {						\
+	case SIX_LOCK_read:					\
+		return fn##_read(__VA_ARGS__);			\
+	case SIX_LOCK_intent:					\
+		return fn##_intent(__VA_ARGS__);		\
+	case SIX_LOCK_write:					\
+		return fn##_write(__VA_ARGS__);			\
+	default:						\
+		BUG();						\
+	}
+
+static inline bool six_trylock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	SIX_LOCK_DISPATCH(type, six_trylock, lock);
+}
+
+static inline bool six_relock_type(struct six_lock *lock, enum six_lock_type type,
+		     unsigned seq)
+{
+	SIX_LOCK_DISPATCH(type, six_relock, lock, seq);
+}
+
+static inline void six_lock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	SIX_LOCK_DISPATCH(type, six_lock, lock);
+}
+
+static inline void six_unlock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	SIX_LOCK_DISPATCH(type, six_unlock, lock);
+}
+
+void six_lock_downgrade(struct six_lock *);
+bool six_lock_tryupgrade(struct six_lock *);
+bool six_trylock_convert(struct six_lock *, enum six_lock_type,
+			 enum six_lock_type);
+
+void six_lock_increment(struct six_lock *, enum six_lock_type);
+
+#endif /* _LINUX_SIX_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index fbba478ae5..ff3e6121ae 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -251,3 +251,6 @@ config ARCH_USE_QUEUED_RWLOCKS
 config QUEUED_RWLOCKS
 	def_bool y if ARCH_USE_QUEUED_RWLOCKS
 	depends on SMP
+
+config SIXLOCKS
+	bool
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 392c7f23af..9a73c8564e 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -30,3 +30,4 @@ obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem-xadd.o
 obj-$(CONFIG_QUEUED_RWLOCKS) += qrwlock.o
 obj-$(CONFIG_LOCK_TORTURE_TEST) += locktorture.o
 obj-$(CONFIG_WW_MUTEX_SELFTEST) += test-ww_mutex.o
+obj-$(CONFIG_SIXLOCKS) += six.o
diff --git a/kernel/locking/six.c b/kernel/locking/six.c
new file mode 100644
index 0000000000..9fa58b6fad
--- /dev/null
+++ b/kernel/locking/six.c
@@ -0,0 +1,512 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/export.h>
+#include <linux/log2.h>
+#include <linux/preempt.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+#include <linux/sched/rt.h>
+#include <linux/six.h>
+
+#ifdef DEBUG
+#define EBUG_ON(cond)		BUG_ON(cond)
+#else
+#define EBUG_ON(cond)		do {} while (0)
+#endif
+
+#define six_acquire(l, t)	lock_acquire(l, 0, t, 0, 0, NULL, _RET_IP_)
+#define six_release(l)		lock_release(l, 0, _RET_IP_)
+
+struct six_lock_vals {
+	/* Value we add to the lock in order to take the lock: */
+	u64			lock_val;
+
+	/* If the lock has this value (used as a mask), taking the lock fails: */
+	u64			lock_fail;
+
+	/* Value we add to the lock in order to release the lock: */
+	u64			unlock_val;
+
+	/* Mask that indicates lock is held for this type: */
+	u64			held_mask;
+
+	/* Waitlist we wakeup when releasing the lock: */
+	enum six_lock_type	unlock_wakeup;
+};
+
+#define __SIX_LOCK_HELD_read	__SIX_VAL(read_lock, ~0)
+#define __SIX_LOCK_HELD_intent	__SIX_VAL(intent_lock, ~0)
+#define __SIX_LOCK_HELD_write	__SIX_VAL(seq, 1)
+
+#define LOCK_VALS {							\
+	[SIX_LOCK_read] = {						\
+		.lock_val	= __SIX_VAL(read_lock, 1),		\
+		.lock_fail	= __SIX_LOCK_HELD_write,		\
+		.unlock_val	= -__SIX_VAL(read_lock, 1),		\
+		.held_mask	= __SIX_LOCK_HELD_read,			\
+		.unlock_wakeup	= SIX_LOCK_write,			\
+	},								\
+	[SIX_LOCK_intent] = {						\
+		.lock_val	= __SIX_VAL(intent_lock, 1),		\
+		.lock_fail	= __SIX_LOCK_HELD_intent,		\
+		.unlock_val	= -__SIX_VAL(intent_lock, 1),		\
+		.held_mask	= __SIX_LOCK_HELD_intent,		\
+		.unlock_wakeup	= SIX_LOCK_intent,			\
+	},								\
+	[SIX_LOCK_write] = {						\
+		.lock_val	= __SIX_VAL(seq, 1),			\
+		.lock_fail	= __SIX_LOCK_HELD_read,			\
+		.unlock_val	= __SIX_VAL(seq, 1),			\
+		.held_mask	= __SIX_LOCK_HELD_write,		\
+		.unlock_wakeup	= SIX_LOCK_read,			\
+	},								\
+}
+
+static inline void six_set_owner(struct six_lock *lock, enum six_lock_type type,
+				 union six_lock_state old)
+{
+	if (type != SIX_LOCK_intent)
+		return;
+
+	if (!old.intent_lock) {
+		EBUG_ON(lock->owner);
+		lock->owner = current;
+	} else {
+		EBUG_ON(lock->owner != current);
+	}
+}
+
+static __always_inline bool do_six_trylock_type(struct six_lock *lock,
+						enum six_lock_type type)
+{
+	const struct six_lock_vals l[] = LOCK_VALS;
+	union six_lock_state old;
+	u64 v = READ_ONCE(lock->state.v);
+
+	EBUG_ON(type == SIX_LOCK_write && lock->owner != current);
+
+	do {
+		old.v = v;
+
+		EBUG_ON(type == SIX_LOCK_write &&
+			((old.v & __SIX_LOCK_HELD_write) ||
+			 !(old.v & __SIX_LOCK_HELD_intent)));
+
+		if (old.v & l[type].lock_fail)
+			return false;
+	} while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
+				old.v,
+				old.v + l[type].lock_val)) != old.v);
+
+	six_set_owner(lock, type, old);
+	return true;
+}
+
+__always_inline __flatten
+static bool __six_trylock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	if (!do_six_trylock_type(lock, type))
+		return false;
+
+	six_acquire(&lock->dep_map, 1);
+	return true;
+}
+
+__always_inline __flatten
+static bool __six_relock_type(struct six_lock *lock, enum six_lock_type type,
+			      unsigned seq)
+{
+	const struct six_lock_vals l[] = LOCK_VALS;
+	union six_lock_state old;
+	u64 v = READ_ONCE(lock->state.v);
+
+	do {
+		old.v = v;
+
+		if (old.seq != seq || old.v & l[type].lock_fail)
+			return false;
+	} while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
+				old.v,
+				old.v + l[type].lock_val)) != old.v);
+
+	six_set_owner(lock, type, old);
+	six_acquire(&lock->dep_map, 1);
+	return true;
+}
+
+struct six_lock_waiter {
+	struct list_head	list;
+	struct task_struct	*task;
+};
+
+/* This is probably up there with the more evil things I've done */
+#define waitlist_bitnr(id) ilog2((((union six_lock_state) { .waiters = 1 << (id) }).l))
+
+#ifdef CONFIG_LOCK_SPIN_ON_OWNER
+
+static inline int six_can_spin_on_owner(struct six_lock *lock)
+{
+	struct task_struct *owner;
+	int retval = 1;
+
+	if (need_resched())
+		return 0;
+
+	rcu_read_lock();
+	owner = READ_ONCE(lock->owner);
+	if (owner)
+		retval = owner->on_cpu;
+	rcu_read_unlock();
+	/*
+	 * if lock->owner is not set, the lock owner may have just acquired
+	 * it and not set the owner yet, or the lock has been released.
+	 */
+	return retval;
+}
+
+static inline bool six_spin_on_owner(struct six_lock *lock,
+				     struct task_struct *owner)
+{
+	bool ret = true;
+
+	rcu_read_lock();
+	while (lock->owner == owner) {
+		/*
+		 * Ensure we emit the owner->on_cpu, dereference _after_
+		 * checking lock->owner still matches owner. If that fails,
+		 * owner might point to freed memory. If it still matches,
+		 * the rcu_read_lock() ensures the memory stays valid.
+		 */
+		barrier();
+
+		if (!owner->on_cpu || need_resched()) {
+			ret = false;
+			break;
+		}
+
+		cpu_relax();
+	}
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static inline bool six_optimistic_spin(struct six_lock *lock, enum six_lock_type type)
+{
+	struct task_struct *task = current;
+
+	if (type == SIX_LOCK_write)
+		return false;
+
+	preempt_disable();
+	if (!six_can_spin_on_owner(lock))
+		goto fail;
+
+	if (!osq_lock(&lock->osq))
+		goto fail;
+
+	while (1) {
+		struct task_struct *owner;
+
+		/*
+		 * If there's an owner, wait for it to either
+		 * release the lock or go to sleep.
+		 */
+		owner = READ_ONCE(lock->owner);
+		if (owner && !six_spin_on_owner(lock, owner))
+			break;
+
+		if (do_six_trylock_type(lock, type)) {
+			osq_unlock(&lock->osq);
+			preempt_enable();
+			return true;
+		}
+
+		/*
+		 * When there's no owner, we might have preempted between the
+		 * owner acquiring the lock and setting the owner field. If
+		 * we're an RT task that will live-lock because we won't let
+		 * the owner complete.
+		 */
+		if (!owner && (need_resched() || rt_task(task)))
+			break;
+
+		/*
+		 * The cpu_relax() call is a compiler barrier which forces
+		 * everything in this loop to be re-loaded. We don't need
+		 * memory barriers as we'll eventually observe the right
+		 * values at the cost of a few extra spins.
+		 */
+		cpu_relax();
+	}
+
+	osq_unlock(&lock->osq);
+fail:
+	preempt_enable();
+
+	/*
+	 * If we fell out of the spin path because of need_resched(),
+	 * reschedule now, before we try-lock again. This avoids getting
+	 * scheduled out right after we obtained the lock.
+	 */
+	if (need_resched())
+		schedule();
+
+	return false;
+}
+
+#else /* CONFIG_LOCK_SPIN_ON_OWNER */
+
+static inline bool six_optimistic_spin(struct six_lock *lock, enum six_lock_type type)
+{
+	return false;
+}
+
+#endif
+
+noinline
+static void __six_lock_type_slowpath(struct six_lock *lock, enum six_lock_type type)
+{
+	const struct six_lock_vals l[] = LOCK_VALS;
+	union six_lock_state old, new;
+	struct six_lock_waiter wait;
+	u64 v;
+
+	if (six_optimistic_spin(lock, type))
+		return;
+
+	lock_contended(&lock->dep_map, _RET_IP_);
+
+	INIT_LIST_HEAD(&wait.list);
+	wait.task = current;
+
+	while (1) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		if (type == SIX_LOCK_write)
+			EBUG_ON(lock->owner != current);
+		else if (list_empty_careful(&wait.list)) {
+			raw_spin_lock(&lock->wait_lock);
+			list_add_tail(&wait.list, &lock->wait_list[type]);
+			raw_spin_unlock(&lock->wait_lock);
+		}
+
+		v = READ_ONCE(lock->state.v);
+		do {
+			new.v = old.v = v;
+
+			if (!(old.v & l[type].lock_fail))
+				new.v += l[type].lock_val;
+			else if (!(new.waiters & (1 << type)))
+				new.waiters |= 1 << type;
+			else
+				break; /* waiting bit already set */
+		} while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
+					old.v, new.v)) != old.v);
+
+		if (!(old.v & l[type].lock_fail))
+			break;
+
+		schedule();
+	}
+
+	six_set_owner(lock, type, old);
+
+	__set_current_state(TASK_RUNNING);
+
+	if (!list_empty_careful(&wait.list)) {
+		raw_spin_lock(&lock->wait_lock);
+		list_del_init(&wait.list);
+		raw_spin_unlock(&lock->wait_lock);
+	}
+}
+
+__always_inline
+static void __six_lock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	six_acquire(&lock->dep_map, 0);
+
+	if (!do_six_trylock_type(lock, type))
+		__six_lock_type_slowpath(lock, type);
+
+	lock_acquired(&lock->dep_map, _RET_IP_);
+}
+
+static inline void six_lock_wakeup(struct six_lock *lock,
+				   union six_lock_state state,
+				   unsigned waitlist_id)
+{
+	struct list_head *wait_list = &lock->wait_list[waitlist_id];
+	struct six_lock_waiter *w, *next;
+
+	if (waitlist_id == SIX_LOCK_write && state.read_lock)
+		return;
+
+	if (!(state.waiters & (1 << waitlist_id)))
+		return;
+
+	clear_bit(waitlist_bitnr(waitlist_id),
+		  (unsigned long *) &lock->state.v);
+
+	if (waitlist_id == SIX_LOCK_write) {
+		struct task_struct *p = READ_ONCE(lock->owner);
+
+		if (p)
+			wake_up_process(p);
+		return;
+	}
+
+	raw_spin_lock(&lock->wait_lock);
+
+	list_for_each_entry_safe(w, next, wait_list, list) {
+		list_del_init(&w->list);
+
+		if (wake_up_process(w->task) &&
+		    waitlist_id != SIX_LOCK_read) {
+			if (!list_empty(wait_list))
+				set_bit(waitlist_bitnr(waitlist_id),
+					(unsigned long *) &lock->state.v);
+			break;
+		}
+	}
+
+	raw_spin_unlock(&lock->wait_lock);
+}
+
+__always_inline __flatten
+static void __six_unlock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	const struct six_lock_vals l[] = LOCK_VALS;
+	union six_lock_state state;
+
+	EBUG_ON(!(lock->state.v & l[type].held_mask));
+	EBUG_ON(type == SIX_LOCK_write &&
+		!(lock->state.v & __SIX_LOCK_HELD_intent));
+
+	six_release(&lock->dep_map);
+
+	if (type == SIX_LOCK_intent) {
+		EBUG_ON(lock->owner != current);
+
+		if (lock->intent_lock_recurse) {
+			--lock->intent_lock_recurse;
+			return;
+		}
+
+		lock->owner = NULL;
+	}
+
+	state.v = atomic64_add_return_release(l[type].unlock_val,
+					      &lock->state.counter);
+	six_lock_wakeup(lock, state, l[type].unlock_wakeup);
+}
+
+#define __SIX_LOCK(type)						\
+bool six_trylock_##type(struct six_lock *lock)				\
+{									\
+	return __six_trylock_type(lock, SIX_LOCK_##type);		\
+}									\
+EXPORT_SYMBOL_GPL(six_trylock_##type);					\
+									\
+bool six_relock_##type(struct six_lock *lock, u32 seq)			\
+{									\
+	return __six_relock_type(lock, SIX_LOCK_##type, seq);		\
+}									\
+EXPORT_SYMBOL_GPL(six_relock_##type);					\
+									\
+void six_lock_##type(struct six_lock *lock)				\
+{									\
+	__six_lock_type(lock, SIX_LOCK_##type);				\
+}									\
+EXPORT_SYMBOL_GPL(six_lock_##type);					\
+									\
+void six_unlock_##type(struct six_lock *lock)				\
+{									\
+	__six_unlock_type(lock, SIX_LOCK_##type);			\
+}									\
+EXPORT_SYMBOL_GPL(six_unlock_##type);
+
+__SIX_LOCK(read)
+__SIX_LOCK(intent)
+__SIX_LOCK(write)
+
+#undef __SIX_LOCK
+
+/* Convert from intent to read: */
+void six_lock_downgrade(struct six_lock *lock)
+{
+	six_lock_increment(lock, SIX_LOCK_read);
+	six_unlock_intent(lock);
+}
+EXPORT_SYMBOL_GPL(six_lock_downgrade);
+
+bool six_lock_tryupgrade(struct six_lock *lock)
+{
+	const struct six_lock_vals l[] = LOCK_VALS;
+	union six_lock_state old, new;
+	u64 v = READ_ONCE(lock->state.v);
+
+	do {
+		new.v = old.v = v;
+
+		EBUG_ON(!(old.v & l[SIX_LOCK_read].held_mask));
+
+		new.v += l[SIX_LOCK_read].unlock_val;
+
+		if (new.v & l[SIX_LOCK_intent].lock_fail)
+			return false;
+
+		new.v += l[SIX_LOCK_intent].lock_val;
+	} while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
+				old.v, new.v)) != old.v);
+
+	six_set_owner(lock, SIX_LOCK_intent, old);
+	six_lock_wakeup(lock, new, l[SIX_LOCK_read].unlock_wakeup);
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(six_lock_tryupgrade);
+
+bool six_trylock_convert(struct six_lock *lock,
+			 enum six_lock_type from,
+			 enum six_lock_type to)
+{
+	EBUG_ON(to == SIX_LOCK_write || from == SIX_LOCK_write);
+
+	if (to == from)
+		return true;
+
+	if (to == SIX_LOCK_read) {
+		six_lock_downgrade(lock);
+		return true;
+	} else {
+		return six_lock_tryupgrade(lock);
+	}
+}
+EXPORT_SYMBOL_GPL(six_trylock_convert);
+
+/*
+ * Increment read/intent lock count, assuming we already have it read or intent
+ * locked:
+ */
+void six_lock_increment(struct six_lock *lock, enum six_lock_type type)
+{
+	const struct six_lock_vals l[] = LOCK_VALS;
+
+	EBUG_ON(type == SIX_LOCK_write);
+	six_acquire(&lock->dep_map, 0);
+
+	/* XXX: assert already locked, and that we don't overflow: */
+
+	switch (type) {
+	case SIX_LOCK_read:
+		atomic64_add(l[type].lock_val, &lock->state.counter);
+		break;
+	case SIX_LOCK_intent:
+		lock->intent_lock_recurse++;
+		break;
+	case SIX_LOCK_write:
+		BUG();
+		break;
+	}
+}
+EXPORT_SYMBOL_GPL(six_lock_increment);
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 03/12] mm: pagecache add lock
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
  2019-06-10 19:14 ` [PATCH 01/12] Compiler Attributes: add __flatten Kent Overstreet
  2019-06-10 19:14 ` [PATCH 02/12] locking: SIX locks (shared/intent/exclusive) Kent Overstreet
@ 2019-06-10 19:14 ` Kent Overstreet
  2019-06-10 19:14 ` [PATCH 04/12] mm: export find_get_pages() Kent Overstreet
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Kent Overstreet @ 2019-06-10 19:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcache; +Cc: Kent Overstreet

Add a per-address_space lock around adding pages to the pagecache - making it
possible for fallocate INSERT_RANGE/COLLAPSE_RANGE to work correctly, and also
hopefully making truncate and dio a bit saner.
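
A sketch of how the blocking side is meant to be used (the caller below is
illustrative, not part of this patch): dio or truncate holds the lock in
"block" mode for the duration, and because pagecache_block_get() records the
lock in current->pagecache_lock, a page fault taken by the same task against
the same mapping skips pagecache_add_get() in __add_to_page_cache_locked()
instead of deadlocking against itself:

	pagecache_block_get(&mapping->add_lock);
	/* ...issue the dio / truncate pages - nothing can be added to the
	 * page cache until the lock is released... */
	pagecache_block_put(&mapping->add_lock);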

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 fs/inode.c            |  1 +
 include/linux/fs.h    | 24 +++++++++++++
 include/linux/sched.h |  4 +++
 init/init_task.c      |  1 +
 mm/filemap.c          | 81 +++++++++++++++++++++++++++++++++++++++++--
 5 files changed, 108 insertions(+), 3 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 9a453f3637..8881dc551f 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -350,6 +350,7 @@ EXPORT_SYMBOL(inc_nlink);
 static void __address_space_init_once(struct address_space *mapping)
 {
 	xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ);
+	pagecache_lock_init(&mapping->add_lock);
 	init_rwsem(&mapping->i_mmap_rwsem);
 	INIT_LIST_HEAD(&mapping->private_list);
 	spin_lock_init(&mapping->private_lock);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dd28e76790..a88d994751 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -418,6 +418,28 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
 				loff_t pos, unsigned len, unsigned copied,
 				struct page *page, void *fsdata);
 
+/*
+ * Two-state lock - can be taken for add or block - both states are shared,
+ * like read side of rwsem, but conflict with other state:
+ */
+struct pagecache_lock {
+	atomic_long_t		v;
+	wait_queue_head_t	wait;
+};
+
+static inline void pagecache_lock_init(struct pagecache_lock *lock)
+{
+	atomic_long_set(&lock->v, 0);
+	init_waitqueue_head(&lock->wait);
+}
+
+void pagecache_add_put(struct pagecache_lock *);
+void pagecache_add_get(struct pagecache_lock *);
+void __pagecache_block_put(struct pagecache_lock *);
+void __pagecache_block_get(struct pagecache_lock *);
+void pagecache_block_put(struct pagecache_lock *);
+void pagecache_block_get(struct pagecache_lock *);
+
 /**
  * struct address_space - Contents of a cacheable, mappable object.
  * @host: Owner, either the inode or the block_device.
@@ -452,6 +474,8 @@ struct address_space {
 	spinlock_t		private_lock;
 	struct list_head	private_list;
 	void			*private_data;
+	struct pagecache_lock	add_lock
+		____cacheline_aligned_in_smp;	/* protects adding new pages */
 } __attribute__((aligned(sizeof(long)))) __randomize_layout;
 	/*
 	 * On most architectures that alignment is already the case; but
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1549584a15..a46baade99 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -43,6 +43,7 @@ struct io_context;
 struct mempolicy;
 struct nameidata;
 struct nsproxy;
+struct pagecache_lock;
 struct perf_event_context;
 struct pid_namespace;
 struct pipe_inode_info;
@@ -935,6 +936,9 @@ struct task_struct {
 	unsigned int			in_ubsan;
 #endif
 
+	/* currently held lock, for avoiding recursing in fault path: */
+	struct pagecache_lock *pagecache_lock;
+
 	/* Journalling filesystem info: */
 	void				*journal_info;
 
diff --git a/init/init_task.c b/init/init_task.c
index c70ef656d0..92bbb6e909 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -115,6 +115,7 @@ struct task_struct init_task
 	},
 	.blocked	= {{0}},
 	.alloc_lock	= __SPIN_LOCK_UNLOCKED(init_task.alloc_lock),
+	.pagecache_lock = NULL,
 	.journal_info	= NULL,
 	INIT_CPU_TIMERS(init_task)
 	.pi_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
diff --git a/mm/filemap.c b/mm/filemap.c
index d78f577bae..93d7e0e686 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -113,6 +113,73 @@
  *   ->tasklist_lock            (memory_failure, collect_procs_ao)
  */
 
+static void __pagecache_lock_put(struct pagecache_lock *lock, long i)
+{
+	BUG_ON(atomic_long_read(&lock->v) == 0);
+
+	if (atomic_long_sub_return_release(i, &lock->v) == 0)
+		wake_up_all(&lock->wait);
+}
+
+static bool __pagecache_lock_tryget(struct pagecache_lock *lock, long i)
+{
+	long v = atomic_long_read(&lock->v), old;
+
+	do {
+		old = v;
+
+		if (i > 0 ? v < 0 : v > 0)
+			return false;
+	} while ((v = atomic_long_cmpxchg_acquire(&lock->v,
+					old, old + i)) != old);
+	return true;
+}
+
+static void __pagecache_lock_get(struct pagecache_lock *lock, long i)
+{
+	wait_event(lock->wait, __pagecache_lock_tryget(lock, i));
+}
+
+void pagecache_add_put(struct pagecache_lock *lock)
+{
+	__pagecache_lock_put(lock, 1);
+}
+EXPORT_SYMBOL(pagecache_add_put);
+
+void pagecache_add_get(struct pagecache_lock *lock)
+{
+	__pagecache_lock_get(lock, 1);
+}
+EXPORT_SYMBOL(pagecache_add_get);
+
+void __pagecache_block_put(struct pagecache_lock *lock)
+{
+	__pagecache_lock_put(lock, -1);
+}
+EXPORT_SYMBOL(__pagecache_block_put);
+
+void __pagecache_block_get(struct pagecache_lock *lock)
+{
+	__pagecache_lock_get(lock, -1);
+}
+EXPORT_SYMBOL(__pagecache_block_get);
+
+void pagecache_block_put(struct pagecache_lock *lock)
+{
+	BUG_ON(current->pagecache_lock != lock);
+	current->pagecache_lock = NULL;
+	__pagecache_lock_put(lock, -1);
+}
+EXPORT_SYMBOL(pagecache_block_put);
+
+void pagecache_block_get(struct pagecache_lock *lock)
+{
+	__pagecache_lock_get(lock, -1);
+	BUG_ON(current->pagecache_lock);
+	current->pagecache_lock = lock;
+}
+EXPORT_SYMBOL(pagecache_block_get);
+
 static void page_cache_delete(struct address_space *mapping,
 				   struct page *page, void *shadow)
 {
@@ -829,11 +896,14 @@ static int __add_to_page_cache_locked(struct page *page,
 	VM_BUG_ON_PAGE(PageSwapBacked(page), page);
 	mapping_set_update(&xas, mapping);
 
+	if (current->pagecache_lock != &mapping->add_lock)
+		pagecache_add_get(&mapping->add_lock);
+
 	if (!huge) {
 		error = mem_cgroup_try_charge(page, current->mm,
 					      gfp_mask, &memcg, false);
 		if (error)
-			return error;
+			goto out;
 	}
 
 	get_page(page);
@@ -869,14 +939,19 @@ static int __add_to_page_cache_locked(struct page *page,
 	if (!huge)
 		mem_cgroup_commit_charge(page, memcg, false, false);
 	trace_mm_filemap_add_to_page_cache(page);
-	return 0;
+	error = 0;
+out:
+	if (current->pagecache_lock != &mapping->add_lock)
+		pagecache_add_put(&mapping->add_lock);
+	return error;
 error:
 	page->mapping = NULL;
 	/* Leave page->index set: truncation relies upon it */
 	if (!huge)
 		mem_cgroup_cancel_charge(page, memcg, false);
 	put_page(page);
-	return xas_error(&xas);
+	error = xas_error(&xas);
+	goto out;
 }
 
 /**
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 04/12] mm: export find_get_pages()
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
                   ` (2 preceding siblings ...)
  2019-06-10 19:14 ` [PATCH 03/12] mm: pagecache add lock Kent Overstreet
@ 2019-06-10 19:14 ` Kent Overstreet
  2019-06-10 19:14 ` [PATCH 05/12] fs: insert_inode_locked2() Kent Overstreet
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Kent Overstreet @ 2019-06-10 19:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcache; +Cc: Kent Overstreet

Needed for bcachefs

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 mm/filemap.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 93d7e0e686..617168474e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1899,6 +1899,7 @@ unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start,
 
 	return ret;
 }
+EXPORT_SYMBOL(find_get_pages_range);
 
 /**
  * find_get_pages_contig - gang contiguous pagecache lookup
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 05/12] fs: insert_inode_locked2()
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
                   ` (3 preceding siblings ...)
  2019-06-10 19:14 ` [PATCH 04/12] mm: export find_get_pages() Kent Overstreet
@ 2019-06-10 19:14 ` Kent Overstreet
  2019-06-10 19:14 ` [PATCH 06/12] fs: factor out d_mark_tmpfile() Kent Overstreet
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Kent Overstreet @ 2019-06-10 19:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcache; +Cc: Kent Overstreet

New helper for bcachefs, so that when we race inserting an inode, we can
atomically grab a ref to the one already in the inode cache.
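
The intended calling convention, roughly (the caller code here is illustrative,
not part of this patch):

	struct inode *old = insert_inode_locked2(inode);

	if (old) {
		/* Raced: @old is the inode already in the icache and we
		 * hold a ref on it - discard our new inode, use @old. */
		iput(inode);
		return old;
	}
	/* No race: @inode is now hashed with I_NEW|I_CREATING set; finish
	 * initializing it, then unlock_new_inode(). */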

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 fs/inode.c         | 40 ++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h |  1 +
 2 files changed, 41 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index 8881dc551f..cc44f345e0 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1479,6 +1479,46 @@ int insert_inode_locked(struct inode *inode)
 }
 EXPORT_SYMBOL(insert_inode_locked);
 
+struct inode *insert_inode_locked2(struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+	ino_t ino = inode->i_ino;
+	struct hlist_head *head = inode_hashtable + hash(sb, ino);
+
+	while (1) {
+		struct inode *old = NULL;
+		spin_lock(&inode_hash_lock);
+		hlist_for_each_entry(old, head, i_hash) {
+			if (old->i_ino != ino)
+				continue;
+			if (old->i_sb != sb)
+				continue;
+			spin_lock(&old->i_lock);
+			if (old->i_state & (I_FREEING|I_WILL_FREE)) {
+				spin_unlock(&old->i_lock);
+				continue;
+			}
+			break;
+		}
+		if (likely(!old)) {
+			spin_lock(&inode->i_lock);
+			inode->i_state |= I_NEW | I_CREATING;
+			hlist_add_head(&inode->i_hash, head);
+			spin_unlock(&inode->i_lock);
+			spin_unlock(&inode_hash_lock);
+			return NULL;
+		}
+		__iget(old);
+		spin_unlock(&old->i_lock);
+		spin_unlock(&inode_hash_lock);
+		wait_on_inode(old);
+		if (unlikely(!inode_unhashed(old)))
+			return old;
+		iput(old);
+	}
+}
+EXPORT_SYMBOL(insert_inode_locked2);
+
 int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
 {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a88d994751..d5d12d6981 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3010,6 +3010,7 @@ extern struct inode *find_inode_nowait(struct super_block *,
 				       void *data);
 extern int insert_inode_locked4(struct inode *, unsigned long, int (*test)(struct inode *, void *), void *);
 extern int insert_inode_locked(struct inode *);
+extern struct inode *insert_inode_locked2(struct inode *);
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 extern void lockdep_annotate_inode_mutex_key(struct inode *inode);
 #else
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 06/12] fs: factor out d_mark_tmpfile()
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
                   ` (4 preceding siblings ...)
  2019-06-10 19:14 ` [PATCH 05/12] fs: insert_inode_locked2() Kent Overstreet
@ 2019-06-10 19:14 ` Kent Overstreet
  2019-06-10 19:14 ` [PATCH 07/12] Propagate gfp_t when allocating pte entries from __vmalloc Kent Overstreet
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Kent Overstreet @ 2019-06-10 19:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcache; +Cc: Kent Overstreet

New helper for bcachefs - bcachefs doesn't want the
inode_dec_link_count() call that d_tmpfile() does; it handles i_nlink on
its own, atomically with other btree updates.
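
From a caller's point of view the split looks like this (illustrative): a
filesystem that accounts i_nlink itself can do

	d_mark_tmpfile(dentry, inode);
	d_instantiate(dentry, inode);

while d_tmpfile() keeps its old behaviour, now implemented as
inode_dec_link_count() followed by the two calls above.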

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 fs/dcache.c            | 10 ++++++++--
 include/linux/dcache.h |  1 +
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index aac41adf47..18edb4e5bc 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3042,9 +3042,8 @@ void d_genocide(struct dentry *parent)
 
 EXPORT_SYMBOL(d_genocide);
 
-void d_tmpfile(struct dentry *dentry, struct inode *inode)
+void d_mark_tmpfile(struct dentry *dentry, struct inode *inode)
 {
-	inode_dec_link_count(inode);
 	BUG_ON(dentry->d_name.name != dentry->d_iname ||
 		!hlist_unhashed(&dentry->d_u.d_alias) ||
 		!d_unlinked(dentry));
@@ -3054,6 +3053,13 @@ void d_tmpfile(struct dentry *dentry, struct inode *inode)
 				(unsigned long long)inode->i_ino);
 	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dentry->d_parent->d_lock);
+}
+EXPORT_SYMBOL(d_mark_tmpfile);
+
+void d_tmpfile(struct dentry *dentry, struct inode *inode)
+{
+	inode_dec_link_count(inode);
+	d_mark_tmpfile(dentry, inode);
 	d_instantiate(dentry, inode);
 }
 EXPORT_SYMBOL(d_tmpfile);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 60996e64c5..e0fe330162 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -255,6 +255,7 @@ extern struct dentry * d_make_root(struct inode *);
 /* <clickety>-<click> the ramfs-type tree */
 extern void d_genocide(struct dentry *);
 
+extern void d_mark_tmpfile(struct dentry *, struct inode *);
 extern void d_tmpfile(struct dentry *, struct inode *);
 
 extern struct dentry *d_find_alias(struct inode *);
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 07/12] Propagate gfp_t when allocating pte entries from __vmalloc
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
                   ` (5 preceding siblings ...)
  2019-06-10 19:14 ` [PATCH 06/12] fs: factor out d_mark_tmpfile() Kent Overstreet
@ 2019-06-10 19:14 ` Kent Overstreet
  2019-06-10 19:14 ` [PATCH 08/12] block: Add some exports for bcachefs Kent Overstreet
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Kent Overstreet @ 2019-06-10 19:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcache; +Cc: Kent Overstreet

This fixes a lockdep recursion when using __vmalloc from places that
aren't GFP_KERNEL safe.
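
The practical upshot (the call below is just an illustration): a caller doing

	void *p = __vmalloc(size, GFP_NOFS|__GFP_ZERO, PAGE_KERNEL);

no longer gets its page table pages allocated with GFP_KERNEL behind its back -
the gfp_t is now threaded down through pte_alloc_one_kernel(), pmd_alloc_one()
and friends.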

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 arch/alpha/include/asm/pgalloc.h             | 11 ++---
 arch/arc/include/asm/pgalloc.h               |  9 +---
 arch/arm/include/asm/pgalloc.h               | 11 +++--
 arch/arm/mm/idmap.c                          |  2 +-
 arch/arm/mm/mmu.c                            |  5 +-
 arch/arm/mm/pgd.c                            |  8 +--
 arch/arm64/include/asm/pgalloc.h             | 17 ++++---
 arch/arm64/mm/hugetlbpage.c                  |  8 +--
 arch/csky/include/asm/pgalloc.h              |  4 +-
 arch/hexagon/include/asm/pgalloc.h           |  5 +-
 arch/ia64/include/asm/pgalloc.h              | 14 +++---
 arch/ia64/mm/hugetlbpage.c                   |  4 +-
 arch/ia64/mm/init.c                          |  6 +--
 arch/m68k/include/asm/mcf_pgalloc.h          | 12 ++---
 arch/m68k/include/asm/motorola_pgalloc.h     |  7 +--
 arch/m68k/include/asm/sun3_pgalloc.h         | 12 ++---
 arch/m68k/mm/kmap.c                          |  5 +-
 arch/m68k/sun3x/dvma.c                       |  6 ++-
 arch/microblaze/include/asm/pgalloc.h        |  6 +--
 arch/microblaze/mm/pgtable.c                 |  6 +--
 arch/mips/include/asm/pgalloc.h              | 14 +++---
 arch/mips/mm/hugetlbpage.c                   |  4 +-
 arch/mips/mm/ioremap.c                       |  6 +--
 arch/nds32/include/asm/pgalloc.h             | 14 ++----
 arch/nds32/kernel/dma.c                      |  4 +-
 arch/nios2/include/asm/pgalloc.h             |  8 +--
 arch/nios2/mm/ioremap.c                      |  6 +--
 arch/openrisc/include/asm/pgalloc.h          |  2 +-
 arch/openrisc/mm/ioremap.c                   |  4 +-
 arch/parisc/include/asm/pgalloc.h            | 16 +++---
 arch/parisc/kernel/pci-dma.c                 |  6 +--
 arch/parisc/mm/hugetlbpage.c                 |  4 +-
 arch/powerpc/include/asm/book3s/32/pgalloc.h |  4 +-
 arch/powerpc/include/asm/book3s/64/pgalloc.h | 20 ++++----
 arch/powerpc/include/asm/nohash/32/pgalloc.h |  6 +--
 arch/powerpc/include/asm/nohash/64/pgalloc.h | 14 +++---
 arch/powerpc/kvm/book3s_64_mmu_radix.c       |  2 +-
 arch/powerpc/mm/hugetlbpage.c                |  8 +--
 arch/powerpc/mm/pgtable-book3e.c             |  6 +--
 arch/powerpc/mm/pgtable-book3s64.c           | 14 +++---
 arch/powerpc/mm/pgtable-hash64.c             |  6 +--
 arch/powerpc/mm/pgtable-radix.c              | 12 ++---
 arch/powerpc/mm/pgtable_32.c                 |  6 +--
 arch/riscv/include/asm/pgalloc.h             | 11 ++---
 arch/s390/include/asm/pgalloc.h              | 25 +++++-----
 arch/s390/mm/hugetlbpage.c                   |  6 +--
 arch/s390/mm/pgalloc.c                       | 10 ++--
 arch/s390/mm/pgtable.c                       |  6 +--
 arch/s390/mm/vmem.c                          |  2 +-
 arch/sh/include/asm/pgalloc.h                |  7 +--
 arch/sh/mm/hugetlbpage.c                     |  4 +-
 arch/sh/mm/init.c                            |  4 +-
 arch/sh/mm/pgtable.c                         |  8 ++-
 arch/sparc/include/asm/pgalloc_32.h          |  6 +--
 arch/sparc/include/asm/pgalloc_64.h          | 12 +++--
 arch/sparc/mm/hugetlbpage.c                  |  4 +-
 arch/sparc/mm/init_64.c                      | 10 +---
 arch/sparc/mm/srmmu.c                        |  2 +-
 arch/um/include/asm/pgalloc.h                |  2 +-
 arch/um/include/asm/pgtable-3level.h         |  3 +-
 arch/um/kernel/mem.c                         | 17 ++-----
 arch/um/kernel/skas/mmu.c                    |  4 +-
 arch/unicore32/include/asm/pgalloc.h         |  8 ++-
 arch/unicore32/mm/pgd.c                      |  2 +-
 arch/x86/include/asm/pgalloc.h               | 30 ++++++------
 arch/x86/kernel/espfix_64.c                  |  2 +-
 arch/x86/kernel/tboot.c                      |  6 +--
 arch/x86/mm/pgtable.c                        |  4 +-
 arch/x86/platform/efi/efi_64.c               |  9 ++--
 arch/xtensa/include/asm/pgalloc.h            |  4 +-
 drivers/staging/media/ipu3/ipu3-dmamap.c     |  2 +-
 include/asm-generic/4level-fixup.h           |  6 +--
 include/asm-generic/5level-fixup.h           |  6 +--
 include/asm-generic/pgtable-nop4d-hack.h     |  2 +-
 include/asm-generic/pgtable-nop4d.h          |  2 +-
 include/asm-generic/pgtable-nopmd.h          |  2 +-
 include/asm-generic/pgtable-nopud.h          |  2 +-
 include/linux/mm.h                           | 40 ++++++++-------
 include/linux/vmalloc.h                      |  2 +-
 lib/ioremap.c                                |  8 +--
 mm/hugetlb.c                                 | 11 +++--
 mm/kasan/init.c                              |  8 +--
 mm/memory.c                                  | 51 +++++++++++---------
 mm/migrate.c                                 |  6 +--
 mm/mremap.c                                  |  6 +--
 mm/userfaultfd.c                             |  6 +--
 mm/vmalloc.c                                 | 49 +++++++++++--------
 mm/zsmalloc.c                                |  2 +-
 virt/kvm/arm/mmu.c                           |  6 +--
 89 files changed, 377 insertions(+), 392 deletions(-)

diff --git a/arch/alpha/include/asm/pgalloc.h b/arch/alpha/include/asm/pgalloc.h
index 02f9f91bb4..6b8336865e 100644
--- a/arch/alpha/include/asm/pgalloc.h
+++ b/arch/alpha/include/asm/pgalloc.h
@@ -39,9 +39,9 @@ pgd_free(struct mm_struct *mm, pgd_t *pgd)
 }
 
 static inline pmd_t *
-pmd_alloc_one(struct mm_struct *mm, unsigned long address)
+pmd_alloc_one(struct mm_struct *mm, unsigned long address, gfp_t gfp)
 {
-	pmd_t *ret = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
+	pmd_t *ret = (pmd_t *)get_zeroed_page(gfp);
 	return ret;
 }
 
@@ -52,10 +52,9 @@ pmd_free(struct mm_struct *mm, pmd_t *pmd)
 }
 
 static inline pte_t *
-pte_alloc_one_kernel(struct mm_struct *mm)
+pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
-	return pte;
+	return (pte_t *)get_zeroed_page(gfp);
 }
 
 static inline void
@@ -67,7 +66,7 @@ pte_free_kernel(struct mm_struct *mm, pte_t *pte)
 static inline pgtable_t
 pte_alloc_one(struct mm_struct *mm)
 {
-	pte_t *pte = pte_alloc_one_kernel(mm);
+	pte_t *pte = pte_alloc_one_kernel(mm, GFP_KERNEL);
 	struct page *page;
 
 	if (!pte)
diff --git a/arch/arc/include/asm/pgalloc.h b/arch/arc/include/asm/pgalloc.h
index 9c9b5a5ebf..491535bb2b 100644
--- a/arch/arc/include/asm/pgalloc.h
+++ b/arch/arc/include/asm/pgalloc.h
@@ -90,14 +90,9 @@ static inline int __get_order_pte(void)
 	return get_order(PTRS_PER_PTE * sizeof(pte_t));
 }
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	pte_t *pte;
-
-	pte = (pte_t *) __get_free_pages(GFP_KERNEL | __GFP_ZERO,
-					 __get_order_pte());
-
-	return pte;
+	return (pte_t *) __get_free_pages(gfp|__GFP_ZERO, __get_order_pte());
 }
 
 static inline pgtable_t
diff --git a/arch/arm/include/asm/pgalloc.h b/arch/arm/include/asm/pgalloc.h
index 17ab72f0cc..f21ba862f6 100644
--- a/arch/arm/include/asm/pgalloc.h
+++ b/arch/arm/include/asm/pgalloc.h
@@ -27,9 +27,10 @@
 
 #ifdef CONFIG_ARM_LPAE
 
-static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
-	return (pmd_t *)get_zeroed_page(GFP_KERNEL);
+	return (pmd_t *)get_zeroed_page(gfp);
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
@@ -48,7 +49,7 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
 /*
  * Since we have only two-level page tables, these are trivial
  */
-#define pmd_alloc_one(mm,addr)		({ BUG(); ((pmd_t *)2); })
+#define pmd_alloc_one(mm,addr,gfp)	({ BUG(); ((pmd_t *)2); })
 #define pmd_free(mm, pmd)		do { } while (0)
 #define pud_populate(mm,pmd,pte)	BUG()
 
@@ -81,11 +82,11 @@ static inline void clean_pte_table(pte_t *pte)
  *  +------------+
  */
 static inline pte_t *
-pte_alloc_one_kernel(struct mm_struct *mm)
+pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
 	pte_t *pte;
 
-	pte = (pte_t *)__get_free_page(PGALLOC_GFP);
+	pte = (pte_t *)get_zeroed_page(gfp);
 	if (pte)
 		clean_pte_table(pte);
 
diff --git a/arch/arm/mm/idmap.c b/arch/arm/mm/idmap.c
index a033f6134a..b90d2deedc 100644
--- a/arch/arm/mm/idmap.c
+++ b/arch/arm/mm/idmap.c
@@ -28,7 +28,7 @@ static void idmap_add_pmd(pud_t *pud, unsigned long addr, unsigned long end,
 	unsigned long next;
 
 	if (pud_none_or_clear_bad(pud) || (pud_val(*pud) & L_PGD_SWAPPER)) {
-		pmd = pmd_alloc_one(&init_mm, addr);
+		pmd = pmd_alloc_one(&init_mm, addr, GFP_KERNEL);
 		if (!pmd) {
 			pr_warn("Failed to allocate identity pmd.\n");
 			return;
diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index f3ce34113f..7cc18e5174 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -979,10 +979,11 @@ void __init create_mapping_late(struct mm_struct *mm, struct map_desc *md,
 				bool ng)
 {
 #ifdef CONFIG_ARM_LPAE
-	pud_t *pud = pud_alloc(mm, pgd_offset(mm, md->virtual), md->virtual);
+	pud_t *pud = pud_alloc(mm, pgd_offset(mm, md->virtual), md->virtual,
+			       GFP_KERNEL);
 	if (WARN_ON(!pud))
 		return;
-	pmd_alloc(mm, pud, 0);
+	pmd_alloc(mm, pud, 0, GFP_KERNEL);
 #endif
 	__create_mapping(mm, md, late_alloc, ng);
 }
diff --git a/arch/arm/mm/pgd.c b/arch/arm/mm/pgd.c
index a1606d9502..6c3a640672 100644
--- a/arch/arm/mm/pgd.c
+++ b/arch/arm/mm/pgd.c
@@ -57,11 +57,11 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 	 * Allocate PMD table for modules and pkmap mappings.
 	 */
 	new_pud = pud_alloc(mm, new_pgd + pgd_index(MODULES_VADDR),
-			    MODULES_VADDR);
+			    MODULES_VADDR, GFP_KERNEL);
 	if (!new_pud)
 		goto no_pud;
 
-	new_pmd = pmd_alloc(mm, new_pud, 0);
+	new_pmd = pmd_alloc(mm, new_pud, 0, GFP_KERNEL);
 	if (!new_pmd)
 		goto no_pmd;
 #endif
@@ -72,11 +72,11 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 		 * contains the machine vectors. The vectors are always high
 		 * with LPAE.
 		 */
-		new_pud = pud_alloc(mm, new_pgd, 0);
+		new_pud = pud_alloc(mm, new_pgd, 0, GFP_KERNEL);
 		if (!new_pud)
 			goto no_pud;
 
-		new_pmd = pmd_alloc(mm, new_pud, 0);
+		new_pmd = pmd_alloc(mm, new_pud, 0, GFP_KERNEL);
 		if (!new_pmd)
 			goto no_pmd;
 
diff --git a/arch/arm64/include/asm/pgalloc.h b/arch/arm64/include/asm/pgalloc.h
index 52fa47c73b..54199d52ea 100644
--- a/arch/arm64/include/asm/pgalloc.h
+++ b/arch/arm64/include/asm/pgalloc.h
@@ -26,14 +26,14 @@
 
 #define check_pgt_cache()		do { } while (0)
 
-#define PGALLOC_GFP	(GFP_KERNEL | __GFP_ZERO)
 #define PGD_SIZE	(PTRS_PER_PGD * sizeof(pgd_t))
 
 #if CONFIG_PGTABLE_LEVELS > 2
 
-static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
-	return (pmd_t *)__get_free_page(PGALLOC_GFP);
+	return (pmd_t *)get_zeroed_page(gfp);
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmdp)
@@ -60,9 +60,10 @@ static inline void __pud_populate(pud_t *pudp, phys_addr_t pmdp, pudval_t prot)
 
 #if CONFIG_PGTABLE_LEVELS > 3
 
-static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
-	return (pud_t *)__get_free_page(PGALLOC_GFP);
+	return (pud_t *)get_zeroed_page(gfp);
 }
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pudp)
@@ -91,9 +92,9 @@ extern pgd_t *pgd_alloc(struct mm_struct *mm);
 extern void pgd_free(struct mm_struct *mm, pgd_t *pgdp);
 
 static inline pte_t *
-pte_alloc_one_kernel(struct mm_struct *mm)
+pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	return (pte_t *)__get_free_page(PGALLOC_GFP);
+	return (pte_t *)get_zeroed_page(gfp);
 }
 
 static inline pgtable_t
@@ -101,7 +102,7 @@ pte_alloc_one(struct mm_struct *mm)
 {
 	struct page *pte;
 
-	pte = alloc_pages(PGALLOC_GFP, 0);
+	pte = alloc_pages(GFP_KERNEL|__GFP_ZERO, 0);
 	if (!pte)
 		return NULL;
 	if (!pgtable_page_ctor(pte)) {
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 6b4a47b3ad..0a17776894 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -230,14 +230,14 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 	pte_t *ptep = NULL;
 
 	pgdp = pgd_offset(mm, addr);
-	pudp = pud_alloc(mm, pgdp, addr);
+	pudp = pud_alloc(mm, pgdp, addr, GFP_KERNEL);
 	if (!pudp)
 		return NULL;
 
 	if (sz == PUD_SIZE) {
 		ptep = (pte_t *)pudp;
 	} else if (sz == (PAGE_SIZE * CONT_PTES)) {
-		pmdp = pmd_alloc(mm, pudp, addr);
+		pmdp = pmd_alloc(mm, pudp, addr, GFP_KERNEL);
 
 		WARN_ON(addr & (sz - 1));
 		/*
@@ -253,9 +253,9 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 		    pud_none(READ_ONCE(*pudp)))
 			ptep = huge_pmd_share(mm, addr, pudp);
 		else
-			ptep = (pte_t *)pmd_alloc(mm, pudp, addr);
+			ptep = (pte_t *)pmd_alloc(mm, pudp, addr, GFP_KERNEL);
 	} else if (sz == (PMD_SIZE * CONT_PMDS)) {
-		pmdp = pmd_alloc(mm, pudp, addr);
+		pmdp = pmd_alloc(mm, pudp, addr, GFP_KERNEL);
 		WARN_ON(addr & (sz - 1));
 		return (pte_t *)pmdp;
 	}
diff --git a/arch/csky/include/asm/pgalloc.h b/arch/csky/include/asm/pgalloc.h
index d213bb47b7..1611a84be5 100644
--- a/arch/csky/include/asm/pgalloc.h
+++ b/arch/csky/include/asm/pgalloc.h
@@ -24,12 +24,12 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 
 extern void pgd_init(unsigned long *p);
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
 	pte_t *pte;
 	unsigned long i;
 
-	pte = (pte_t *) __get_free_page(GFP_KERNEL);
+	pte = (pte_t *) __get_free_page(gfp);
 	if (!pte)
 		return NULL;
 
diff --git a/arch/hexagon/include/asm/pgalloc.h b/arch/hexagon/include/asm/pgalloc.h
index d36183887b..2c42f912f4 100644
--- a/arch/hexagon/include/asm/pgalloc.h
+++ b/arch/hexagon/include/asm/pgalloc.h
@@ -74,10 +74,9 @@ static inline struct page *pte_alloc_one(struct mm_struct *mm)
 }
 
 /* _kernel variant gets to use a different allocator */
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	gfp_t flags =  GFP_KERNEL | __GFP_ZERO;
-	return (pte_t *) __get_free_page(flags);
+	return (pte_t *) get_zeroed_page(gfp);
 }
 
 static inline void pte_free(struct mm_struct *mm, struct page *pte)
diff --git a/arch/ia64/include/asm/pgalloc.h b/arch/ia64/include/asm/pgalloc.h
index c9e481023c..dd99d58a89 100644
--- a/arch/ia64/include/asm/pgalloc.h
+++ b/arch/ia64/include/asm/pgalloc.h
@@ -40,9 +40,10 @@ pgd_populate(struct mm_struct *mm, pgd_t * pgd_entry, pud_t * pud)
 	pgd_val(*pgd_entry) = __pa(pud);
 }
 
-static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
-	return quicklist_alloc(0, GFP_KERNEL, NULL);
+	return quicklist_alloc(0, gfp, NULL);
 }
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
@@ -58,9 +59,10 @@ pud_populate(struct mm_struct *mm, pud_t * pud_entry, pmd_t * pmd)
 	pud_val(*pud_entry) = __pa(pmd);
 }
 
-static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
-	return quicklist_alloc(0, GFP_KERNEL, NULL);
+	return quicklist_alloc(0, gfp, NULL);
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
@@ -99,9 +101,9 @@ static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
 	return page;
 }
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	return quicklist_alloc(0, GFP_KERNEL, NULL);
+	return quicklist_alloc(0, gfp, NULL);
 }
 
 static inline void pte_free(struct mm_struct *mm, pgtable_t pte)
diff --git a/arch/ia64/mm/hugetlbpage.c b/arch/ia64/mm/hugetlbpage.c
index d16e419fd7..01e08edc9d 100644
--- a/arch/ia64/mm/hugetlbpage.c
+++ b/arch/ia64/mm/hugetlbpage.c
@@ -35,9 +35,9 @@ huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
 	pte_t *pte = NULL;
 
 	pgd = pgd_offset(mm, taddr);
-	pud = pud_alloc(mm, pgd, taddr);
+	pud = pud_alloc(mm, pgd, taddr, GFP_KERNEL);
 	if (pud) {
-		pmd = pmd_alloc(mm, pud, taddr);
+		pmd = pmd_alloc(mm, pud, taddr, GFP_KERNEL);
 		if (pmd)
 			pte = pte_alloc_map(mm, pmd, taddr);
 	}
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index e49200e317..a420c0d04f 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -216,13 +216,13 @@ put_kernel_page (struct page *page, unsigned long address, pgprot_t pgprot)
 	pgd = pgd_offset_k(address);		/* note: this is NOT pgd_offset()! */
 
 	{
-		pud = pud_alloc(&init_mm, pgd, address);
+		pud = pud_alloc(&init_mm, pgd, address, GFP_KERNEL);
 		if (!pud)
 			goto out;
-		pmd = pmd_alloc(&init_mm, pud, address);
+		pmd = pmd_alloc(&init_mm, pud, address, GFP_KERNEL);
 		if (!pmd)
 			goto out;
-		pte = pte_alloc_kernel(pmd, address);
+		pte = pte_alloc_kernel(pmd, address, GFP_KERNEL);
 		if (!pte)
 			goto out;
 		if (!pte_none(*pte))
diff --git a/arch/m68k/include/asm/mcf_pgalloc.h b/arch/m68k/include/asm/mcf_pgalloc.h
index 4399d712f6..95384360cf 100644
--- a/arch/m68k/include/asm/mcf_pgalloc.h
+++ b/arch/m68k/include/asm/mcf_pgalloc.h
@@ -12,15 +12,9 @@ extern inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
 
 extern const char bad_pmd_string[];
 
-extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	unsigned long page = __get_free_page(GFP_DMA);
-
-	if (!page)
-		return NULL;
-
-	memset((void *)page, 0, PAGE_SIZE);
-	return (pte_t *) (page);
+	return (pte_t *) get_zeroed_page(gfp | GFP_DMA);
 }
 
 extern inline pmd_t *pmd_alloc_kernel(pgd_t *pgd, unsigned long address)
@@ -29,7 +23,7 @@ extern inline pmd_t *pmd_alloc_kernel(pgd_t *pgd, unsigned long address)
 }
 
 #define pmd_alloc_one_fast(mm, address) ({ BUG(); ((pmd_t *)1); })
-#define pmd_alloc_one(mm, address)      ({ BUG(); ((pmd_t *)2); })
+#define pmd_alloc_one(mm, address, gfp) ({ BUG(); ((pmd_t *)2); })
 
 #define pmd_populate(mm, pmd, page) (pmd_val(*pmd) = \
 	(unsigned long)(page_address(page)))
diff --git a/arch/m68k/include/asm/motorola_pgalloc.h b/arch/m68k/include/asm/motorola_pgalloc.h
index d04d9ba9b9..e9b598f96b 100644
--- a/arch/m68k/include/asm/motorola_pgalloc.h
+++ b/arch/m68k/include/asm/motorola_pgalloc.h
@@ -8,11 +8,11 @@
 extern pmd_t *get_pointer_table(void);
 extern int free_pointer_table(pmd_t *);
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
 	pte_t *pte;
 
-	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
+	pte = (pte_t *)get_zeroed_page(gfp);
 	if (pte) {
 		__flush_page_to_ram(pte);
 		flush_tlb_kernel_page(pte);
@@ -67,7 +67,8 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t page,
 }
 
 
-static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
+static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address,
+				   gfp_t gfp)
 {
 	return get_pointer_table();
 }
diff --git a/arch/m68k/include/asm/sun3_pgalloc.h b/arch/m68k/include/asm/sun3_pgalloc.h
index 1456c5eecb..18324d4a33 100644
--- a/arch/m68k/include/asm/sun3_pgalloc.h
+++ b/arch/m68k/include/asm/sun3_pgalloc.h
@@ -15,7 +15,7 @@
 
 extern const char bad_pmd_string[];
 
-#define pmd_alloc_one(mm,address)       ({ BUG(); ((pmd_t *)2); })
+#define pmd_alloc_one(mm,address,gfp)       ({ BUG(); ((pmd_t *)2); })
 
 
 static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
@@ -35,15 +35,9 @@ do {							\
 	tlb_remove_page((tlb), pte);			\
 } while (0)
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	unsigned long page = __get_free_page(GFP_KERNEL);
-
-	if (!page)
-		return NULL;
-
-	memset((void *)page, 0, PAGE_SIZE);
-	return (pte_t *) (page);
+	return (pte_t *) get_zeroed_page(gfp);
 }
 
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
diff --git a/arch/m68k/mm/kmap.c b/arch/m68k/mm/kmap.c
index 40a3b327da..8de716049a 100644
--- a/arch/m68k/mm/kmap.c
+++ b/arch/m68k/mm/kmap.c
@@ -196,7 +196,7 @@ void __iomem *__ioremap(unsigned long physaddr, unsigned long size, int cachefla
 			printk ("\npa=%#lx va=%#lx ", physaddr, virtaddr);
 #endif
 		pgd_dir = pgd_offset_k(virtaddr);
-		pmd_dir = pmd_alloc(&init_mm, pgd_dir, virtaddr);
+		pmd_dir = pmd_alloc(&init_mm, pgd_dir, virtaddr, GFP_KERNEL);
 		if (!pmd_dir) {
 			printk("ioremap: no mem for pmd_dir\n");
 			return NULL;
@@ -208,7 +208,8 @@ void __iomem *__ioremap(unsigned long physaddr, unsigned long size, int cachefla
 			virtaddr += PTRTREESIZE;
 			size -= PTRTREESIZE;
 		} else {
-			pte_dir = pte_alloc_kernel(pmd_dir, virtaddr);
+			pte_dir = pte_alloc_kernel(pmd_dir, virtaddr,
+						   GFP_KERNEL);
 			if (!pte_dir) {
 				printk("ioremap: no mem for pte_dir\n");
 				return NULL;
diff --git a/arch/m68k/sun3x/dvma.c b/arch/m68k/sun3x/dvma.c
index 89e630e665..86ffbe2785 100644
--- a/arch/m68k/sun3x/dvma.c
+++ b/arch/m68k/sun3x/dvma.c
@@ -95,7 +95,8 @@ inline int dvma_map_cpu(unsigned long kaddr,
 		pmd_t *pmd;
 		unsigned long end2;
 
-		if((pmd = pmd_alloc(&init_mm, pgd, vaddr)) == NULL) {
+		pmd = pmd_alloc(&init_mm, pgd, vaddr, GFP_KERNEL);
+		if (!pmd) {
 			ret = -ENOMEM;
 			goto out;
 		}
@@ -109,7 +110,8 @@ inline int dvma_map_cpu(unsigned long kaddr,
 			pte_t *pte;
 			unsigned long end3;
 
-			if((pte = pte_alloc_kernel(pmd, vaddr)) == NULL) {
+			pte = pte_alloc_kernel(pmd, vaddr, GFP_KERNEL);
+			if (!pte) {
 				ret = -ENOMEM;
 				goto out;
 			}
diff --git a/arch/microblaze/include/asm/pgalloc.h b/arch/microblaze/include/asm/pgalloc.h
index f4cc9ffc44..240e0bcd14 100644
--- a/arch/microblaze/include/asm/pgalloc.h
+++ b/arch/microblaze/include/asm/pgalloc.h
@@ -106,9 +106,9 @@ static inline void free_pgd_slow(pgd_t *pgd)
  * the pgd will always be present..
  */
 #define pmd_alloc_one_fast(mm, address)	({ BUG(); ((pmd_t *)1); })
-#define pmd_alloc_one(mm, address)	({ BUG(); ((pmd_t *)2); })
+#define pmd_alloc_one(mm, address, gfp)	({ BUG(); ((pmd_t *)2); })
 
-extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm);
+extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp);
 
 static inline struct page *pte_alloc_one(struct mm_struct *mm)
 {
@@ -166,7 +166,7 @@ static inline void pte_free(struct mm_struct *mm, struct page *ptepage)
  * We don't have any real pmd's, and this code never triggers because
  * the pgd will always be present..
  */
-#define pmd_alloc_one(mm, address)	({ BUG(); ((pmd_t *)2); })
+#define pmd_alloc_one(mm, address, gfp)	({ BUG(); ((pmd_t *)2); })
 #define pmd_free(mm, x)			do { } while (0)
 #define __pmd_free_tlb(tlb, x, addr)	pmd_free((tlb)->mm, x)
 #define pgd_populate(mm, pmd, pte)	BUG()
diff --git a/arch/microblaze/mm/pgtable.c b/arch/microblaze/mm/pgtable.c
index c2ce1e42b8..796c422af7 100644
--- a/arch/microblaze/mm/pgtable.c
+++ b/arch/microblaze/mm/pgtable.c
@@ -144,7 +144,7 @@ int map_page(unsigned long va, phys_addr_t pa, int flags)
 	/* Use upper 10 bits of VA to index the first level map */
 	pd = pmd_offset(pgd_offset_k(va), va);
 	/* Use middle 10 bits of VA to index the second-level map */
-	pg = pte_alloc_kernel(pd, va); /* from powerpc - pgtable.c */
+	pg = pte_alloc_kernel(pd, va, GFP_KERNEL); /* from powerpc - pgtable.c */
 	/* pg = pte_alloc_kernel(&init_mm, pd, va); */
 
 	if (pg != NULL) {
@@ -235,11 +235,11 @@ unsigned long iopa(unsigned long addr)
 	return pa;
 }
 
-__ref pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+__ref pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
 	pte_t *pte;
 	if (mem_init_done) {
-		pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+		pte = (pte_t *)get_zeroed_page(gfp);
 	} else {
 		pte = (pte_t *)early_get_page();
 		if (pte)
diff --git a/arch/mips/include/asm/pgalloc.h b/arch/mips/include/asm/pgalloc.h
index 27808d9461..7e832f978a 100644
--- a/arch/mips/include/asm/pgalloc.h
+++ b/arch/mips/include/asm/pgalloc.h
@@ -50,9 +50,9 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 	free_pages((unsigned long)pgd, PGD_ORDER);
 }
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	return (pte_t *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, PTE_ORDER);
+	return (pte_t *)__get_free_pages(gfp | __GFP_ZERO, PTE_ORDER);
 }
 
 static inline struct page *pte_alloc_one(struct mm_struct *mm)
@@ -89,11 +89,12 @@ do {							\
 
 #ifndef __PAGETABLE_PMD_FOLDED
 
-static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
+static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address,
+				   gfp_t gfp)
 {
 	pmd_t *pmd;
 
-	pmd = (pmd_t *) __get_free_pages(GFP_KERNEL, PMD_ORDER);
+	pmd = (pmd_t *) __get_free_pages(gfp, PMD_ORDER);
 	if (pmd)
 		pmd_init((unsigned long)pmd, (unsigned long)invalid_pte_table);
 	return pmd;
@@ -110,11 +111,12 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 
 #ifndef __PAGETABLE_PUD_FOLDED
 
-static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long address)
+static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long address,
+				   gfp_t gfp)
 {
 	pud_t *pud;
 
-	pud = (pud_t *) __get_free_pages(GFP_KERNEL, PUD_ORDER);
+	pud = (pud_t *) __get_free_pages(gfp, PUD_ORDER);
 	if (pud)
 		pud_init((unsigned long)pud, (unsigned long)invalid_pmd_table);
 	return pud;
diff --git a/arch/mips/mm/hugetlbpage.c b/arch/mips/mm/hugetlbpage.c
index cef1522343..27843e10f6 100644
--- a/arch/mips/mm/hugetlbpage.c
+++ b/arch/mips/mm/hugetlbpage.c
@@ -29,9 +29,9 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr,
 	pte_t *pte = NULL;
 
 	pgd = pgd_offset(mm, addr);
-	pud = pud_alloc(mm, pgd, addr);
+	pud = pud_alloc(mm, pgd, addr, GFP_KERNEL);
 	if (pud)
-		pte = (pte_t *)pmd_alloc(mm, pud, addr);
+		pte = (pte_t *)pmd_alloc(mm, pud, addr, GFP_KERNEL);
 
 	return pte;
 }
diff --git a/arch/mips/mm/ioremap.c b/arch/mips/mm/ioremap.c
index 1601d90b08..40da8f0ba7 100644
--- a/arch/mips/mm/ioremap.c
+++ b/arch/mips/mm/ioremap.c
@@ -56,7 +56,7 @@ static inline int remap_area_pmd(pmd_t * pmd, unsigned long address,
 	phys_addr -= address;
 	BUG_ON(address >= end);
 	do {
-		pte_t * pte = pte_alloc_kernel(pmd, address);
+		pte_t *pte = pte_alloc_kernel(pmd, address, GFP_KERNEL);
 		if (!pte)
 			return -ENOMEM;
 		remap_area_pte(pte, address, end - address, address + phys_addr, flags);
@@ -82,10 +82,10 @@ static int remap_area_pages(unsigned long address, phys_addr_t phys_addr,
 		pmd_t *pmd;
 
 		error = -ENOMEM;
-		pud = pud_alloc(&init_mm, dir, address);
+		pud = pud_alloc(&init_mm, dir, address, GFP_KERNEL);
 		if (!pud)
 			break;
-		pmd = pmd_alloc(&init_mm, pud, address);
+		pmd = pmd_alloc(&init_mm, pud, address, GFP_KERNEL);
 		if (!pmd)
 			break;
 		if (remap_area_pmd(pmd, address, end - address,
diff --git a/arch/nds32/include/asm/pgalloc.h b/arch/nds32/include/asm/pgalloc.h
index 3c5fee5b57..b187a2f127 100644
--- a/arch/nds32/include/asm/pgalloc.h
+++ b/arch/nds32/include/asm/pgalloc.h
@@ -12,8 +12,8 @@
 /*
  * Since we have only two-level page tables, these are trivial
  */
-#define pmd_alloc_one(mm, addr)		({ BUG(); ((pmd_t *)2); })
-#define pmd_free(mm, pmd)			do { } while (0)
+#define pmd_alloc_one(mm, addr, gfp)	({ BUG(); ((pmd_t *)2); })
+#define pmd_free(mm, pmd)		do { } while (0)
 #define pgd_populate(mm, pmd, pte)	BUG()
 #define pmd_pgtable(pmd) pmd_page(pmd)
 
@@ -22,15 +22,9 @@ extern void pgd_free(struct mm_struct *mm, pgd_t * pgd);
 
 #define check_pgt_cache()		do { } while (0)
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	pte_t *pte;
-
-	pte =
-	    (pte_t *) __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL |
-				      __GFP_ZERO);
-
-	return pte;
+	return (pte_t *) get_zeroed_page(gfp | __GFP_RETRY_MAYFAIL);
 }
 
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
diff --git a/arch/nds32/kernel/dma.c b/arch/nds32/kernel/dma.c
index d0dbd4fe96..920a003762 100644
--- a/arch/nds32/kernel/dma.c
+++ b/arch/nds32/kernel/dma.c
@@ -300,7 +300,7 @@ static int __init consistent_init(void)
 
 	do {
 		pgd = pgd_offset(&init_mm, CONSISTENT_BASE);
-		pmd = pmd_alloc(&init_mm, pgd, CONSISTENT_BASE);
+		pmd = pmd_alloc(&init_mm, pgd, CONSISTENT_BASE, GFP_KERNEL);
 		if (!pmd) {
 			pr_err("%s: no pmd tables\n", __func__);
 			ret = -ENOMEM;
@@ -310,7 +310,7 @@ static int __init consistent_init(void)
 		 * It's not necessary to warn here. */
 		/* WARN_ON(!pmd_none(*pmd)); */
 
-		pte = pte_alloc_kernel(pmd, CONSISTENT_BASE);
+		pte = pte_alloc_kernel(pmd, CONSISTENT_BASE, GFP_KERNEL);
 		if (!pte) {
 			ret = -ENOMEM;
 			break;
diff --git a/arch/nios2/include/asm/pgalloc.h b/arch/nios2/include/asm/pgalloc.h
index 3a149ead12..2ce9bd5399 100644
--- a/arch/nios2/include/asm/pgalloc.h
+++ b/arch/nios2/include/asm/pgalloc.h
@@ -37,13 +37,9 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 	free_pages((unsigned long)pgd, PGD_ORDER);
 }
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	pte_t *pte;
-
-	pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_ZERO, PTE_ORDER);
-
-	return pte;
+	return (pte_t *)__get_free_pages(gfp | __GFP_ZERO, PTE_ORDER);
 }
 
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
diff --git a/arch/nios2/mm/ioremap.c b/arch/nios2/mm/ioremap.c
index 3a28177a01..50c38da029 100644
--- a/arch/nios2/mm/ioremap.c
+++ b/arch/nios2/mm/ioremap.c
@@ -61,7 +61,7 @@ static inline int remap_area_pmd(pmd_t *pmd, unsigned long address,
 	if (address >= end)
 		BUG();
 	do {
-		pte_t *pte = pte_alloc_kernel(pmd, address);
+		pte_t *pte = pte_alloc_kernel(pmd, address, GFP_KERNEL);
 
 		if (!pte)
 			return -ENOMEM;
@@ -90,10 +90,10 @@ static int remap_area_pages(unsigned long address, unsigned long phys_addr,
 		pmd_t *pmd;
 
 		error = -ENOMEM;
-		pud = pud_alloc(&init_mm, dir, address);
+		pud = pud_alloc(&init_mm, dir, address, GFP_KERNEL);
 		if (!pud)
 			break;
-		pmd = pmd_alloc(&init_mm, pud, address);
+		pmd = pmd_alloc(&init_mm, pud, address, GFP_KERNEL);
 		if (!pmd)
 			break;
 		if (remap_area_pmd(pmd, address, end - address,
diff --git a/arch/openrisc/include/asm/pgalloc.h b/arch/openrisc/include/asm/pgalloc.h
index 149c82ee4b..f33f2a4504 100644
--- a/arch/openrisc/include/asm/pgalloc.h
+++ b/arch/openrisc/include/asm/pgalloc.h
@@ -70,7 +70,7 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 	free_page((unsigned long)pgd);
 }
 
-extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm);
+extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp);
 
 static inline struct page *pte_alloc_one(struct mm_struct *mm)
 {
diff --git a/arch/openrisc/mm/ioremap.c b/arch/openrisc/mm/ioremap.c
index a8509950db..93d295d26a 100644
--- a/arch/openrisc/mm/ioremap.c
+++ b/arch/openrisc/mm/ioremap.c
@@ -118,12 +118,12 @@ EXPORT_SYMBOL(iounmap);
  * the memblock infrastructure.
  */
 
-pte_t __ref *pte_alloc_one_kernel(struct mm_struct *mm)
+pte_t __ref *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
 	pte_t *pte;
 
 	if (likely(mem_init_done)) {
-		pte = (pte_t *)get_zeroed_page(GFP_KERNEL);
+		pte = (pte_t *)get_zeroed_page(gfp);
 	} else {
 		pte = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
 		if (!pte)
diff --git a/arch/parisc/include/asm/pgalloc.h b/arch/parisc/include/asm/pgalloc.h
index d05c678c77..705f5fffbd 100644
--- a/arch/parisc/include/asm/pgalloc.h
+++ b/arch/parisc/include/asm/pgalloc.h
@@ -62,12 +62,10 @@ static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmd)
 		        (__u32)(__pa((unsigned long)pmd) >> PxD_VALUE_SHIFT));
 }
 
-static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
+static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address,
+				   gfp_t gfp)
 {
-	pmd_t *pmd = (pmd_t *)__get_free_pages(GFP_KERNEL, PMD_ORDER);
-	if (pmd)
-		memset(pmd, 0, PAGE_SIZE<<PMD_ORDER);
-	return pmd;
+	return (pmd_t *)__get_free_pages(gfp | __GFP_ZERO, PMD_ORDER);
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
@@ -94,7 +92,7 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
  * inside the pgd, so has no extra memory associated with it.
  */
 
-#define pmd_alloc_one(mm, addr)		({ BUG(); ((pmd_t *)2); })
+#define pmd_alloc_one(mm, addr, gfp)	({ BUG(); ((pmd_t *)2); })
 #define pmd_free(mm, x)			do { } while (0)
 #define pgd_populate(mm, pmd, pte)	BUG()
 
@@ -134,11 +132,9 @@ pte_alloc_one(struct mm_struct *mm)
 	return page;
 }
 
-static inline pte_t *
-pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
-	return pte;
+	return (pte_t *)get_zeroed_page(gfp);
 }
 
 static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
diff --git a/arch/parisc/kernel/pci-dma.c b/arch/parisc/kernel/pci-dma.c
index 239162355b..285688e735 100644
--- a/arch/parisc/kernel/pci-dma.c
+++ b/arch/parisc/kernel/pci-dma.c
@@ -113,7 +113,7 @@ static inline int map_pmd_uncached(pmd_t * pmd, unsigned long vaddr,
 	if (end > PGDIR_SIZE)
 		end = PGDIR_SIZE;
 	do {
-		pte_t * pte = pte_alloc_kernel(pmd, vaddr);
+		pte_t *pte = pte_alloc_kernel(pmd, vaddr, GFP_KERNEL);
 		if (!pte)
 			return -ENOMEM;
 		if (map_pte_uncached(pte, orig_vaddr, end - vaddr, paddr_ptr))
@@ -134,8 +134,8 @@ static inline int map_uncached_pages(unsigned long vaddr, unsigned long size,
 	dir = pgd_offset_k(vaddr);
 	do {
 		pmd_t *pmd;
-		
-		pmd = pmd_alloc(NULL, dir, vaddr);
+
+		pmd = pmd_alloc(NULL, dir, vaddr, GFP_KERNEL);
 		if (!pmd)
 			return -ENOMEM;
 		if (map_pmd_uncached(pmd, vaddr, end - vaddr, &paddr))
diff --git a/arch/parisc/mm/hugetlbpage.c b/arch/parisc/mm/hugetlbpage.c
index d77479ae3a..6351549539 100644
--- a/arch/parisc/mm/hugetlbpage.c
+++ b/arch/parisc/mm/hugetlbpage.c
@@ -61,9 +61,9 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 	addr &= HPAGE_MASK;
 
 	pgd = pgd_offset(mm, addr);
-	pud = pud_alloc(mm, pgd, addr);
+	pud = pud_alloc(mm, pgd, addr, GFP_KERNEL);
 	if (pud) {
-		pmd = pmd_alloc(mm, pud, addr);
+		pmd = pmd_alloc(mm, pud, addr, GFP_KERNEL);
 		if (pmd)
 			pte = pte_alloc_map(mm, pmd, addr);
 	}
diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h b/arch/powerpc/include/asm/book3s/32/pgalloc.h
index 3633502e10..9032660c0e 100644
--- a/arch/powerpc/include/asm/book3s/32/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h
@@ -42,7 +42,7 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
  * We don't have any real pmd's, and this code never triggers because
  * the pgd will always be present..
  */
-/* #define pmd_alloc_one(mm,address)       ({ BUG(); ((pmd_t *)2); }) */
+/* #define pmd_alloc_one(mm,address,gfp) ({ BUG(); ((pmd_t *)2); }) */
 #define pmd_free(mm, x) 		do { } while (0)
 #define __pmd_free_tlb(tlb,x,a)		do { } while (0)
 /* #define pgd_populate(mm, pmd, pte)      BUG() */
@@ -61,7 +61,7 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmdp,
 
 #define pmd_pgtable(pmd) ((pgtable_t)pmd_page_vaddr(pmd))
 
-extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm);
+extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp);
 extern pgtable_t pte_alloc_one(struct mm_struct *mm);
 void pte_frag_destroy(void *pte_frag);
 pte_t *pte_fragment_alloc(struct mm_struct *mm, int kernel);
diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index 138bc2ecc0..c2199361cf 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -39,8 +39,8 @@ extern struct vmemmap_backing *vmemmap_list;
 extern struct kmem_cache *pgtable_cache[];
 #define PGT_CACHE(shift) pgtable_cache[shift]
 
-extern pte_t *pte_fragment_alloc(struct mm_struct *, int);
-extern pmd_t *pmd_fragment_alloc(struct mm_struct *, unsigned long);
+extern pte_t *pte_fragment_alloc(struct mm_struct *, int, gfp_t);
+extern pmd_t *pmd_fragment_alloc(struct mm_struct *, unsigned long, gfp_t);
 extern void pte_fragment_free(unsigned long *, int);
 extern void pmd_fragment_free(unsigned long *);
 extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift);
@@ -114,12 +114,13 @@ static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
 	*pgd =  __pgd(__pgtable_ptr_val(pud) | PGD_VAL_BITS);
 }
 
-static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
 	pud_t *pud;
 
 	pud = kmem_cache_alloc(PGT_CACHE(PUD_CACHE_INDEX),
-			       pgtable_gfp_flags(mm, GFP_KERNEL));
+			       pgtable_gfp_flags(mm, gfp));
 	/*
 	 * Tell kmemleak to ignore the PUD, that means don't scan it for
 	 * pointers and don't consider it a leak. PUDs are typically only
@@ -152,9 +153,10 @@ static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pud,
 	pgtable_free_tlb(tlb, pud, PUD_INDEX);
 }
 
-static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
-	return pmd_fragment_alloc(mm, addr);
+	return pmd_fragment_alloc(mm, addr, gfp);
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
@@ -190,14 +192,14 @@ static inline pgtable_t pmd_pgtable(pmd_t pmd)
 	return (pgtable_t)pmd_page_vaddr(pmd);
 }
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	return (pte_t *)pte_fragment_alloc(mm, 1);
+	return (pte_t *)pte_fragment_alloc(mm, 1, gfp);
 }
 
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
-	return (pgtable_t)pte_fragment_alloc(mm, 0);
+	return (pgtable_t)pte_fragment_alloc(mm, 0, GFP_KERNEL);
 }
 
 static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
diff --git a/arch/powerpc/include/asm/nohash/32/pgalloc.h b/arch/powerpc/include/asm/nohash/32/pgalloc.h
index bd186e85b4..8a5a944251 100644
--- a/arch/powerpc/include/asm/nohash/32/pgalloc.h
+++ b/arch/powerpc/include/asm/nohash/32/pgalloc.h
@@ -42,8 +42,8 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
  * We don't have any real pmd's, and this code never triggers because
  * the pgd will always be present..
  */
-/* #define pmd_alloc_one(mm,address)       ({ BUG(); ((pmd_t *)2); }) */
-#define pmd_free(mm, x) 		do { } while (0)
+/* #define pmd_alloc_one(mm,address,gfp) ({ BUG(); ((pmd_t *)2); }) */
+#define pmd_free(mm, x)			do { } while (0)
 #define __pmd_free_tlb(tlb,x,a)		do { } while (0)
 /* #define pgd_populate(mm, pmd, pte)      BUG() */
 
@@ -79,7 +79,7 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmdp,
 #define pmd_pgtable(pmd) ((pgtable_t)pmd_page_vaddr(pmd))
 #endif
 
-extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm);
+extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp);
 extern pgtable_t pte_alloc_one(struct mm_struct *mm);
 void pte_frag_destroy(void *pte_frag);
 pte_t *pte_fragment_alloc(struct mm_struct *mm, int kernel);
diff --git a/arch/powerpc/include/asm/nohash/64/pgalloc.h b/arch/powerpc/include/asm/nohash/64/pgalloc.h
index 66d086f85b..e30f21916a 100644
--- a/arch/powerpc/include/asm/nohash/64/pgalloc.h
+++ b/arch/powerpc/include/asm/nohash/64/pgalloc.h
@@ -51,10 +51,11 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 
 #define pgd_populate(MM, PGD, PUD)	pgd_set(PGD, (unsigned long)PUD)
 
-static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
 	return kmem_cache_alloc(PGT_CACHE(PUD_INDEX_SIZE),
-			pgtable_gfp_flags(mm, GFP_KERNEL));
+			pgtable_gfp_flags(mm, gfp));
 }
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
@@ -81,10 +82,11 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 
 #define pmd_pgtable(pmd) pmd_page(pmd)
 
-static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
 	return kmem_cache_alloc(PGT_CACHE(PMD_CACHE_INDEX),
-			pgtable_gfp_flags(mm, GFP_KERNEL));
+			pgtable_gfp_flags(mm, gfp));
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
@@ -93,9 +95,9 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 }
 
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	return (pte_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+	return (pte_t *)get_zeroed_page(gfp);
 }
 
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index f55ef07188..d9a9856029 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -585,7 +585,7 @@ int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
 	if (pgd_present(*pgd))
 		pud = pud_offset(pgd, gpa);
 	else
-		new_pud = pud_alloc_one(kvm->mm, gpa);
+		new_pud = pud_alloc_one(kvm->mm, gpa, GFP_KERNEL);
 
 	pmd = NULL;
 	if (pud && pud_present(*pud) && !pud_huge(*pud))
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 9e732bb2c8..f66c42c933 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -153,7 +153,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz
 		hpdp = (hugepd_t *)pg;
 	} else {
 		pdshift = PUD_SHIFT;
-		pu = pud_alloc(mm, pg, addr);
+		pu = pud_alloc(mm, pg, addr, GFP_KERNEL);
 		if (pshift == PUD_SHIFT)
 			return (pte_t *)pu;
 		else if (pshift > PMD_SHIFT) {
@@ -161,7 +161,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz
 			hpdp = (hugepd_t *)pu;
 		} else {
 			pdshift = PMD_SHIFT;
-			pm = pmd_alloc(mm, pu, addr);
+			pm = pmd_alloc(mm, pu, addr, GFP_KERNEL);
 			if (pshift == PMD_SHIFT)
 				/* 16MB hugepage */
 				return (pte_t *)pm;
@@ -177,13 +177,13 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz
 		hpdp = (hugepd_t *)pg;
 	} else {
 		pdshift = PUD_SHIFT;
-		pu = pud_alloc(mm, pg, addr);
+		pu = pud_alloc(mm, pg, addr, GFP_KERNEL);
 		if (pshift >= PUD_SHIFT) {
 			ptl = pud_lockptr(mm, pu);
 			hpdp = (hugepd_t *)pu;
 		} else {
 			pdshift = PMD_SHIFT;
-			pm = pmd_alloc(mm, pu, addr);
+			pm = pmd_alloc(mm, pu, addr, GFP_KERNEL);
 			ptl = pmd_lockptr(mm, pm);
 			hpdp = (hugepd_t *)pm;
 		}
diff --git a/arch/powerpc/mm/pgtable-book3e.c b/arch/powerpc/mm/pgtable-book3e.c
index 1032ef7aaf..43bcc3bc8a 100644
--- a/arch/powerpc/mm/pgtable-book3e.c
+++ b/arch/powerpc/mm/pgtable-book3e.c
@@ -84,13 +84,13 @@ int map_kernel_page(unsigned long ea, unsigned long pa, pgprot_t prot)
 	BUILD_BUG_ON(TASK_SIZE_USER64 > PGTABLE_RANGE);
 	if (slab_is_available()) {
 		pgdp = pgd_offset_k(ea);
-		pudp = pud_alloc(&init_mm, pgdp, ea);
+		pudp = pud_alloc(&init_mm, pgdp, ea, GFP_KERNEL);
 		if (!pudp)
 			return -ENOMEM;
-		pmdp = pmd_alloc(&init_mm, pudp, ea);
+		pmdp = pmd_alloc(&init_mm, pudp, ea, GFP_KERNEL);
 		if (!pmdp)
 			return -ENOMEM;
-		ptep = pte_alloc_kernel(pmdp, ea);
+		ptep = pte_alloc_kernel(pmdp, ea, GFP_KERNEL);
 		if (!ptep)
 			return -ENOMEM;
 	} else {
diff --git a/arch/powerpc/mm/pgtable-book3s64.c b/arch/powerpc/mm/pgtable-book3s64.c
index a4341aba0a..cfb417ce6a 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64.c
@@ -262,15 +262,14 @@ static pmd_t *get_pmd_from_cache(struct mm_struct *mm)
 	return (pmd_t *)ret;
 }
 
-static pmd_t *__alloc_for_pmdcache(struct mm_struct *mm)
+static pmd_t *__alloc_for_pmdcache(struct mm_struct *mm, gfp_t gfp)
 {
 	void *ret = NULL;
 	struct page *page;
-	gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO;
 
-	if (mm == &init_mm)
-		gfp &= ~__GFP_ACCOUNT;
-	page = alloc_page(gfp);
+	if (mm != &init_mm)
+		gfp |= __GFP_ACCOUNT;
+	page = alloc_page(gfp | __GFP_ZERO);
 	if (!page)
 		return NULL;
 	if (!pgtable_pmd_page_ctor(page)) {
@@ -303,7 +302,8 @@ static pmd_t *__alloc_for_pmdcache(struct mm_struct *mm)
 	return (pmd_t *)ret;
 }
 
-pmd_t *pmd_fragment_alloc(struct mm_struct *mm, unsigned long vmaddr)
+pmd_t *pmd_fragment_alloc(struct mm_struct *mm, unsigned long vmaddr,
+			  gfp_t gfp)
 {
 	pmd_t *pmd;
 
@@ -311,7 +311,7 @@ pmd_t *pmd_fragment_alloc(struct mm_struct *mm, unsigned long vmaddr)
 	if (pmd)
 		return pmd;
 
-	return __alloc_for_pmdcache(mm);
+	return __alloc_for_pmdcache(mm, gfp);
 }
 
 void pmd_fragment_free(unsigned long *pmd)
diff --git a/arch/powerpc/mm/pgtable-hash64.c b/arch/powerpc/mm/pgtable-hash64.c
index c08d49046a..d90deb67d8 100644
--- a/arch/powerpc/mm/pgtable-hash64.c
+++ b/arch/powerpc/mm/pgtable-hash64.c
@@ -152,13 +152,13 @@ int hash__map_kernel_page(unsigned long ea, unsigned long pa, pgprot_t prot)
 	BUILD_BUG_ON(TASK_SIZE_USER64 > H_PGTABLE_RANGE);
 	if (slab_is_available()) {
 		pgdp = pgd_offset_k(ea);
-		pudp = pud_alloc(&init_mm, pgdp, ea);
+		pudp = pud_alloc(&init_mm, pgdp, ea, GFP_KERNEL);
 		if (!pudp)
 			return -ENOMEM;
-		pmdp = pmd_alloc(&init_mm, pudp, ea);
+		pmdp = pmd_alloc(&init_mm, pudp, ea, GFP_KERNEL);
 		if (!pmdp)
 			return -ENOMEM;
-		ptep = pte_alloc_kernel(pmdp, ea);
+		ptep = pte_alloc_kernel(pmdp, ea, GFP_KERNEL);
 		if (!ptep)
 			return -ENOMEM;
 		set_pte_at(&init_mm, ea, ptep, pfn_pte(pa >> PAGE_SHIFT, prot));
diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index 154472a28c..0fbc67a090 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -145,21 +145,21 @@ static int __map_kernel_page(unsigned long ea, unsigned long pa,
 	 * boot.
 	 */
 	pgdp = pgd_offset_k(ea);
-	pudp = pud_alloc(&init_mm, pgdp, ea);
+	pudp = pud_alloc(&init_mm, pgdp, ea, GFP_KERNEL);
 	if (!pudp)
 		return -ENOMEM;
 	if (map_page_size == PUD_SIZE) {
 		ptep = (pte_t *)pudp;
 		goto set_the_pte;
 	}
-	pmdp = pmd_alloc(&init_mm, pudp, ea);
+	pmdp = pmd_alloc(&init_mm, pudp, ea, GFP_KERNEL);
 	if (!pmdp)
 		return -ENOMEM;
 	if (map_page_size == PMD_SIZE) {
 		ptep = pmdp_ptep(pmdp);
 		goto set_the_pte;
 	}
-	ptep = pte_alloc_kernel(pmdp, ea);
+	ptep = pte_alloc_kernel(pmdp, ea, GFP_KERNEL);
 	if (!ptep)
 		return -ENOMEM;
 
@@ -194,21 +194,21 @@ void radix__change_memory_range(unsigned long start, unsigned long end,
 
 	for (idx = start; idx < end; idx += PAGE_SIZE) {
 		pgdp = pgd_offset_k(idx);
-		pudp = pud_alloc(&init_mm, pgdp, idx);
+		pudp = pud_alloc(&init_mm, pgdp, idx, GFP_KERNEL);
 		if (!pudp)
 			continue;
 		if (pud_huge(*pudp)) {
 			ptep = (pte_t *)pudp;
 			goto update_the_pte;
 		}
-		pmdp = pmd_alloc(&init_mm, pudp, idx);
+		pmdp = pmd_alloc(&init_mm, pudp, idx, GFP_KERNEL);
 		if (!pmdp)
 			continue;
 		if (pmd_huge(*pmdp)) {
 			ptep = pmdp_ptep(pmdp);
 			goto update_the_pte;
 		}
-		ptep = pte_alloc_kernel(pmdp, idx);
+		ptep = pte_alloc_kernel(pmdp, idx, GFP_KERNEL);
 		if (!ptep)
 			continue;
 update_the_pte:
diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
index 6e56a6240b..eb474a99f0 100644
--- a/arch/powerpc/mm/pgtable_32.c
+++ b/arch/powerpc/mm/pgtable_32.c
@@ -43,12 +43,12 @@ EXPORT_SYMBOL(ioremap_bot);	/* aka VMALLOC_END */
 
 extern char etext[], _stext[], _sinittext[], _einittext[];
 
-__ref pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+__ref pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
 	if (!slab_is_available())
 		return memblock_alloc(PTE_FRAG_SIZE, PTE_FRAG_SIZE);
 
-	return (pte_t *)pte_fragment_alloc(mm, 1);
+	return (pte_t *)pte_fragment_alloc(mm, 1, gfp);
 }
 
 pgtable_t pte_alloc_one(struct mm_struct *mm)
@@ -214,7 +214,7 @@ int map_kernel_page(unsigned long va, phys_addr_t pa, pgprot_t prot)
 	/* Use upper 10 bits of VA to index the first level map */
 	pd = pmd_offset(pud_offset(pgd_offset_k(va), va), va);
 	/* Use middle 10 bits of VA to index the second-level map */
-	pg = pte_alloc_kernel(pd, va);
+	pg = pte_alloc_kernel(pd, va, GFP_KERNEL);
 	if (pg != 0) {
 		err = 0;
 		/* The PTE should never be already set nor present in the
diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
index 94043cf83c..991c8d268e 100644
--- a/arch/riscv/include/asm/pgalloc.h
+++ b/arch/riscv/include/asm/pgalloc.h
@@ -67,10 +67,10 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 
 #ifndef __PAGETABLE_PMD_FOLDED
 
-static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
-	return (pmd_t *)__get_free_page(
-		GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_ZERO);
+	return (pmd_t *)get_zeroed_page(gfp | __GFP_RETRY_MAYFAIL);
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
@@ -82,10 +82,9 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 
 #endif /* __PAGETABLE_PMD_FOLDED */
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	return (pte_t *)__get_free_page(
-		GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_ZERO);
+	return (pte_t *)get_zeroed_page(gfp | __GFP_RETRY_MAYFAIL);
 }
 
 static inline struct page *pte_alloc_one(struct mm_struct *mm)
diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index bccb8f4a63..49fb025627 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -19,10 +19,10 @@
 
 #define CRST_ALLOC_ORDER 2
 
-unsigned long *crst_table_alloc(struct mm_struct *);
+unsigned long *crst_table_alloc(struct mm_struct *, gfp_t);
 void crst_table_free(struct mm_struct *, unsigned long *);
 
-unsigned long *page_table_alloc(struct mm_struct *);
+unsigned long *page_table_alloc(struct mm_struct *, gfp_t);
 struct page *page_table_alloc_pgste(struct mm_struct *mm);
 void page_table_free(struct mm_struct *, unsigned long *);
 void page_table_free_rcu(struct mmu_gather *, unsigned long *, unsigned long);
@@ -48,9 +48,10 @@ static inline unsigned long pgd_entry_type(struct mm_struct *mm)
 int crst_table_upgrade(struct mm_struct *mm, unsigned long limit);
 void crst_table_downgrade(struct mm_struct *);
 
-static inline p4d_t *p4d_alloc_one(struct mm_struct *mm, unsigned long address)
+static inline p4d_t *p4d_alloc_one(struct mm_struct *mm, unsigned long address,
+				   gfp_t gfp)
 {
-	unsigned long *table = crst_table_alloc(mm);
+	unsigned long *table = crst_table_alloc(mm, gfp);
 
 	if (table)
 		crst_table_init(table, _REGION2_ENTRY_EMPTY);
@@ -58,18 +59,20 @@ static inline p4d_t *p4d_alloc_one(struct mm_struct *mm, unsigned long address)
 }
 #define p4d_free(mm, p4d) crst_table_free(mm, (unsigned long *) p4d)
 
-static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long address)
+static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long address,
+				   gfp_t gfp)
 {
-	unsigned long *table = crst_table_alloc(mm);
+	unsigned long *table = crst_table_alloc(mm, gfp);
 	if (table)
 		crst_table_init(table, _REGION3_ENTRY_EMPTY);
 	return (pud_t *) table;
 }
 #define pud_free(mm, pud) crst_table_free(mm, (unsigned long *) pud)
 
-static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long vmaddr)
+static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long vmaddr,
+				   gfp_t gfp)
 {
-	unsigned long *table = crst_table_alloc(mm);
+	unsigned long *table = crst_table_alloc(mm, gfp);
 
 	if (!table)
 		return NULL;
@@ -104,7 +107,7 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
 
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	unsigned long *table = crst_table_alloc(mm);
+	unsigned long *table = crst_table_alloc(mm, GFP_KERNEL);
 
 	if (!table)
 		return NULL;
@@ -139,8 +142,8 @@ static inline void pmd_populate(struct mm_struct *mm,
 /*
  * page table entry allocation/free routines.
  */
-#define pte_alloc_one_kernel(mm) ((pte_t *)page_table_alloc(mm))
-#define pte_alloc_one(mm) ((pte_t *)page_table_alloc(mm))
+#define pte_alloc_one_kernel(mm, gfp) ((pte_t *)page_table_alloc(mm, gfp))
+#define pte_alloc_one(mm) ((pte_t *)page_table_alloc(mm, GFP_KERNEL))
 
 #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
 #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c
index b0246c705a..eeb1468369 100644
--- a/arch/s390/mm/hugetlbpage.c
+++ b/arch/s390/mm/hugetlbpage.c
@@ -192,14 +192,14 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 	pmd_t *pmdp = NULL;
 
 	pgdp = pgd_offset(mm, addr);
-	p4dp = p4d_alloc(mm, pgdp, addr);
+	p4dp = p4d_alloc(mm, pgdp, addr, GFP_KERNEL);
 	if (p4dp) {
-		pudp = pud_alloc(mm, p4dp, addr);
+		pudp = pud_alloc(mm, p4dp, addr, GFP_KERNEL);
 		if (pudp) {
 			if (sz == PUD_SIZE)
 				return (pte_t *) pudp;
 			else if (sz == PMD_SIZE)
-				pmdp = pmd_alloc(mm, pudp, addr);
+				pmdp = pmd_alloc(mm, pudp, addr, GFP_KERNEL);
 		}
 	}
 	return (pte_t *) pmdp;
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index db6bb2f97a..b8c309de98 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -53,9 +53,9 @@ __initcall(page_table_register_sysctl);
 
 #endif /* CONFIG_PGSTE */
 
-unsigned long *crst_table_alloc(struct mm_struct *mm)
+unsigned long *crst_table_alloc(struct mm_struct *mm, gfp_t gfp)
 {
-	struct page *page = alloc_pages(GFP_KERNEL, 2);
+	struct page *page = alloc_pages(gfp, 2);
 
 	if (!page)
 		return NULL;
@@ -87,7 +87,7 @@ int crst_table_upgrade(struct mm_struct *mm, unsigned long end)
 	rc = 0;
 	notify = 0;
 	while (mm->context.asce_limit < end) {
-		table = crst_table_alloc(mm);
+		table = crst_table_alloc(mm, GFP_KERNEL);
 		if (!table) {
 			rc = -ENOMEM;
 			break;
@@ -179,7 +179,7 @@ void page_table_free_pgste(struct page *page)
 /*
  * page table entry allocation/free routines.
  */
-unsigned long *page_table_alloc(struct mm_struct *mm)
+unsigned long *page_table_alloc(struct mm_struct *mm, gfp_t gfp)
 {
 	unsigned long *table;
 	struct page *page;
@@ -209,7 +209,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
 			return table;
 	}
 	/* Allocate a fresh page */
-	page = alloc_page(GFP_KERNEL);
+	page = alloc_page(gfp);
 	if (!page)
 		return NULL;
 	if (!pgtable_page_ctor(page)) {
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 8485d6dc27..0bc3249927 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -418,13 +418,13 @@ static pmd_t *pmd_alloc_map(struct mm_struct *mm, unsigned long addr)
 	pmd_t *pmd;
 
 	pgd = pgd_offset(mm, addr);
-	p4d = p4d_alloc(mm, pgd, addr);
+	p4d = p4d_alloc(mm, pgd, addr, GFP_KERNEL);
 	if (!p4d)
 		return NULL;
-	pud = pud_alloc(mm, p4d, addr);
+	pud = pud_alloc(mm, p4d, addr, GFP_KERNEL);
 	if (!pud)
 		return NULL;
-	pmd = pmd_alloc(mm, pud, addr);
+	pmd = pmd_alloc(mm, pud, addr, GFP_KERNEL);
 	return pmd;
 }
 
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index 0472e27feb..47ffefab75 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -54,7 +54,7 @@ pte_t __ref *vmem_pte_alloc(void)
 	pte_t *pte;
 
 	if (slab_is_available())
-		pte = (pte_t *) page_table_alloc(&init_mm);
+		pte = (pte_t *) page_table_alloc(&init_mm, GFP_KERNEL);
 	else
 		pte = (pte_t *) memblock_phys_alloc(size, size);
 	if (!pte)
diff --git a/arch/sh/include/asm/pgalloc.h b/arch/sh/include/asm/pgalloc.h
index 8ad73cb311..bd51502e8b 100644
--- a/arch/sh/include/asm/pgalloc.h
+++ b/arch/sh/include/asm/pgalloc.h
@@ -12,7 +12,8 @@ extern void pgd_free(struct mm_struct *mm, pgd_t *pgd);
 
 #if PAGETABLE_LEVELS > 2
 extern void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmd);
-extern pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address);
+extern pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address,
+			    gfp_t gfp);
 extern void pmd_free(struct mm_struct *mm, pmd_t *pmd);
 #endif
 
@@ -32,9 +33,9 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 /*
  * Allocate and free page tables.
  */
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	return quicklist_alloc(QUICK_PT, GFP_KERNEL, NULL);
+	return quicklist_alloc(QUICK_PT, gfp, NULL);
 }
 
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
diff --git a/arch/sh/mm/hugetlbpage.c b/arch/sh/mm/hugetlbpage.c
index 960deb1f24..1eb4932cdb 100644
--- a/arch/sh/mm/hugetlbpage.c
+++ b/arch/sh/mm/hugetlbpage.c
@@ -32,9 +32,9 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 
 	pgd = pgd_offset(mm, addr);
 	if (pgd) {
-		pud = pud_alloc(mm, pgd, addr);
+		pud = pud_alloc(mm, pgd, addr, GFP_KERNEL);
 		if (pud) {
-			pmd = pmd_alloc(mm, pud, addr);
+			pmd = pmd_alloc(mm, pud, addr, GFP_KERNEL);
 			if (pmd)
 				pte = pte_alloc_map(mm, pmd, addr);
 		}
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index 70621324db..4bd118c32e 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -53,13 +53,13 @@ static pte_t *__get_pte_phys(unsigned long addr)
 		return NULL;
 	}
 
-	pud = pud_alloc(NULL, pgd, addr);
+	pud = pud_alloc(NULL, pgd, addr, GFP_KERNEL);
 	if (unlikely(!pud)) {
 		pud_ERROR(*pud);
 		return NULL;
 	}
 
-	pmd = pmd_alloc(NULL, pud, addr);
+	pmd = pmd_alloc(NULL, pud, addr, GFP_KERNEL);
 	if (unlikely(!pmd)) {
 		pmd_ERROR(*pmd);
 		return NULL;
diff --git a/arch/sh/mm/pgtable.c b/arch/sh/mm/pgtable.c
index 5c8f9247c3..972f54fa09 100644
--- a/arch/sh/mm/pgtable.c
+++ b/arch/sh/mm/pgtable.c
@@ -2,8 +2,6 @@
 #include <linux/mm.h>
 #include <linux/slab.h>
 
-#define PGALLOC_GFP GFP_KERNEL | __GFP_ZERO
-
 static struct kmem_cache *pgd_cachep;
 #if PAGETABLE_LEVELS > 2
 static struct kmem_cache *pmd_cachep;
@@ -32,7 +30,7 @@ void pgtable_cache_init(void)
 
 pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	return kmem_cache_alloc(pgd_cachep, PGALLOC_GFP);
+	return kmem_cache_alloc(pgd_cachep, GFP_KERNEL | __GFP_ZERO);
 }
 
 void pgd_free(struct mm_struct *mm, pgd_t *pgd)
@@ -46,9 +44,9 @@ void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
 	set_pud(pud, __pud((unsigned long)pmd));
 }
 
-pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
+pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address, gfp_t gfp)
 {
-	return kmem_cache_alloc(pmd_cachep, PGALLOC_GFP);
+	return kmem_cache_alloc(pmd_cachep, gfp | __GFP_ZERO);
 }
 
 void pmd_free(struct mm_struct *mm, pmd_t *pmd)
diff --git a/arch/sparc/include/asm/pgalloc_32.h b/arch/sparc/include/asm/pgalloc_32.h
index 282be50a4a..51dea1c004 100644
--- a/arch/sparc/include/asm/pgalloc_32.h
+++ b/arch/sparc/include/asm/pgalloc_32.h
@@ -38,7 +38,8 @@ static inline void pgd_set(pgd_t * pgdp, pmd_t * pmdp)
 #define pgd_populate(MM, PGD, PMD)      pgd_set(PGD, PMD)
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm,
-				   unsigned long address)
+				   unsigned long address,
+				   gfp_t gfp)
 {
 	return srmmu_get_nocache(SRMMU_PMD_TABLE_SIZE,
 				 SRMMU_PMD_TABLE_SIZE);
@@ -60,12 +61,11 @@ void pmd_set(pmd_t *pmdp, pte_t *ptep);
 
 pgtable_t pte_alloc_one(struct mm_struct *mm);
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
 	return srmmu_get_nocache(PTE_SIZE, PTE_SIZE);
 }
 
-
 static inline void free_pte_fast(pte_t *pte)
 {
 	srmmu_free_nocache(pte, PTE_SIZE);
diff --git a/arch/sparc/include/asm/pgalloc_64.h b/arch/sparc/include/asm/pgalloc_64.h
index 48abccba49..e772ee60ee 100644
--- a/arch/sparc/include/asm/pgalloc_64.h
+++ b/arch/sparc/include/asm/pgalloc_64.h
@@ -40,9 +40,10 @@ static inline void __pud_populate(pud_t *pud, pmd_t *pmd)
 
 #define pud_populate(MM, PUD, PMD)	__pud_populate(PUD, PMD)
 
-static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
-	return kmem_cache_alloc(pgtable_cache, GFP_KERNEL);
+	return kmem_cache_alloc(pgtable_cache, gfp);
 }
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
@@ -50,9 +51,10 @@ static inline void pud_free(struct mm_struct *mm, pud_t *pud)
 	kmem_cache_free(pgtable_cache, pud);
 }
 
-static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
-	return kmem_cache_alloc(pgtable_cache, GFP_KERNEL);
+	return kmem_cache_alloc(pgtable_cache, gfp);
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
@@ -60,7 +62,7 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 	kmem_cache_free(pgtable_cache, pmd);
 }
 
-pte_t *pte_alloc_one_kernel(struct mm_struct *mm);
+pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp);
 pgtable_t pte_alloc_one(struct mm_struct *mm);
 void pte_free_kernel(struct mm_struct *mm, pte_t *pte);
 void pte_free(struct mm_struct *mm, pgtable_t ptepage);
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index f78793a06b..aeacfb0aab 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -281,12 +281,12 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 	pmd_t *pmd;
 
 	pgd = pgd_offset(mm, addr);
-	pud = pud_alloc(mm, pgd, addr);
+	pud = pud_alloc(mm, pgd, addr, GFP_KERNEL);
 	if (!pud)
 		return NULL;
 	if (sz >= PUD_SIZE)
 		return (pte_t *)pud;
-	pmd = pmd_alloc(mm, pud, addr);
+	pmd = pmd_alloc(mm, pud, addr, GFP_KERNEL);
 	if (!pmd)
 		return NULL;
 	if (sz >= PMD_SIZE)
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index f2d70ff7a2..bd81b148f4 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2933,15 +2933,9 @@ void __flush_tlb_all(void)
 			     : : "r" (pstate));
 }
 
-pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);
-	pte_t *pte = NULL;
-
-	if (page)
-		pte = (pte_t *) page_address(page);
-
-	return pte;
+	return (pte_t *) get_zeroed_page(gfp);
 }
 
 pgtable_t pte_alloc_one(struct mm_struct *mm)
diff --git a/arch/sparc/mm/srmmu.c b/arch/sparc/mm/srmmu.c
index aaebbc00d2..143a5bc7ce 100644
--- a/arch/sparc/mm/srmmu.c
+++ b/arch/sparc/mm/srmmu.c
@@ -375,7 +375,7 @@ pgtable_t pte_alloc_one(struct mm_struct *mm)
 	unsigned long pte;
 	struct page *page;
 
-	if ((pte = (unsigned long)pte_alloc_one_kernel(mm)) == 0)
+	if ((pte = (unsigned long)pte_alloc_one_kernel(mm, GFP_KERNEL)) == 0)
 		return NULL;
 	page = pfn_to_page(__nocache_pa(pte) >> PAGE_SHIFT);
 	if (!pgtable_page_ctor(page)) {
diff --git a/arch/um/include/asm/pgalloc.h b/arch/um/include/asm/pgalloc.h
index 99eb568279..71090e43d0 100644
--- a/arch/um/include/asm/pgalloc.h
+++ b/arch/um/include/asm/pgalloc.h
@@ -25,7 +25,7 @@
 extern pgd_t *pgd_alloc(struct mm_struct *);
 extern void pgd_free(struct mm_struct *mm, pgd_t *pgd);
 
-extern pte_t *pte_alloc_one_kernel(struct mm_struct *);
+extern pte_t *pte_alloc_one_kernel(struct mm_struct *, gfp_t);
 extern pgtable_t pte_alloc_one(struct mm_struct *);
 
 static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
diff --git a/arch/um/include/asm/pgtable-3level.h b/arch/um/include/asm/pgtable-3level.h
index c4d876dfb9..7f5fd79234 100644
--- a/arch/um/include/asm/pgtable-3level.h
+++ b/arch/um/include/asm/pgtable-3level.h
@@ -80,7 +80,8 @@ static inline void pgd_mkuptodate(pgd_t pgd) { pgd_val(pgd) &= ~_PAGE_NEWPAGE; }
 #endif
 
 struct mm_struct;
-extern pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address);
+extern pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address,
+			    gfp_t gfp);
 
 static inline void pud_clear (pud_t *pud)
 {
diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c
index 99aa11bf53..3e0ce9f645 100644
--- a/arch/um/kernel/mem.c
+++ b/arch/um/kernel/mem.c
@@ -215,12 +215,9 @@ void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 	free_page((unsigned long) pgd);
 }
 
-pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	pte_t *pte;
-
-	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
-	return pte;
+	return (pte_t *)get_zeroed_page(gfp);
 }
 
 pgtable_t pte_alloc_one(struct mm_struct *mm)
@@ -238,14 +235,10 @@ pgtable_t pte_alloc_one(struct mm_struct *mm)
 }
 
 #ifdef CONFIG_3_LEVEL_PGTABLES
-pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
+pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address,
+		     gfp_t gfp)
 {
-	pmd_t *pmd = (pmd_t *) __get_free_page(GFP_KERNEL);
-
-	if (pmd)
-		memset(pmd, 0, PAGE_SIZE);
-
-	return pmd;
+	return (pmd_t *) get_zeroed_page(gfp);
 }
 #endif
 
diff --git a/arch/um/kernel/skas/mmu.c b/arch/um/kernel/skas/mmu.c
index 7a1f2a936f..b677b615d6 100644
--- a/arch/um/kernel/skas/mmu.c
+++ b/arch/um/kernel/skas/mmu.c
@@ -24,11 +24,11 @@ static int init_stub_pte(struct mm_struct *mm, unsigned long proc,
 	pte_t *pte;
 
 	pgd = pgd_offset(mm, proc);
-	pud = pud_alloc(mm, pgd, proc);
+	pud = pud_alloc(mm, pgd, proc, GFP_KERNEL);
 	if (!pud)
 		goto out;
 
-	pmd = pmd_alloc(mm, pud, proc);
+	pmd = pmd_alloc(mm, pud, proc, GFP_KERNEL);
 	if (!pmd)
 		goto out_pmd;
 
diff --git a/arch/unicore32/include/asm/pgalloc.h b/arch/unicore32/include/asm/pgalloc.h
index 7cceabecf4..e5f6c1ae64 100644
--- a/arch/unicore32/include/asm/pgalloc.h
+++ b/arch/unicore32/include/asm/pgalloc.h
@@ -28,17 +28,15 @@ extern void free_pgd_slow(struct mm_struct *mm, pgd_t *pgd);
 #define pgd_alloc(mm)			get_pgd_slow(mm)
 #define pgd_free(mm, pgd)		free_pgd_slow(mm, pgd)
 
-#define PGALLOC_GFP	(GFP_KERNEL | __GFP_ZERO)
-
 /*
  * Allocate one PTE table.
  */
 static inline pte_t *
-pte_alloc_one_kernel(struct mm_struct *mm)
+pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
 	pte_t *pte;
 
-	pte = (pte_t *)__get_free_page(PGALLOC_GFP);
+	pte = (pte_t *)get_zeroed_page(gfp);
 	if (pte)
 		clean_dcache_area(pte, PTRS_PER_PTE * sizeof(pte_t));
 
@@ -50,7 +48,7 @@ pte_alloc_one(struct mm_struct *mm)
 {
 	struct page *pte;
 
-	pte = alloc_pages(PGALLOC_GFP, 0);
+	pte = alloc_pages(GFP_KERNEL | __GFP_ZERO, 0);
 	if (!pte)
 		return NULL;
 	if (!PageHighMem(pte)) {
diff --git a/arch/unicore32/mm/pgd.c b/arch/unicore32/mm/pgd.c
index a830a300aa..b9c628a55f 100644
--- a/arch/unicore32/mm/pgd.c
+++ b/arch/unicore32/mm/pgd.c
@@ -50,7 +50,7 @@ pgd_t *get_pgd_slow(struct mm_struct *mm)
 		 * On UniCore, first page must always be allocated since it
 		 * contains the machine vectors.
 		 */
-		new_pmd = pmd_alloc(mm, (pud_t *)new_pgd, 0);
+		new_pmd = pmd_alloc(mm, (pud_t *)new_pgd, 0, GFP_KERNEL);
 		if (!new_pmd)
 			goto no_pmd;
 
diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index a281e61ec6..1909a8dfaf 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -47,7 +47,7 @@ extern gfp_t __userpte_alloc_gfp;
 extern pgd_t *pgd_alloc(struct mm_struct *);
 extern void pgd_free(struct mm_struct *mm, pgd_t *pgd);
 
-extern pte_t *pte_alloc_one_kernel(struct mm_struct *);
+extern pte_t *pte_alloc_one_kernel(struct mm_struct *, gfp_t);
 extern pgtable_t pte_alloc_one(struct mm_struct *);
 
 /* Should really implement gc for free page table pages. This could be
@@ -99,14 +99,14 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 #define pmd_pgtable(pmd) pmd_page(pmd)
 
 #if CONFIG_PGTABLE_LEVELS > 2
-static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
 	struct page *page;
-	gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO;
 
-	if (mm == &init_mm)
-		gfp &= ~__GFP_ACCOUNT;
-	page = alloc_pages(gfp, 0);
+	if (mm != &init_mm)
+		gfp |= __GFP_ACCOUNT;
+	page = alloc_pages(gfp | __GFP_ZERO, 0);
 	if (!page)
 		return NULL;
 	if (!pgtable_pmd_page_ctor(page)) {
@@ -160,12 +160,11 @@ static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d, pud_t *pu
 	set_p4d_safe(p4d, __p4d(_PAGE_TABLE | __pa(pud)));
 }
 
-static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
-	gfp_t gfp = GFP_KERNEL_ACCOUNT;
-
-	if (mm == &init_mm)
-		gfp &= ~__GFP_ACCOUNT;
+	if (mm != &init_mm)
+		gfp |= __GFP_ACCOUNT;
 	return (pud_t *)get_zeroed_page(gfp);
 }
 
@@ -200,12 +199,11 @@ static inline void pgd_populate_safe(struct mm_struct *mm, pgd_t *pgd, p4d_t *p4
 	set_pgd_safe(pgd, __pgd(_PAGE_TABLE | __pa(p4d)));
 }
 
-static inline p4d_t *p4d_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline p4d_t *p4d_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
-	gfp_t gfp = GFP_KERNEL_ACCOUNT;
-
-	if (mm == &init_mm)
-		gfp &= ~__GFP_ACCOUNT;
+	if (mm != &init_mm)
+		gfp |= __GFP_ACCOUNT;
 	return (p4d_t *)get_zeroed_page(gfp);
 }
 
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index aebd0d5bc0..46df9bb51f 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -126,7 +126,7 @@ void __init init_espfix_bsp(void)
 
 	/* Install the espfix pud into the kernel page directory */
 	pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
-	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
+	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR, GFP_KERNEL);
 	p4d_populate(&init_mm, p4d, espfix_pud_page);
 
 	/* Randomize the locations */
diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
index 6e5ef8fb8a..9a4f0fa6d6 100644
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -124,13 +124,13 @@ static int map_tboot_page(unsigned long vaddr, unsigned long pfn,
 	pte_t *pte;
 
 	pgd = pgd_offset(&tboot_mm, vaddr);
-	p4d = p4d_alloc(&tboot_mm, pgd, vaddr);
+	p4d = p4d_alloc(&tboot_mm, pgd, vaddr, GFP_KERNEL);
 	if (!p4d)
 		return -1;
-	pud = pud_alloc(&tboot_mm, p4d, vaddr);
+	pud = pud_alloc(&tboot_mm, p4d, vaddr, GFP_KERNEL);
 	if (!pud)
 		return -1;
-	pmd = pmd_alloc(&tboot_mm, pud, vaddr);
+	pmd = pmd_alloc(&tboot_mm, pud, vaddr, GFP_KERNEL);
 	if (!pmd)
 		return -1;
 	pte = pte_alloc_map(&tboot_mm, pmd, vaddr);
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 7bd01709a0..04cb5aec7b 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -23,9 +23,9 @@ EXPORT_SYMBOL(physical_mask);
 
 gfp_t __userpte_alloc_gfp = PGALLOC_GFP | PGALLOC_USER_GFP;
 
-pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	return (pte_t *)__get_free_page(PGALLOC_GFP & ~__GFP_ACCOUNT);
+	return (pte_t *) get_zeroed_page(gfp);
 }
 
 pgtable_t pte_alloc_one(struct mm_struct *mm)
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index cf0347f61b..9cb455ba28 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -106,7 +106,7 @@ pgd_t * __init efi_call_phys_prolog(void)
 		pgd_efi = pgd_offset_k(addr_pgd);
 		save_pgd[pgd] = *pgd_efi;
 
-		p4d = p4d_alloc(&init_mm, pgd_efi, addr_pgd);
+		p4d = p4d_alloc(&init_mm, pgd_efi, addr_pgd, GFP_KERNEL);
 		if (!p4d) {
 			pr_err("Failed to allocate p4d table!\n");
 			goto out;
@@ -116,7 +116,8 @@ pgd_t * __init efi_call_phys_prolog(void)
 			addr_p4d = addr_pgd + i * P4D_SIZE;
 			p4d_efi = p4d + p4d_index(addr_p4d);
 
-			pud = pud_alloc(&init_mm, p4d_efi, addr_p4d);
+			pud = pud_alloc(&init_mm, p4d_efi, addr_p4d,
+					GFP_KERNEL);
 			if (!pud) {
 				pr_err("Failed to allocate pud table!\n");
 				goto out;
@@ -217,13 +218,13 @@ int __init efi_alloc_page_tables(void)
 		return -ENOMEM;
 
 	pgd = efi_pgd + pgd_index(EFI_VA_END);
-	p4d = p4d_alloc(&init_mm, pgd, EFI_VA_END);
+	p4d = p4d_alloc(&init_mm, pgd, EFI_VA_END, GFP_KERNEL);
 	if (!p4d) {
 		free_page((unsigned long)efi_pgd);
 		return -ENOMEM;
 	}
 
-	pud = pud_alloc(&init_mm, p4d, EFI_VA_END);
+	pud = pud_alloc(&init_mm, p4d, EFI_VA_END, GFP_KERNEL);
 	if (!pud) {
 		if (pgtable_l5_enabled())
 			free_page((unsigned long) pgd_page_vaddr(*pgd));
diff --git a/arch/xtensa/include/asm/pgalloc.h b/arch/xtensa/include/asm/pgalloc.h
index b3b388ff2f..cc7ec6dd09 100644
--- a/arch/xtensa/include/asm/pgalloc.h
+++ b/arch/xtensa/include/asm/pgalloc.h
@@ -38,12 +38,12 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 	free_page((unsigned long)pgd);
 }
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
 	pte_t *ptep;
 	int i;
 
-	ptep = (pte_t *)__get_free_page(GFP_KERNEL);
+	ptep = (pte_t *)__get_free_page(gfp);
 	if (!ptep)
 		return NULL;
 	for (i = 0; i < 1024; i++)
diff --git a/drivers/staging/media/ipu3/ipu3-dmamap.c b/drivers/staging/media/ipu3/ipu3-dmamap.c
index d978a00e1e..f74221ab2b 100644
--- a/drivers/staging/media/ipu3/ipu3-dmamap.c
+++ b/drivers/staging/media/ipu3/ipu3-dmamap.c
@@ -137,7 +137,7 @@ void *imgu_dmamap_alloc(struct imgu_device *imgu, struct imgu_css_map *map,
 
 	map->vma->pages = pages;
 	/* And map it in KVA */
-	if (map_vm_area(map->vma, PAGE_KERNEL, pages))
+	if (map_vm_area(map->vma, GFP_KERNEL, PAGE_KERNEL, pages))
 		goto out_vunmap;
 
 	map->size = size;
diff --git a/include/asm-generic/4level-fixup.h b/include/asm-generic/4level-fixup.h
index e3667c9a33..652b68f475 100644
--- a/include/asm-generic/4level-fixup.h
+++ b/include/asm-generic/4level-fixup.h
@@ -12,9 +12,9 @@
 
 #define pud_t				pgd_t
 
-#define pmd_alloc(mm, pud, address) \
-	((unlikely(pgd_none(*(pud))) && __pmd_alloc(mm, pud, address))? \
- 		NULL: pmd_offset(pud, address))
+#define pmd_alloc(mm, pud, address, gfp) \
+	((unlikely(pgd_none(*(pud))) && __pmd_alloc(mm, pud, address, gfp)) \
+		? NULL : pmd_offset(pud, address))
 
 #define pud_offset(pgd, start)		(pgd)
 #define pud_none(pud)			0
diff --git a/include/asm-generic/5level-fixup.h b/include/asm-generic/5level-fixup.h
index bb6cb34701..c6f68f6a9f 100644
--- a/include/asm-generic/5level-fixup.h
+++ b/include/asm-generic/5level-fixup.h
@@ -13,11 +13,11 @@
 
 #define p4d_t				pgd_t
 
-#define pud_alloc(mm, p4d, address) \
-	((unlikely(pgd_none(*(p4d))) && __pud_alloc(mm, p4d, address)) ? \
+#define pud_alloc(mm, p4d, address, gfp) \
+	((unlikely(pgd_none(*(p4d))) && __pud_alloc(mm, p4d, address, gfp)) ? \
 		NULL : pud_offset(p4d, address))
 
-#define p4d_alloc(mm, pgd, address)	(pgd)
+#define p4d_alloc(mm, pgd, address, gfp) (pgd)
 #define p4d_offset(pgd, start)		(pgd)
 #define p4d_none(p4d)			0
 #define p4d_bad(p4d)			0
diff --git a/include/asm-generic/pgtable-nop4d-hack.h b/include/asm-generic/pgtable-nop4d-hack.h
index 829bdb0d63..3ba3c7e4b9 100644
--- a/include/asm-generic/pgtable-nop4d-hack.h
+++ b/include/asm-generic/pgtable-nop4d-hack.h
@@ -53,7 +53,7 @@ static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
  * allocating and freeing a pud is trivial: the 1-entry pud is
  * inside the pgd, so has no extra memory associated with it.
  */
-#define pud_alloc_one(mm, address)		NULL
+#define pud_alloc_one(mm, address, gfp)		NULL
 #define pud_free(mm, x)				do { } while (0)
 #define __pud_free_tlb(tlb, x, a)		do { } while (0)
 
diff --git a/include/asm-generic/pgtable-nop4d.h b/include/asm-generic/pgtable-nop4d.h
index aebab905e6..7c9e00e44d 100644
--- a/include/asm-generic/pgtable-nop4d.h
+++ b/include/asm-generic/pgtable-nop4d.h
@@ -48,7 +48,7 @@ static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
  * allocating and freeing a p4d is trivial: the 1-entry p4d is
  * inside the pgd, so has no extra memory associated with it.
  */
-#define p4d_alloc_one(mm, address)		NULL
+#define p4d_alloc_one(mm, address, gfp)		NULL
 #define p4d_free(mm, x)				do { } while (0)
 #define __p4d_free_tlb(tlb, x, a)		do { } while (0)
 
diff --git a/include/asm-generic/pgtable-nopmd.h b/include/asm-generic/pgtable-nopmd.h
index b85b8271a7..e4a51cbdef 100644
--- a/include/asm-generic/pgtable-nopmd.h
+++ b/include/asm-generic/pgtable-nopmd.h
@@ -56,7 +56,7 @@ static inline pmd_t * pmd_offset(pud_t * pud, unsigned long address)
  * allocating and freeing a pmd is trivial: the 1-entry pmd is
  * inside the pud, so has no extra memory associated with it.
  */
-#define pmd_alloc_one(mm, address)		NULL
+#define pmd_alloc_one(mm, address, gfp)		NULL
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 {
 }
diff --git a/include/asm-generic/pgtable-nopud.h b/include/asm-generic/pgtable-nopud.h
index c77a1d3011..e7aacf134c 100644
--- a/include/asm-generic/pgtable-nopud.h
+++ b/include/asm-generic/pgtable-nopud.h
@@ -57,7 +57,7 @@ static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
  * allocating and freeing a pud is trivial: the 1-entry pud is
  * inside the p4d, so has no extra memory associated with it.
  */
-#define pud_alloc_one(mm, address)		NULL
+#define pud_alloc_one(mm, address, gfp)		NULL
 #define pud_free(mm, x)				do { } while (0)
 #define __pud_free_tlb(tlb, x, a)		do { } while (0)
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6b10c21630..d6f315e106 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1749,17 +1749,18 @@ static inline pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr,
 
 #ifdef __PAGETABLE_P4D_FOLDED
 static inline int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd,
-						unsigned long address)
+			      unsigned long address, gfp_t gfp)
 {
 	return 0;
 }
 #else
-int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address);
+int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd,
+		unsigned long address, gfp_t gfp);
 #endif
 
 #if defined(__PAGETABLE_PUD_FOLDED) || !defined(CONFIG_MMU)
 static inline int __pud_alloc(struct mm_struct *mm, p4d_t *p4d,
-						unsigned long address)
+			      unsigned long address, gfp_t gfp)
 {
 	return 0;
 }
@@ -1767,7 +1768,8 @@ static inline void mm_inc_nr_puds(struct mm_struct *mm) {}
 static inline void mm_dec_nr_puds(struct mm_struct *mm) {}
 
 #else
-int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address);
+int __pud_alloc(struct mm_struct *mm, p4d_t *p4d,
+		unsigned long address, gfp_t gfp);
 
 static inline void mm_inc_nr_puds(struct mm_struct *mm)
 {
@@ -1786,7 +1788,7 @@ static inline void mm_dec_nr_puds(struct mm_struct *mm)
 
 #if defined(__PAGETABLE_PMD_FOLDED) || !defined(CONFIG_MMU)
 static inline int __pmd_alloc(struct mm_struct *mm, pud_t *pud,
-						unsigned long address)
+			      unsigned long address, gfp_t gfp)
 {
 	return 0;
 }
@@ -1795,7 +1797,8 @@ static inline void mm_inc_nr_pmds(struct mm_struct *mm) {}
 static inline void mm_dec_nr_pmds(struct mm_struct *mm) {}
 
 #else
-int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address);
+int __pmd_alloc(struct mm_struct *mm, pud_t *pud,
+		unsigned long address, gfp_t gfp);
 
 static inline void mm_inc_nr_pmds(struct mm_struct *mm)
 {
@@ -1845,7 +1848,7 @@ static inline void mm_dec_nr_ptes(struct mm_struct *mm) {}
 #endif
 
 int __pte_alloc(struct mm_struct *mm, pmd_t *pmd);
-int __pte_alloc_kernel(pmd_t *pmd);
+int __pte_alloc_kernel(pmd_t *pmd, gfp_t gfp);
 
 /*
  * The following ifdef needed to get the 4level-fixup.h header to work.
@@ -1855,24 +1858,25 @@ int __pte_alloc_kernel(pmd_t *pmd);
 
 #ifndef __ARCH_HAS_5LEVEL_HACK
 static inline p4d_t *p4d_alloc(struct mm_struct *mm, pgd_t *pgd,
-		unsigned long address)
+		unsigned long address, gfp_t gfp)
 {
-	return (unlikely(pgd_none(*pgd)) && __p4d_alloc(mm, pgd, address)) ?
-		NULL : p4d_offset(pgd, address);
+	return (unlikely(pgd_none(*pgd)) && __p4d_alloc(mm, pgd, address, gfp))
+	    ? NULL : p4d_offset(pgd, address);
 }
 
 static inline pud_t *pud_alloc(struct mm_struct *mm, p4d_t *p4d,
-		unsigned long address)
+		unsigned long address, gfp_t gfp)
 {
-	return (unlikely(p4d_none(*p4d)) && __pud_alloc(mm, p4d, address)) ?
-		NULL : pud_offset(p4d, address);
+	return (unlikely(p4d_none(*p4d)) && __pud_alloc(mm, p4d, address, gfp))
+	    ? NULL : pud_offset(p4d, address);
 }
 #endif /* !__ARCH_HAS_5LEVEL_HACK */
 
-static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
+static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud,
+		unsigned long address, gfp_t gfp)
 {
-	return (unlikely(pud_none(*pud)) && __pmd_alloc(mm, pud, address))?
-		NULL: pmd_offset(pud, address);
+	return (unlikely(pud_none(*pud)) && __pmd_alloc(mm, pud, address, gfp))
+	    ? NULL : pmd_offset(pud, address);
 }
 #endif /* CONFIG_MMU && !__ARCH_HAS_4LEVEL_HACK */
 
@@ -1985,8 +1989,8 @@ static inline void pgtable_page_dtor(struct page *page)
 	(pte_alloc(mm, pmd) ?			\
 		 NULL : pte_offset_map_lock(mm, pmd, address, ptlp))
 
-#define pte_alloc_kernel(pmd, address)			\
-	((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
+#define pte_alloc_kernel(pmd, address, gfp)			\
+	((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd, gfp))? \
 		NULL: pte_offset_kernel(pmd, address))
 
 #if USE_SPLIT_PMD_PTLOCKS
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 398e9c95cd..11788d5ba3 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -135,7 +135,7 @@ extern struct vm_struct *__get_vm_area_caller(unsigned long size,
 extern struct vm_struct *remove_vm_area(const void *addr);
 extern struct vm_struct *find_vm_area(const void *addr);
 
-extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
+extern int map_vm_area(struct vm_struct *area, gfp_t gfp, pgprot_t prot,
 			struct page **pages);
 #ifdef CONFIG_MMU
 extern int map_kernel_range_noflush(unsigned long start, unsigned long size,
diff --git a/lib/ioremap.c b/lib/ioremap.c
index 0632136855..a4e21ef50c 100644
--- a/lib/ioremap.c
+++ b/lib/ioremap.c
@@ -65,7 +65,7 @@ static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
 	u64 pfn;
 
 	pfn = phys_addr >> PAGE_SHIFT;
-	pte = pte_alloc_kernel(pmd, addr);
+	pte = pte_alloc_kernel(pmd, addr, GFP_KERNEL);
 	if (!pte)
 		return -ENOMEM;
 	do {
@@ -101,7 +101,7 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
 	pmd_t *pmd;
 	unsigned long next;
 
-	pmd = pmd_alloc(&init_mm, pud, addr);
+	pmd = pmd_alloc(&init_mm, pud, addr, GFP_KERNEL);
 	if (!pmd)
 		return -ENOMEM;
 	do {
@@ -141,7 +141,7 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
 	pud_t *pud;
 	unsigned long next;
 
-	pud = pud_alloc(&init_mm, p4d, addr);
+	pud = pud_alloc(&init_mm, p4d, addr, GFP_KERNEL);
 	if (!pud)
 		return -ENOMEM;
 	do {
@@ -181,7 +181,7 @@ static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
 	p4d_t *p4d;
 	unsigned long next;
 
-	p4d = p4d_alloc(&init_mm, pgd, addr);
+	p4d = p4d_alloc(&init_mm, pgd, addr, GFP_KERNEL);
 	if (!p4d)
 		return -ENOMEM;
 	do {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6cdc7b2d91..245d4a2585 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4683,7 +4683,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	spinlock_t *ptl;
 
 	if (!vma_shareable(vma, addr))
-		return (pte_t *)pmd_alloc(mm, pud, addr);
+		return (pte_t *)pmd_alloc(mm, pud, addr, GFP_KERNEL);
 
 	i_mmap_lock_write(mapping);
 	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
@@ -4714,7 +4714,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	}
 	spin_unlock(ptl);
 out:
-	pte = (pte_t *)pmd_alloc(mm, pud, addr);
+	pte = (pte_t *)pmd_alloc(mm, pud, addr, GFP_KERNEL);
 	i_mmap_unlock_write(mapping);
 	return pte;
 }
@@ -4776,10 +4776,10 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 	pte_t *pte = NULL;
 
 	pgd = pgd_offset(mm, addr);
-	p4d = p4d_alloc(mm, pgd, addr);
+	p4d = p4d_alloc(mm, pgd, addr, GFP_KERNEL);
 	if (!p4d)
 		return NULL;
-	pud = pud_alloc(mm, p4d, addr);
+	pud = pud_alloc(mm, p4d, addr, GFP_KERNEL);
 	if (pud) {
 		if (sz == PUD_SIZE) {
 			pte = (pte_t *)pud;
@@ -4788,7 +4788,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 			if (want_pmd_share() && pud_none(*pud))
 				pte = huge_pmd_share(mm, addr, pud);
 			else
-				pte = (pte_t *)pmd_alloc(mm, pud, addr);
+				pte = (pte_t *)pmd_alloc(mm, pud, addr,
+							 GFP_KERNEL);
 		}
 	}
 	BUG_ON(pte && pte_present(*pte) && !pte_huge(*pte));
diff --git a/mm/kasan/init.c b/mm/kasan/init.c
index ce45c491eb..3ed63dcb7a 100644
--- a/mm/kasan/init.c
+++ b/mm/kasan/init.c
@@ -129,7 +129,7 @@ static int __ref zero_pmd_populate(pud_t *pud, unsigned long addr,
 			pte_t *p;
 
 			if (slab_is_available())
-				p = pte_alloc_one_kernel(&init_mm);
+				p = pte_alloc_one_kernel(&init_mm, GFP_KERNEL);
 			else
 				p = early_alloc(PAGE_SIZE, NUMA_NO_NODE);
 			if (!p)
@@ -166,7 +166,7 @@ static int __ref zero_pud_populate(p4d_t *p4d, unsigned long addr,
 			pmd_t *p;
 
 			if (slab_is_available()) {
-				p = pmd_alloc(&init_mm, pud, addr);
+				p = pmd_alloc(&init_mm, pud, addr, GFP_KERNEL);
 				if (!p)
 					return -ENOMEM;
 			} else {
@@ -207,7 +207,7 @@ static int __ref zero_p4d_populate(pgd_t *pgd, unsigned long addr,
 			pud_t *p;
 
 			if (slab_is_available()) {
-				p = pud_alloc(&init_mm, p4d, addr);
+				p = pud_alloc(&init_mm, p4d, addr, GFP_KERNEL);
 				if (!p)
 					return -ENOMEM;
 			} else {
@@ -280,7 +280,7 @@ int __ref kasan_populate_early_shadow(const void *shadow_start,
 			p4d_t *p;
 
 			if (slab_is_available()) {
-				p = p4d_alloc(&init_mm, pgd, addr);
+				p = p4d_alloc(&init_mm, pgd, addr, GFP_KERNEL);
 				if (!p)
 					return -ENOMEM;
 			} else {
diff --git a/mm/memory.c b/mm/memory.c
index ab650c21bc..f599cdd1bc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -435,9 +435,9 @@ int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
 	return 0;
 }
 
-int __pte_alloc_kernel(pmd_t *pmd)
+int __pte_alloc_kernel(pmd_t *pmd, gfp_t gfp)
 {
-	pte_t *new = pte_alloc_one_kernel(&init_mm);
+	pte_t *new = pte_alloc_one_kernel(&init_mm, gfp);
 	if (!new)
 		return -ENOMEM;
 
@@ -884,7 +884,7 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
 	pmd_t *src_pmd, *dst_pmd;
 	unsigned long next;
 
-	dst_pmd = pmd_alloc(dst_mm, dst_pud, addr);
+	dst_pmd = pmd_alloc(dst_mm, dst_pud, addr, GFP_KERNEL);
 	if (!dst_pmd)
 		return -ENOMEM;
 	src_pmd = pmd_offset(src_pud, addr);
@@ -918,7 +918,7 @@ static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src
 	pud_t *src_pud, *dst_pud;
 	unsigned long next;
 
-	dst_pud = pud_alloc(dst_mm, dst_p4d, addr);
+	dst_pud = pud_alloc(dst_mm, dst_p4d, addr, GFP_KERNEL);
 	if (!dst_pud)
 		return -ENOMEM;
 	src_pud = pud_offset(src_p4d, addr);
@@ -952,7 +952,7 @@ static inline int copy_p4d_range(struct mm_struct *dst_mm, struct mm_struct *src
 	p4d_t *src_p4d, *dst_p4d;
 	unsigned long next;
 
-	dst_p4d = p4d_alloc(dst_mm, dst_pgd, addr);
+	dst_p4d = p4d_alloc(dst_mm, dst_pgd, addr, GFP_KERNEL);
 	if (!dst_p4d)
 		return -ENOMEM;
 	src_p4d = p4d_offset(src_pgd, addr);
@@ -1422,13 +1422,13 @@ pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
 	pmd_t *pmd;
 
 	pgd = pgd_offset(mm, addr);
-	p4d = p4d_alloc(mm, pgd, addr);
+	p4d = p4d_alloc(mm, pgd, addr, GFP_KERNEL);
 	if (!p4d)
 		return NULL;
-	pud = pud_alloc(mm, p4d, addr);
+	pud = pud_alloc(mm, p4d, addr, GFP_KERNEL);
 	if (!pud)
 		return NULL;
-	pmd = pmd_alloc(mm, pud, addr);
+	pmd = pmd_alloc(mm, pud, addr, GFP_KERNEL);
 	if (!pmd)
 		return NULL;
 
@@ -1768,7 +1768,7 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
 	int err;
 
 	pfn -= addr >> PAGE_SHIFT;
-	pmd = pmd_alloc(mm, pud, addr);
+	pmd = pmd_alloc(mm, pud, addr, GFP_KERNEL);
 	if (!pmd)
 		return -ENOMEM;
 	VM_BUG_ON(pmd_trans_huge(*pmd));
@@ -1791,7 +1791,7 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
 	int err;
 
 	pfn -= addr >> PAGE_SHIFT;
-	pud = pud_alloc(mm, p4d, addr);
+	pud = pud_alloc(mm, p4d, addr, GFP_KERNEL);
 	if (!pud)
 		return -ENOMEM;
 	do {
@@ -1813,7 +1813,7 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 	int err;
 
 	pfn -= addr >> PAGE_SHIFT;
-	p4d = p4d_alloc(mm, pgd, addr);
+	p4d = p4d_alloc(mm, pgd, addr, GFP_KERNEL);
 	if (!p4d)
 		return -ENOMEM;
 	do {
@@ -1956,7 +1956,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 	spinlock_t *uninitialized_var(ptl);
 
 	pte = (mm == &init_mm) ?
-		pte_alloc_kernel(pmd, addr) :
+		pte_alloc_kernel(pmd, addr, GFP_KERNEL) :
 		pte_alloc_map_lock(mm, pmd, addr, &ptl);
 	if (!pte)
 		return -ENOMEM;
@@ -1990,7 +1990,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
 
 	BUG_ON(pud_huge(*pud));
 
-	pmd = pmd_alloc(mm, pud, addr);
+	pmd = pmd_alloc(mm, pud, addr, GFP_KERNEL);
 	if (!pmd)
 		return -ENOMEM;
 	do {
@@ -2010,7 +2010,7 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
 	unsigned long next;
 	int err;
 
-	pud = pud_alloc(mm, p4d, addr);
+	pud = pud_alloc(mm, p4d, addr, GFP_KERNEL);
 	if (!pud)
 		return -ENOMEM;
 	do {
@@ -2030,7 +2030,7 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 	unsigned long next;
 	int err;
 
-	p4d = p4d_alloc(mm, pgd, addr);
+	p4d = p4d_alloc(mm, pgd, addr, GFP_KERNEL);
 	if (!p4d)
 		return -ENOMEM;
 	do {
@@ -3868,11 +3868,11 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 	vm_fault_t ret;
 
 	pgd = pgd_offset(mm, address);
-	p4d = p4d_alloc(mm, pgd, address);
+	p4d = p4d_alloc(mm, pgd, address, GFP_KERNEL);
 	if (!p4d)
 		return VM_FAULT_OOM;
 
-	vmf.pud = pud_alloc(mm, p4d, address);
+	vmf.pud = pud_alloc(mm, p4d, address, GFP_KERNEL);
 	if (!vmf.pud)
 		return VM_FAULT_OOM;
 	if (pud_none(*vmf.pud) && __transparent_hugepage_enabled(vma)) {
@@ -3898,7 +3898,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		}
 	}
 
-	vmf.pmd = pmd_alloc(mm, vmf.pud, address);
+	vmf.pmd = pmd_alloc(mm, vmf.pud, address, GFP_KERNEL);
 	if (!vmf.pmd)
 		return VM_FAULT_OOM;
 	if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) {
@@ -3991,9 +3991,10 @@ EXPORT_SYMBOL_GPL(handle_mm_fault);
  * Allocate p4d page table.
  * We've already handled the fast-path in-line.
  */
-int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
+int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address,
+		gfp_t gfp)
 {
-	p4d_t *new = p4d_alloc_one(mm, address);
+	p4d_t *new = p4d_alloc_one(mm, address, gfp);
 	if (!new)
 		return -ENOMEM;
 
@@ -4014,9 +4015,10 @@ int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
  * Allocate page upper directory.
  * We've already handled the fast-path in-line.
  */
-int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address)
+int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address,
+		gfp_t gfp)
 {
-	pud_t *new = pud_alloc_one(mm, address);
+	pud_t *new = pud_alloc_one(mm, address, gfp);
 	if (!new)
 		return -ENOMEM;
 
@@ -4046,10 +4048,11 @@ int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address)
  * Allocate page middle directory.
  * We've already handled the fast-path in-line.
  */
-int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
+int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address,
+		gfp_t gfp)
 {
 	spinlock_t *ptl;
-	pmd_t *new = pmd_alloc_one(mm, address);
+	pmd_t *new = pmd_alloc_one(mm, address, gfp);
 	if (!new)
 		return -ENOMEM;
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 663a544936..917ff0b3f7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2616,13 +2616,13 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 		goto abort;
 
 	pgdp = pgd_offset(mm, addr);
-	p4dp = p4d_alloc(mm, pgdp, addr);
+	p4dp = p4d_alloc(mm, pgdp, addr, GFP_KERNEL);
 	if (!p4dp)
 		goto abort;
-	pudp = pud_alloc(mm, p4dp, addr);
+	pudp = pud_alloc(mm, p4dp, addr, GFP_KERNEL);
 	if (!pudp)
 		goto abort;
-	pmdp = pmd_alloc(mm, pudp, addr);
+	pmdp = pmd_alloc(mm, pudp, addr, GFP_KERNEL);
 	if (!pmdp)
 		goto abort;
 
diff --git a/mm/mremap.c b/mm/mremap.c
index e3edef6b7a..b1f9605fad 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -65,14 +65,14 @@ static pmd_t *alloc_new_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 	pmd_t *pmd;
 
 	pgd = pgd_offset(mm, addr);
-	p4d = p4d_alloc(mm, pgd, addr);
+	p4d = p4d_alloc(mm, pgd, addr, GFP_KERNEL);
 	if (!p4d)
 		return NULL;
-	pud = pud_alloc(mm, p4d, addr);
+	pud = pud_alloc(mm, p4d, addr, GFP_KERNEL);
 	if (!pud)
 		return NULL;
 
-	pmd = pmd_alloc(mm, pud, addr);
+	pmd = pmd_alloc(mm, pud, addr, GFP_KERNEL);
 	if (!pmd)
 		return NULL;
 
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index d59b5a73df..9bb9d44834 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -153,10 +153,10 @@ static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
 	pud_t *pud;
 
 	pgd = pgd_offset(mm, address);
-	p4d = p4d_alloc(mm, pgd, address);
+	p4d = p4d_alloc(mm, pgd, address, GFP_KERNEL);
 	if (!p4d)
 		return NULL;
-	pud = pud_alloc(mm, p4d, address);
+	pud = pud_alloc(mm, p4d, address, GFP_KERNEL);
 	if (!pud)
 		return NULL;
 	/*
@@ -164,7 +164,7 @@ static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
 	 * missing, the *pmd may be already established and in
 	 * turn it may also be a trans_huge_pmd.
 	 */
-	return pmd_alloc(mm, pud, address);
+	return pmd_alloc(mm, pud, address, GFP_KERNEL);
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index e86ba6e74b..288d078ab6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -132,7 +132,8 @@ static void vunmap_page_range(unsigned long addr, unsigned long end)
 }
 
 static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
-		unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+		unsigned long end, gfp_t gfp, pgprot_t prot,
+		struct page **pages, int *nr)
 {
 	pte_t *pte;
 
@@ -141,7 +142,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
 	 * callers keep track of where we're up to.
 	 */
 
-	pte = pte_alloc_kernel(pmd, addr);
+	pte = pte_alloc_kernel(pmd, addr, gfp);
 	if (!pte)
 		return -ENOMEM;
 	do {
@@ -158,51 +159,54 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
 }
 
 static int vmap_pmd_range(pud_t *pud, unsigned long addr,
-		unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+		unsigned long end, gfp_t gfp, pgprot_t prot,
+		struct page **pages, int *nr)
 {
 	pmd_t *pmd;
 	unsigned long next;
 
-	pmd = pmd_alloc(&init_mm, pud, addr);
+	pmd = pmd_alloc(&init_mm, pud, addr, gfp);
 	if (!pmd)
 		return -ENOMEM;
 	do {
 		next = pmd_addr_end(addr, end);
-		if (vmap_pte_range(pmd, addr, next, prot, pages, nr))
+		if (vmap_pte_range(pmd, addr, next, gfp, prot, pages, nr))
 			return -ENOMEM;
 	} while (pmd++, addr = next, addr != end);
 	return 0;
 }
 
 static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
-		unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+		unsigned long end, gfp_t gfp, pgprot_t prot,
+		struct page **pages, int *nr)
 {
 	pud_t *pud;
 	unsigned long next;
 
-	pud = pud_alloc(&init_mm, p4d, addr);
+	pud = pud_alloc(&init_mm, p4d, addr, gfp);
 	if (!pud)
 		return -ENOMEM;
 	do {
 		next = pud_addr_end(addr, end);
-		if (vmap_pmd_range(pud, addr, next, prot, pages, nr))
+		if (vmap_pmd_range(pud, addr, next, gfp, prot, pages, nr))
 			return -ENOMEM;
 	} while (pud++, addr = next, addr != end);
 	return 0;
 }
 
 static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
-		unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+		unsigned long end, gfp_t gfp, pgprot_t prot,
+		struct page **pages, int *nr)
 {
 	p4d_t *p4d;
 	unsigned long next;
 
-	p4d = p4d_alloc(&init_mm, pgd, addr);
+	p4d = p4d_alloc(&init_mm, pgd, addr, gfp);
 	if (!p4d)
 		return -ENOMEM;
 	do {
 		next = p4d_addr_end(addr, end);
-		if (vmap_pud_range(p4d, addr, next, prot, pages, nr))
+		if (vmap_pud_range(p4d, addr, next, gfp, prot, pages, nr))
 			return -ENOMEM;
 	} while (p4d++, addr = next, addr != end);
 	return 0;
@@ -215,7 +219,8 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
  * Ie. pte at addr+N*PAGE_SIZE shall point to pfn corresponding to pages[N]
  */
 static int vmap_page_range_noflush(unsigned long start, unsigned long end,
-				   pgprot_t prot, struct page **pages)
+				   gfp_t gfp, pgprot_t prot,
+				   struct page **pages)
 {
 	pgd_t *pgd;
 	unsigned long next;
@@ -227,7 +232,7 @@ static int vmap_page_range_noflush(unsigned long start, unsigned long end,
 	pgd = pgd_offset_k(addr);
 	do {
 		next = pgd_addr_end(addr, end);
-		err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr);
+		err = vmap_p4d_range(pgd, addr, next, gfp, prot, pages, &nr);
 		if (err)
 			return err;
 	} while (pgd++, addr = next, addr != end);
@@ -236,11 +241,11 @@ static int vmap_page_range_noflush(unsigned long start, unsigned long end,
 }
 
 static int vmap_page_range(unsigned long start, unsigned long end,
-			   pgprot_t prot, struct page **pages)
+			   gfp_t gfp, pgprot_t prot, struct page **pages)
 {
 	int ret;
 
-	ret = vmap_page_range_noflush(start, end, prot, pages);
+	ret = vmap_page_range_noflush(start, end, gfp, prot, pages);
 	flush_cache_vmap(start, end);
 	return ret;
 }
@@ -1182,7 +1187,7 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t pro
 		addr = va->va_start;
 		mem = (void *)addr;
 	}
-	if (vmap_page_range(addr, addr + size, prot, pages) < 0) {
+	if (vmap_page_range(addr, addr + size, GFP_KERNEL, prot, pages) < 0) {
 		vm_unmap_ram(mem, count);
 		return NULL;
 	}
@@ -1298,7 +1303,8 @@ void __init vmalloc_init(void)
 int map_kernel_range_noflush(unsigned long addr, unsigned long size,
 			     pgprot_t prot, struct page **pages)
 {
-	return vmap_page_range_noflush(addr, addr + size, prot, pages);
+	return vmap_page_range_noflush(addr, addr + size, GFP_KERNEL, prot,
+				       pages);
 }
 
 /**
@@ -1339,13 +1345,14 @@ void unmap_kernel_range(unsigned long addr, unsigned long size)
 }
 EXPORT_SYMBOL_GPL(unmap_kernel_range);
 
-int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page **pages)
+int map_vm_area(struct vm_struct *area, gfp_t gfp,
+		pgprot_t prot, struct page **pages)
 {
 	unsigned long addr = (unsigned long)area->addr;
 	unsigned long end = addr + get_vm_area_size(area);
 	int err;
 
-	err = vmap_page_range(addr, end, prot, pages);
+	err = vmap_page_range(addr, end, gfp, prot, pages);
 
 	return err > 0 ? 0 : err;
 }
@@ -1661,7 +1668,7 @@ void *vmap(struct page **pages, unsigned int count,
 	if (!area)
 		return NULL;
 
-	if (map_vm_area(area, prot, pages)) {
+	if (map_vm_area(area, GFP_KERNEL, prot, pages)) {
 		vunmap(area->addr);
 		return NULL;
 	}
@@ -1720,7 +1727,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 			cond_resched();
 	}
 
-	if (map_vm_area(area, prot, pages))
+	if (map_vm_area(area, gfp_mask, prot, pages))
 		goto fail;
 	return area->addr;
 
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 0787d33b80..d369e5bf27 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1151,7 +1151,7 @@ static inline void __zs_cpu_down(struct mapping_area *area)
 static inline void *__zs_map_object(struct mapping_area *area,
 				struct page *pages[2], int off, int size)
 {
-	BUG_ON(map_vm_area(area->vm, PAGE_KERNEL, pages));
+	BUG_ON(map_vm_area(area->vm, GFP_KERNEL, PAGE_KERNEL, pages));
 	area->vm_addr = area->vm->addr;
 	return area->vm_addr + off;
 }
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index a39dcfdbcc..0829eefb61 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -645,7 +645,7 @@ static int create_hyp_pmd_mappings(pud_t *pud, unsigned long start,
 		BUG_ON(pmd_sect(*pmd));
 
 		if (pmd_none(*pmd)) {
-			pte = pte_alloc_one_kernel(NULL);
+			pte = pte_alloc_one_kernel(NULL, GFP_KERNEL);
 			if (!pte) {
 				kvm_err("Cannot allocate Hyp pte\n");
 				return -ENOMEM;
@@ -677,7 +677,7 @@ static int create_hyp_pud_mappings(pgd_t *pgd, unsigned long start,
 		pud = pud_offset(pgd, addr);
 
 		if (pud_none_or_clear_bad(pud)) {
-			pmd = pmd_alloc_one(NULL, addr);
+			pmd = pmd_alloc_one(NULL, addr, GFP_KERNEL);
 			if (!pmd) {
 				kvm_err("Cannot allocate Hyp pmd\n");
 				return -ENOMEM;
@@ -712,7 +712,7 @@ static int __create_hyp_mappings(pgd_t *pgdp, unsigned long ptrs_per_pgd,
 		pgd = pgdp + kvm_pgd_index(addr, ptrs_per_pgd);
 
 		if (pgd_none(*pgd)) {
-			pud = pud_alloc_one(NULL, addr);
+			pud = pud_alloc_one(NULL, addr, GFP_KERNEL);
 			if (!pud) {
 				kvm_err("Cannot allocate Hyp pud\n");
 				err = -ENOMEM;
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread
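
To illustrate what the new parameter buys a filesystem: with gfp_t plumbed through map_vm_area(), pte_alloc_kernel() and the pmd/pud/p4d allocators, the page tables backing a vmalloc mapping are allocated with the caller's flags rather than unconditionally with GFP_KERNEL. A minimal sketch follows; it is not taken from the patch, and the helper name is invented.

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>

/*
 * Hypothetical helper (not from the patch): allocate a large, zeroed
 * buffer from a filesystem IO path without reclaim recursing back
 * into the filesystem. With this series the flags also reach the
 * page table allocations made while mapping the area, not just the
 * data pages.
 */
static void *fs_vmalloc_nofs(unsigned long bytes)
{
	return __vmalloc(bytes, GFP_NOFS | __GFP_ZERO, PAGE_KERNEL);
}

If changing all of these signatures proves too invasive, wrapping the allocation in memalloc_nofs_save()/memalloc_nofs_restore() is the usual way to get NOFS semantics without passing gfp_t around; that option comes up later in the thread.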

* [PATCH 08/12] block: Add some exports for bcachefs
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
                   ` (6 preceding siblings ...)
  2019-06-10 19:14 ` [PATCH 07/12] Propagate gfp_t when allocating pte entries from __vmalloc Kent Overstreet
@ 2019-06-10 19:14 ` Kent Overstreet
  2019-06-10 19:14 ` [PATCH 09/12] bcache: optimize continue_at_nobarrier() Kent Overstreet
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Kent Overstreet @ 2019-06-10 19:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcache; +Cc: Kent Overstreet

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 block/bio.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 716510ecd7..a67aa6e0de 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -958,6 +958,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
 
 static void submit_bio_wait_endio(struct bio *bio)
 {
@@ -1658,6 +1659,7 @@ void bio_set_pages_dirty(struct bio *bio)
 			set_page_dirty_lock(bvec->bv_page);
 	}
 }
+EXPORT_SYMBOL_GPL(bio_set_pages_dirty);
 
 static void bio_release_pages(struct bio *bio)
 {
@@ -1731,6 +1733,7 @@ void bio_check_pages_dirty(struct bio *bio)
 	spin_unlock_irqrestore(&bio_dirty_lock, flags);
 	schedule_work(&bio_dirty_work);
 }
+EXPORT_SYMBOL_GPL(bio_check_pages_dirty);
 
 void update_io_ticks(struct hd_struct *part, unsigned long now)
 {
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread
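
The three symbols exported here are enough for a modular filesystem to run its own direct-IO read path on top of the generic bio code. Below is a hedged sketch of that pattern; the function names and the bio setup are invented for illustration and are not from the bcachefs tree.

#include <linux/bio.h>
#include <linux/uio.h>

/*
 * Completion handler: re-dirty and release the pinned user pages (and
 * ultimately the bio) now that the read data has landed in them.
 */
static void dio_read_endio(struct bio *bio)
{
	bio_check_pages_dirty(bio);
}

/*
 * Pin the user pages described by @iter into @bio and submit the read.
 */
static int dio_read_submit(struct bio *bio, struct iov_iter *iter)
{
	int ret = bio_iov_iter_get_pages(bio, iter);

	if (ret)
		return ret;

	bio_set_pages_dirty(bio);	/* the read will write into these pages */
	bio->bi_end_io = dio_read_endio;
	submit_bio(bio);
	return 0;
}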

* [PATCH 09/12] bcache: optimize continue_at_nobarrier()
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
                   ` (7 preceding siblings ...)
  2019-06-10 19:14 ` [PATCH 08/12] block: Add some exports for bcachefs Kent Overstreet
@ 2019-06-10 19:14 ` Kent Overstreet
  2019-06-10 19:14 ` [PATCH 10/12] bcache: move closures to lib/ Kent Overstreet
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 42+ messages in thread
From: Kent Overstreet @ 2019-06-10 19:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcache; +Cc: Kent Overstreet

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 drivers/md/bcache/closure.h | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/closure.h b/drivers/md/bcache/closure.h
index c88cdc4ae4..376c5e659c 100644
--- a/drivers/md/bcache/closure.h
+++ b/drivers/md/bcache/closure.h
@@ -245,7 +245,7 @@ static inline void closure_queue(struct closure *cl)
 		     != offsetof(struct work_struct, func));
 	if (wq) {
 		INIT_WORK(&cl->work, cl->work.func);
-		BUG_ON(!queue_work(wq, &cl->work));
+		queue_work(wq, &cl->work);
 	} else
 		cl->fn(cl);
 }
@@ -340,8 +340,13 @@ do {									\
  */
 #define continue_at_nobarrier(_cl, _fn, _wq)				\
 do {									\
-	set_closure_fn(_cl, _fn, _wq);					\
-	closure_queue(_cl);						\
+	closure_set_ip(_cl);						\
+	if (_wq) {							\
+		INIT_WORK(&(_cl)->work, (void *) _fn);			\
+		queue_work((_wq), &(_cl)->work);			\
+	} else {							\
+		(_fn)(_cl);						\
+	}								\
 } while (0)
 
 /**
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread
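
A hedged usage sketch of the reworked macro (function names invented, not from the bcache tree): continue_at_nobarrier() either queues the rest of the operation on the given workqueue or, when the workqueue is NULL, runs it inline.

#include <linux/closure.h>
#include <linux/workqueue.h>

static void write_done(struct closure *cl)
{
	/* ... finish the operation in workqueue context ... */
	closure_return(cl);
}

static void write_submitted(struct closure *cl)
{
	/*
	 * Hand the rest of the operation to system_wq; with a NULL
	 * workqueue write_done() would be called directly instead.
	 */
	continue_at_nobarrier(cl, write_done, system_wq);
}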

* [PATCH 10/12] bcache: move closures to lib/
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
                   ` (8 preceding siblings ...)
  2019-06-10 19:14 ` [PATCH 09/12] bcache: optimize continue_at_nobarrier() Kent Overstreet
@ 2019-06-10 19:14 ` Kent Overstreet
  2019-06-11 10:25   ` Coly Li
  2019-06-13  7:28   ` Christoph Hellwig
  2019-06-10 19:14 ` [PATCH 11/12] closures: closure_wait_event() Kent Overstreet
                   ` (4 subsequent siblings)
  14 siblings, 2 replies; 42+ messages in thread
From: Kent Overstreet @ 2019-06-10 19:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcache; +Cc: Kent Overstreet

Prep work for bcachefs - being a fork of bcache, it also uses closures.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 drivers/md/bcache/Kconfig                     | 10 +------
 drivers/md/bcache/Makefile                    |  6 ++--
 drivers/md/bcache/bcache.h                    |  2 +-
 drivers/md/bcache/super.c                     |  1 -
 drivers/md/bcache/util.h                      |  3 +-
 .../md/bcache => include/linux}/closure.h     | 17 ++++++-----
 lib/Kconfig                                   |  3 ++
 lib/Kconfig.debug                             |  9 ++++++
 lib/Makefile                                  |  2 ++
 {drivers/md/bcache => lib}/closure.c          | 28 ++++++-------------
 10 files changed, 37 insertions(+), 44 deletions(-)
 rename {drivers/md/bcache => include/linux}/closure.h (97%)
 rename {drivers/md/bcache => lib}/closure.c (89%)

diff --git a/drivers/md/bcache/Kconfig b/drivers/md/bcache/Kconfig
index f6e0a8b3a6..3dd1d48987 100644
--- a/drivers/md/bcache/Kconfig
+++ b/drivers/md/bcache/Kconfig
@@ -2,6 +2,7 @@
 config BCACHE
 	tristate "Block device as cache"
 	select CRC64
+	select CLOSURES
 	help
 	Allows a block device to be used as cache for other devices; uses
 	a btree for indexing and the layout is optimized for SSDs.
@@ -16,12 +17,3 @@ config BCACHE_DEBUG
 
 	Enables extra debugging tools, allows expensive runtime checks to be
 	turned on.
-
-config BCACHE_CLOSURES_DEBUG
-	bool "Debug closures"
-	depends on BCACHE
-	select DEBUG_FS
-	help
-	Keeps all active closures in a linked list and provides a debugfs
-	interface to list them, which makes it possible to see asynchronous
-	operations that get stuck.
diff --git a/drivers/md/bcache/Makefile b/drivers/md/bcache/Makefile
index d26b351958..2b790fb813 100644
--- a/drivers/md/bcache/Makefile
+++ b/drivers/md/bcache/Makefile
@@ -2,8 +2,8 @@
 
 obj-$(CONFIG_BCACHE)	+= bcache.o
 
-bcache-y		:= alloc.o bset.o btree.o closure.o debug.o extents.o\
-	io.o journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o\
-	util.o writeback.o
+bcache-y		:= alloc.o bset.o btree.o debug.o extents.o io.o\
+	journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o util.o\
+	writeback.o
 
 CFLAGS_request.o	+= -Iblock
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index fdf75352e1..ced9f1526c 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -180,6 +180,7 @@
 
 #include <linux/bcache.h>
 #include <linux/bio.h>
+#include <linux/closure.h>
 #include <linux/kobject.h>
 #include <linux/list.h>
 #include <linux/mutex.h>
@@ -192,7 +193,6 @@
 
 #include "bset.h"
 #include "util.h"
-#include "closure.h"
 
 struct bucket {
 	atomic_t	pin;
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index a697a3a923..da6803f280 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2487,7 +2487,6 @@ static int __init bcache_init(void)
 		goto err;
 
 	bch_debug_init();
-	closure_debug_init();
 
 	return 0;
 err:
diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
index 00aab6abcf..8a75100c0b 100644
--- a/drivers/md/bcache/util.h
+++ b/drivers/md/bcache/util.h
@@ -4,6 +4,7 @@
 #define _BCACHE_UTIL_H
 
 #include <linux/blkdev.h>
+#include <linux/closure.h>
 #include <linux/errno.h>
 #include <linux/kernel.h>
 #include <linux/sched/clock.h>
@@ -13,8 +14,6 @@
 #include <linux/workqueue.h>
 #include <linux/crc64.h>
 
-#include "closure.h"
-
 #define PAGE_SECTORS		(PAGE_SIZE / 512)
 
 struct closure;
diff --git a/drivers/md/bcache/closure.h b/include/linux/closure.h
similarity index 97%
rename from drivers/md/bcache/closure.h
rename to include/linux/closure.h
index 376c5e659c..308e38028c 100644
--- a/drivers/md/bcache/closure.h
+++ b/include/linux/closure.h
@@ -155,7 +155,7 @@ struct closure {
 
 	atomic_t		remaining;
 
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 #define CLOSURE_MAGIC_DEAD	0xc054dead
 #define CLOSURE_MAGIC_ALIVE	0xc054a11e
 
@@ -184,15 +184,13 @@ static inline void closure_sync(struct closure *cl)
 		__closure_sync(cl);
 }
 
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 
-void closure_debug_init(void);
 void closure_debug_create(struct closure *cl);
 void closure_debug_destroy(struct closure *cl);
 
 #else
 
-static inline void closure_debug_init(void) {}
 static inline void closure_debug_create(struct closure *cl) {}
 static inline void closure_debug_destroy(struct closure *cl) {}
 
@@ -200,21 +198,21 @@ static inline void closure_debug_destroy(struct closure *cl) {}
 
 static inline void closure_set_ip(struct closure *cl)
 {
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 	cl->ip = _THIS_IP_;
 #endif
 }
 
 static inline void closure_set_ret_ip(struct closure *cl)
 {
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 	cl->ip = _RET_IP_;
 #endif
 }
 
 static inline void closure_set_waiting(struct closure *cl, unsigned long f)
 {
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 	cl->waiting_on = f;
 #endif
 }
@@ -243,6 +241,7 @@ static inline void closure_queue(struct closure *cl)
 	 */
 	BUILD_BUG_ON(offsetof(struct closure, fn)
 		     != offsetof(struct work_struct, func));
+
 	if (wq) {
 		INIT_WORK(&cl->work, cl->work.func);
 		queue_work(wq, &cl->work);
@@ -255,7 +254,7 @@ static inline void closure_queue(struct closure *cl)
  */
 static inline void closure_get(struct closure *cl)
 {
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 	BUG_ON((atomic_inc_return(&cl->remaining) &
 		CLOSURE_REMAINING_MASK) <= 1);
 #else
@@ -271,7 +270,7 @@ static inline void closure_get(struct closure *cl)
  */
 static inline void closure_init(struct closure *cl, struct closure *parent)
 {
-	memset(cl, 0, sizeof(struct closure));
+	cl->fn = NULL;
 	cl->parent = parent;
 	if (parent)
 		closure_get(parent);
diff --git a/lib/Kconfig b/lib/Kconfig
index a9e56539bd..09a25af0d0 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -427,6 +427,9 @@ config ASSOCIATIVE_ARRAY
 
 	  for more information.
 
+config CLOSURES
+	bool
+
 config HAS_IOMEM
 	bool
 	depends on !NO_IOMEM
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index d5a4a4036d..6d97985e7e 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1397,6 +1397,15 @@ config DEBUG_CREDENTIALS
 
 source "kernel/rcu/Kconfig.debug"
 
+config DEBUG_CLOSURES
+	bool "Debug closures (bcache async widgits)"
+	depends on CLOSURES
+	select DEBUG_FS
+	help
+	Keeps all active closures in a linked list and provides a debugfs
+	interface to list them, which makes it possible to see asynchronous
+	operations that get stuck.
+
 config DEBUG_WQ_FORCE_RR_CPU
 	bool "Force round-robin CPU selection for unbound work items"
 	depends on DEBUG_KERNEL
diff --git a/lib/Makefile b/lib/Makefile
index 18c2be516a..2003eda127 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -193,6 +193,8 @@ obj-$(CONFIG_ATOMIC64_SELFTEST) += atomic64_test.o
 
 obj-$(CONFIG_CPU_RMAP) += cpu_rmap.o
 
+obj-$(CONFIG_CLOSURES) += closure.o
+
 obj-$(CONFIG_CORDIC) += cordic.o
 
 obj-$(CONFIG_DQL) += dynamic_queue_limits.o
diff --git a/drivers/md/bcache/closure.c b/lib/closure.c
similarity index 89%
rename from drivers/md/bcache/closure.c
rename to lib/closure.c
index 73f5319295..46cfe4c382 100644
--- a/drivers/md/bcache/closure.c
+++ b/lib/closure.c
@@ -6,13 +6,12 @@
  * Copyright 2012 Google, Inc.
  */
 
+#include <linux/closure.h>
 #include <linux/debugfs.h>
-#include <linux/module.h>
+#include <linux/export.h>
 #include <linux/seq_file.h>
 #include <linux/sched/debug.h>
 
-#include "closure.h"
-
 static inline void closure_put_after_sub(struct closure *cl, int flags)
 {
 	int r = flags & CLOSURE_REMAINING_MASK;
@@ -127,7 +126,7 @@ void __sched __closure_sync(struct closure *cl)
 }
 EXPORT_SYMBOL(__closure_sync);
 
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 
 static LIST_HEAD(closure_list);
 static DEFINE_SPINLOCK(closure_list_lock);
@@ -158,8 +157,6 @@ void closure_debug_destroy(struct closure *cl)
 }
 EXPORT_SYMBOL(closure_debug_destroy);
 
-static struct dentry *closure_debug;
-
 static int debug_seq_show(struct seq_file *f, void *data)
 {
 	struct closure *cl;
@@ -182,7 +179,7 @@ static int debug_seq_show(struct seq_file *f, void *data)
 			seq_printf(f, " W %pS\n",
 				   (void *) cl->waiting_on);
 
-		seq_printf(f, "\n");
+		seq_puts(f, "\n");
 	}
 
 	spin_unlock_irq(&closure_list_lock);
@@ -201,18 +198,11 @@ static const struct file_operations debug_ops = {
 	.release	= single_release
 };
 
-void  __init closure_debug_init(void)
+static int __init closure_debug_init(void)
 {
-	if (!IS_ERR_OR_NULL(bcache_debug))
-		/*
-		 * it is unnecessary to check return value of
-		 * debugfs_create_file(), we should not care
-		 * about this.
-		 */
-		closure_debug = debugfs_create_file(
-			"closures", 0400, bcache_debug, NULL, &debug_ops);
+	debugfs_create_file("closures", 0400, NULL, NULL, &debug_ops);
+	return 0;
 }
-#endif
+late_initcall(closure_debug_init)
 
-MODULE_AUTHOR("Kent Overstreet <koverstreet@google.com>");
-MODULE_LICENSE("GPL");
+#endif
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread
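
With closures now built as a library behind CONFIG_CLOSURES, a new user selects CLOSURES from its own Kconfig (as the BCACHE entry above now does) and includes <linux/closure.h>. The following is a hedged sketch of the basic refcount-plus-continuation pattern; the structure and function names are invented for illustration.

#include <linux/closure.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct my_op {
	struct closure	cl;
	/* ... per-operation state ... */
};

static void my_op_done(struct closure *cl)
{
	struct my_op *op = container_of(cl, struct my_op, cl);

	/* every reference has been dropped; safe to tear down */
	closure_debug_destroy(cl);
	kfree(op);
}

static void my_op_start(struct my_op *op)
{
	closure_init(&op->cl, NULL);	/* refcount starts at 1 ("running") */

	closure_get(&op->cl);		/* reference held by the async work */
	/* ... submit IO whose completion calls closure_put(&op->cl) ... */

	/*
	 * Drop the initial reference; my_op_done() runs on system_wq
	 * once the async reference is dropped as well.
	 */
	continue_at(&op->cl, my_op_done, system_wq);
}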

* [PATCH 11/12] closures: closure_wait_event()
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
                   ` (9 preceding siblings ...)
  2019-06-10 19:14 ` [PATCH 10/12] bcache: move closures to lib/ Kent Overstreet
@ 2019-06-10 19:14 ` Kent Overstreet
  2019-06-11 10:25   ` Coly Li
  2019-06-12 17:17   ` Greg KH
  2019-06-10 19:14 ` [PATCH 12/12] closures: fix a race on wakeup from closure_sync Kent Overstreet
                   ` (3 subsequent siblings)
  14 siblings, 2 replies; 42+ messages in thread
From: Kent Overstreet @ 2019-06-10 19:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcache; +Cc: Kent Overstreet

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 include/linux/closure.h | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/include/linux/closure.h b/include/linux/closure.h
index 308e38028c..abacb91c35 100644
--- a/include/linux/closure.h
+++ b/include/linux/closure.h
@@ -379,4 +379,26 @@ static inline void closure_call(struct closure *cl, closure_fn fn,
 	continue_at_nobarrier(cl, fn, wq);
 }
 
+#define __closure_wait_event(waitlist, _cond)				\
+do {									\
+	struct closure cl;						\
+									\
+	closure_init_stack(&cl);					\
+									\
+	while (1) {							\
+		closure_wait(waitlist, &cl);				\
+		if (_cond)						\
+			break;						\
+		closure_sync(&cl);					\
+	}								\
+	closure_wake_up(waitlist);					\
+	closure_sync(&cl);						\
+} while (0)
+
+#define closure_wait_event(waitlist, _cond)				\
+do {									\
+	if (!(_cond))							\
+		__closure_wait_event(waitlist, _cond);			\
+} while (0)
+
 #endif /* _LINUX_CLOSURE_H */
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread
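
A hedged usage sketch (the device structure, waitlist and counter are invented): closure_wait_event() puts the caller to sleep on a closure_waitlist until the condition is true, re-checking it after every wakeup. Note that the macro adds itself to the waitlist before testing the condition, which closes the window where a wakeup could slip in between the test and the wait.

#include <linux/atomic.h>
#include <linux/closure.h>

struct my_dev {
	struct closure_waitlist	write_wait;
	atomic_t		writes_in_flight;
};

/* Sleep until every write issued against @d has completed. */
static void my_dev_quiesce(struct my_dev *d)
{
	closure_wait_event(&d->write_wait,
			   !atomic_read(&d->writes_in_flight));
}

/* Completion side: drop the in-flight count and wake any waiters. */
static void my_dev_write_done(struct my_dev *d)
{
	if (atomic_dec_and_test(&d->writes_in_flight))
		closure_wake_up(&d->write_wait);
}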

* [PATCH 12/12] closures: fix a race on wakeup from closure_sync
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
                   ` (10 preceding siblings ...)
  2019-06-10 19:14 ` [PATCH 11/12] closures: closure_wait_event() Kent Overstreet
@ 2019-06-10 19:14 ` Kent Overstreet
  2019-07-16 10:47   ` Coly Li
  2019-06-10 20:46 ` bcachefs status update (it's done cooking; let's get this sucker merged) Linus Torvalds
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 42+ messages in thread
From: Kent Overstreet @ 2019-06-10 19:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcache; +Cc: Kent Overstreet

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 lib/closure.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/lib/closure.c b/lib/closure.c
index 46cfe4c382..3e6366c262 100644
--- a/lib/closure.c
+++ b/lib/closure.c
@@ -104,8 +104,14 @@ struct closure_syncer {
 
 static void closure_sync_fn(struct closure *cl)
 {
-	cl->s->done = 1;
-	wake_up_process(cl->s->task);
+	struct closure_syncer *s = cl->s;
+	struct task_struct *p;
+
+	rcu_read_lock();
+	p = READ_ONCE(s->task);
+	s->done = 1;
+	wake_up_process(p);
+	rcu_read_unlock();
 }
 
 void __sched __closure_sync(struct closure *cl)
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread
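
The ordering here matters because the closure_syncer lives on the waiting task's stack: as soon as the waiter observes ->done it can return and the struct disappears, so the waker has to read the task pointer before publishing ->done, and the RCU read section appears to be there to keep the task_struct valid for wake_up_process() should the waiter return and exit in the meantime. A simplified rendition of the waiter side this pairs with (paraphrased from the same file, trimmed for illustration):

#include <linux/closure.h>
#include <linux/sched.h>

/*
 * Simplified rendition of the waiter in __closure_sync(): the syncer
 * lives on this task's stack, so once ->done is observed here the
 * struct may be gone, which is why closure_sync_fn() above reads
 * ->task before it stores ->done.
 */
static void example_closure_sync(struct closure *cl)
{
	struct closure_syncer s = { .task = current };

	cl->s = &s;
	continue_at(cl, closure_sync_fn, NULL);

	while (1) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (s.done)		/* after this, 's' may vanish */
			break;
		schedule();
	}

	__set_current_state(TASK_RUNNING);
}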

* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
                   ` (11 preceding siblings ...)
  2019-06-10 19:14 ` [PATCH 12/12] closures: fix a race on wakeup from closure_sync Kent Overstreet
@ 2019-06-10 20:46 ` Linus Torvalds
  2019-06-11  1:17   ` Kent Overstreet
  2019-06-11  4:10   ` Dave Chinner
  2019-07-03  5:59 ` Stefan K
  2020-06-10 11:02 ` Stefan K
  14 siblings, 2 replies; 42+ messages in thread
From: Linus Torvalds @ 2019-06-10 20:46 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Linux List Kernel Mailing, linux-fsdevel, linux-bcache,
	Dave Chinner, Darrick J . Wong, Zach Brown, Peter Zijlstra,
	Jens Axboe, Josef Bacik, Alexander Viro, Andrew Morton, Tejun Heo

On Mon, Jun 10, 2019 at 9:14 AM Kent Overstreet
<kent.overstreet@gmail.com> wrote:
>
> So. Here's my bcachefs-for-review branch - this has the minimal set of patches
> outside of fs/bcachefs/. My master branch has some performance optimizations for
> the core buffered IO paths, but those are fairly tricky and invasive so I want
> to hold off on those for now - this branch is intended to be more or less
> suitable for merging as is.

Honestly, it really isn't.

There are obvious things wrong with it - like the fact that you've
rebased it so that the original history is gone, yet you've not
actually *fixed* the history, so you find things like reverts of
commits that should simply have been removed, and fixes for things
that should just have been fixed in the original commit the fix is
for.

And this isn't just "if you rebase, just fix things". You have actual
bogus commit messages as a result of this all.

So to point at the revert, for example. The commit message is

    This reverts commit 36f389604294dfc953e6f5624ceb683818d32f28.

which is wrong to begin with - you should always explain *why* the
revert was done, not just state that it's a revert.

But since you rebased things, that commit 36f3896042 doesn't exist any
more to begin with.  So now you have a revert commit that doesn't
explain why it reverts things, but the only thing it *does* say is
simply wrong and pointless.

Some of the "fixes" commits have the exact same issue - they say things like

    gc lock must be held while invalidating buckets - fixes
    "1f7a95698e bcachefs: Invalidate buckets when writing to alloc btree"

and

    fixes b0f3e786995cb3b12975503f963e469db5a4f09b

both of which are dead and stale git object pointers since the commits
in question got rebased and have some other hash these days.

But note that the cleanup should go further than just fix those kinds
of technical issues. If you rebase, and you have fixes in your tree
for things you rebase, just fix things as you rewrite history anyway
(there are cases where the fix may be informative in itself and it's
worth leaving around, but that's rare).

Anyway, aside from that, I only looked at the non-bcachefs parts. Some
of those are not acceptable either, like

    struct pagecache_lock add_lock
        ____cacheline_aligned_in_smp; /* protects adding new pages */

in 'struct address_space', which is completely bogus, since that
forces not only a potentially huge amount of padding, it also requires
alignment that that struct simply fundamentally does not have, and
_will_ not have.

You can only use ____cacheline_aligned_in_smp for top-level objects,
and honestly, it's almost never a win. That lock shouldn't be so hot.

That lock is somewhat questionable in the first place, and no, we
don't do those hacky recursive things anyway. A recursive lock is
almost always a buggy and mis-designed one.

Why does the regular page lock (at a finer granularity) not suffice?

And no, nobody has ever cared. The dio people just don't care about
page cache anyway. They have their own thing going.

Similarly, no, we're not starting to do vmalloc in non-process context. Stop it.

And the commit comments are very sparse. And not always signed off.

I also get the feeling that the "intent" part of the six-locks could
just be done as a slight extension of the rwsem, where an "intent" is
the same as a write-lock, but without waiting for existing readers,
and then the write-lock part is just the "wait for readers to be
done".

Have you talked to Waiman Long about that?

                    Linus

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-10 20:46 ` bcachefs status update (it's done cooking; let's get this sucker merged) Linus Torvalds
@ 2019-06-11  1:17   ` Kent Overstreet
  2019-06-11  4:33     ` Dave Chinner
  2019-06-11  4:55     ` bcachefs status update (it's done cooking; let's get this sucker merged) Linus Torvalds
  2019-06-11  4:10   ` Dave Chinner
  1 sibling, 2 replies; 42+ messages in thread
From: Kent Overstreet @ 2019-06-11  1:17 UTC (permalink / raw)
  To: Linus Torvalds, Dave Chinner, Waiman Long, Peter Zijlstra
  Cc: Linux List Kernel Mailing, linux-fsdevel, linux-bcache,
	Darrick J . Wong, Zach Brown, Jens Axboe, Josef Bacik,
	Alexander Viro, Andrew Morton, Tejun Heo

On Mon, Jun 10, 2019 at 10:46:35AM -1000, Linus Torvalds wrote:
> On Mon, Jun 10, 2019 at 9:14 AM Kent Overstreet
> <kent.overstreet@gmail.com> wrote:
> >
> > So. Here's my bcachefs-for-review branch - this has the minimal set of patches
> > outside of fs/bcachefs/. My master branch has some performance optimizations for
> > the core buffered IO paths, but those are fairly tricky and invasive so I want
> > to hold off on those for now - this branch is intended to be more or less
> > suitable for merging as is.
> 
> Honestly, it really isn't.

Heh, I suppose that's what review is for :)

> There are obvious things wrong with it - like the fact that you've
> rebased it so that the original history is gone, yet you've not
> actually *fixed* the history, so you find things like reverts of
> commits that should simply have been removed, and fixes for things
> that should just have been fixed in the original commit the fix is
> for.

Yeah, I suppose I have dropped the ball on that lately. 
 
> But note that the cleanup should go further than just fix those kinds
> of technical issues. If you rebase, and you have fixes in your tree
> for things you rebase, just fix things as you rewrite history anyway
> (there are cases where the fix may be informative in itself and it's
> worth leaving around, but that's rare).

Yeah that has historically been my practice, I've just been moving away from
that kind of history editing as bcachefs has been getting more users. Hence the
in-between, worst of both workflows state of the current tree.

But, I can certainly go through and clean things up like that one last time and
make everything bisectable again - I'll go through and write proper commit
messages too. Unless you'd be ok with just squashing most of the history down to
one commit - which would you prefer?

> Anyway, aside from that, I only looked at the non-bcachefs parts. Some
> of those are not acceptable either, like
> 
>     struct pagecache_lock add_lock
>         ____cacheline_aligned_in_smp; /* protects adding new pages */
> 
> in 'struct address_space', which is completely bogus, since that
> forces not only a potentially huge amount of padding, it also requires
> alignment that that struct simply fundamentally does not have, and
> _will_ not have.

Oh, good point.

> You can only use ____cacheline_aligned_in_smp for top-level objects,
> and honestly, it's almost never a win. That lock shouldn't be so hot.
> 
> That lock is somewhat questionable in the first place, and no, we
> don't do those hacky recursive things anyway. A recursive lock is
> almost always a buggy and mis-designed one.

You're preaching to the choir there, I still feel dirty about that code and I'd
love nothing more than for someone else to come along and point out how stupid
I've been with a much better way of doing it. 

> Why does the regular page lock (at a finer granularity) not suffice?

Because the lock needs to prevent pages from being _added_ to the page cache -
to do it with a page granularity lock it'd have to be part of the radix tree, 

> And no, nobody has ever cared. The dio people just don't care about
> page cache anyway. They have their own thing going.

It's not just dio, it's even worse with the various fallocate operations. And
the xfs people care, but IIRC even they don't have locking for pages being
faulted in. This is an issue I've talked to other filesystem people quite a bit
about - especially Dave Chinner, maybe we can get him to weigh in here.

And this inconsistency does result in _real_ bugs. It goes something like this:
 - dio write shoots down the range of the page cache for the file it's writing
   to, using invalidate_inode_pages_range2
 - After the page cache shoot down, but before the write actually happens,
   another process pulls those pages back in to the page cache
 - Now the write happens: if that write was e.g. an allocating write, you're
   going to have page cache state (buffer heads) that say that page doesn't have
   anything on disk backing it, but it actually does because of the dio write.

xfs has additional locking (that the vfs does _not_ do) around both the buffered
and dio IO paths to prevent this happening because of a buffered read pulling
the pages back in, but no one has a solution for pages getting _faulted_ back in
- either because of mmap or gup().

And there are some filesystem people who do know about this race, because at
some point the dio code has been changed to shoot down the page cache _again_
after the write completes. But that doesn't eliminate the race, it just makes it
harder to trigger.

And dio writes actually aren't the worst of it, it's even worse with fallocate
FALLOC_FL_INSERT_RANGE/COLLAPSE_RANGE. Last time I looked at the ext4 fallocate
code, it looked _completely_ broken to me - the code seemed to think it was
using the same mechanism truncate uses for shooting down the page cache and
keeping pages from being readded - but that only works for truncate because it's
changing i_size and shooting down pages above i_size. Fallocate needs to shoot
down pages that are still within i_size, so... yeah...

The recursiveness is needed because otherwise, if you mmap a file, then do a dio
write where you pass the address you mmapped to pwrite(), gup() from the dio
write path will be trying to fault in the exact pages it's blocking from being
added.

A better solution would be for gup() to detect that and return an error, so we
can just fall back to buffered writes. Or just return an error to userspace
because fuck anyone who would actually do that.

But I fear plumbing that through gup() is going to be a hell of a lot uglier
than this patch.

I would really like Dave to weigh in here.

> Similarly, no, we're not starting to do vmalloc in non-process context. Stop it.

I don't want to do vmalloc in non process context - but I do need to call
vmalloc when reading in btree nodes, from the filesystem IO path.

But I just learned today about this new memalloc_nofs_save() thing, so if that
works I'm more than happy to drop that patch.

> And the commit comments are very sparse. And not always signed off.

Yeah, I'll fix that.

> I also get the feeling that the "intent" part of the six-locks could
> just be done as a slight extension of the rwsem, where an "intent" is
> the same as a write-lock, but without waiting for existing readers,
> and then the write-lock part is just the "wait for readers to be
> done".
> 
> Have you talked to Waiman Long about that?

No, I haven't, but I'm adding him to the list.

I really hate the idea of adding these sorts of special case features to the
core locking primitives though - I mean, look what's happened to the mutex code,
and the intent state isn't the only special feature they have. As is, they're
small and clean and they do their job well, I'd really prefer to have them just
remain their own thing instead of trying to cram it all into the the hyper
optimized rw semaphore code.

Also, six locks used to be in fs/bcachefs/, but last time I was mailing stuff
out for review Peter Zijlstra was dead set against exporting the osq lock stuff
- moving six locks to kernel/locking/ was actually his idea. 

I can say more about six locks tomorrow when I'm less sleep deprived, if you're
still not convinced.

Cheers.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-10 20:46 ` bcachefs status update (it's done cooking; let's get this sucker merged) Linus Torvalds
  2019-06-11  1:17   ` Kent Overstreet
@ 2019-06-11  4:10   ` Dave Chinner
  2019-06-11  4:39     ` Linus Torvalds
  1 sibling, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2019-06-11  4:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kent Overstreet, Linux List Kernel Mailing, linux-fsdevel,
	linux-bcache, Dave Chinner, Darrick J . Wong, Zach Brown,
	Peter Zijlstra, Jens Axboe, Josef Bacik, Alexander Viro,
	Andrew Morton, Tejun Heo

On Mon, Jun 10, 2019 at 10:46:35AM -1000, Linus Torvalds wrote:
> I also get the feeling that the "intent" part of the six-locks could
> just be done as a slight extension of the rwsem, where an "intent" is
> the same as a write-lock, but without waiting for existing readers,
> and then the write-lock part is just the "wait for readers to be
> done".

Please, no, let's not make the rwsems even more fragile than they
already are. I'm tired of the ongoing XFS customer escalations that
end up being root caused to yet another rwsem memory barrier bug.

> Have you talked to Waiman Long about that?

Unfortunately, Waiman has been unable to find/debug multiple rwsem
exclusion violations we've seen in XFS bug reports over the past 2-3
years. Those memory barrier bugs have all been fixed by other people
long after Waiman has said "I can't reproduce any problems in my
testing" and essentially walked away from the problem. We've been
left multiple times wondering how the hell we even prove it's a
rwsem bug because there's no way to reproduce the inconsistent rwsem
state we see in the kernel crash dumps.

Hence, as a downstream rwsem user, I have relatively little
confidence in upstream's ability to integrate new functionality into
rwsems without introducing yet more subtle regressions that are only
exposed by heavy rwsem users like XFS. As such, I consider rwsems to
be extremely fragile, and they are now a prime suspect whenever I see
some one-off memory corruption in a structure protected by a rwsem.

As such, please keep SIX locks separate to rwsems to minimise the
merge risk of bcachefs.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-11  1:17   ` Kent Overstreet
@ 2019-06-11  4:33     ` Dave Chinner
  2019-06-12 16:21       ` Kent Overstreet
  2019-06-11  4:55     ` bcachefs status update (it's done cooking; let's get this sucker merged) Linus Torvalds
  1 sibling, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2019-06-11  4:33 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Linus Torvalds, Dave Chinner, Waiman Long, Peter Zijlstra,
	Linux List Kernel Mailing, linux-fsdevel, linux-bcache,
	Darrick J . Wong, Zach Brown, Jens Axboe, Josef Bacik,
	Alexander Viro, Andrew Morton, Tejun Heo

On Mon, Jun 10, 2019 at 09:17:37PM -0400, Kent Overstreet wrote:
> On Mon, Jun 10, 2019 at 10:46:35AM -1000, Linus Torvalds wrote:
> > On Mon, Jun 10, 2019 at 9:14 AM Kent Overstreet
> > <kent.overstreet@gmail.com> wrote:
> > That lock is somewhat questionable in the first place, and no, we
> > don't do those hacky recursive things anyway. A recursive lock is
> > almost always a buggy and mis-designed one.
> 
> You're preaching to the choir there, I still feel dirty about that code and I'd
> love nothing more than for someone else to come along and point out how stupid
> I've been with a much better way of doing it. 
> 
> > Why does the regular page lock (at a finer granularity) not suffice?
> 
> Because the lock needs to prevent pages from being _added_ to the page cache -
> to do it with a page granularity lock it'd have to be part of the radix tree, 
> 
> > And no, nobody has ever cared. The dio people just don't care about
> > page cache anyway. They have their own thing going.
> 
> It's not just dio, it's even worse with the various fallocate operations. And
> the xfs people care, but IIRC even they don't have locking for pages being
> faulted in. This is an issue I've talked to other filesystem people quite a bit
> about - especially Dave Chinner, maybe we can get him to weigh in here.
> 
> And this inconsistency does result in _real_ bugs. It goes something like this:
>  - dio write shoots down the range of the page cache for the file it's writing
>    to, using invalidate_inode_pages_range2
>  - After the page cache shoot down, but before the write actually happens,
>    another process pulls those pages back in to the page cache
>  - Now the write happens: if that write was e.g. an allocating write, you're
>    going to have page cache state (buffer heads) that say that page doesn't have
>    anything on disk backing it, but it actually does because of the dio write.
> 
> xfs has additional locking (that the vfs does _not_ do) around both the buffered
> and dio IO paths to prevent this happening because of a buffered read pulling
> the pages back in, but no one has a solution for pages getting _faulted_ back in
> - either because of mmap or gup().
> 
> And there are some filesystem people who do know about this race, because at
> some point the dio code has been changed to shoot down the page cache _again_
> after the write completes. But that doesn't eliminate the race, it just makes it
> harder to trigger.
> 
> And dio writes actually aren't the worst of it, it's even worse with fallocate
> FALLOC_FL_INSERT_RANGE/COLLAPSE_RANGE. Last time I looked at the ext4 fallocate
> code, it looked _completely_ broken to me - the code seemed to think it was
> using the same mechanism truncate uses for shooting down the page cache and
> keeping pages from being readded - but that only works for truncate because it's
> changing i_size and shooting down pages above i_size. Fallocate needs to shoot
> down pages that are still within i_size, so... yeah...

Yes, that ext4 code is broken, and Jan Kara is trying to work out
how to fix it. His recent patchset fell foul of taking the same lock
either side of the mmap_sem in this path:

> The recursiveness is needed because otherwise, if you mmap a file, then do a dio
> write where you pass the address you mmapped to pwrite(), gup() from the dio
> write path will be trying to fault in the exact pages it's blocking from being
> added.
> 
> A better solution would be for gup() to detect that and return an error, so we
> can just fall back to buffered writes. Or just return an error to userspace
> because fuck anyone who would actually do that.

I just recently said this with reference to the range lock stuff I'm
working on in the background:

	FWIW, it's to avoid problems with stupid userspace stuff
	that nobody really should be doing that I want range locks
	for the XFS inode locks.  If userspace overlaps the ranges
	and deadlocks in that case, then they get to keep all the
	broken bits because, IMO, they are doing something
	monumentally stupid. I'd probably be making it return
	EDEADLOCK back out to userspace in the case rather than
	deadlocking but, fundamentally, I think it's broken
	behaviour that we should be rejecting with an error rather
	than adding complexity trying to handle it.

So I think this recursive locking across a page fault case should
just fail, not add yet more complexity to try to handle a rare
corner case that exists more in theory than in reality. i.e. put the
lock context in the current task, then if the page fault requires a
conflicting lock context to be taken, we terminate the page fault,
back out of the IO and return EDEADLOCK out to userspace. This works
for all types of lock contexts - only the filesystem itself needs to
know what the lock context pointer contains....
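
i.e. something like this - pseudo-code, every name in it is made up:

/*
 * Hypothetical: the filesystem stashes an opaque lock context on the task
 * before issuing the IO (think current->journal_info, but for IO locks),
 * and the fault path checks it instead of deadlocking.
 */
static vm_fault_t fs_filemap_fault(struct vm_fault *vmf)
{
        struct inode *inode = file_inode(vmf->vma->vm_file);
        void *ctx = current->fs_lock_context;   /* hypothetical field */

        if (ctx && fs_lock_context_conflicts(inode, ctx)) {
                /*
                 * The fault needs a lock the IO path already holds:
                 * terminate the fault, the IO backs out and returns
                 * EDEADLK to userspace.
                 */
                return VM_FAULT_SIGBUS;
        }

        return filemap_fault(vmf);
}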

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-11  4:10   ` Dave Chinner
@ 2019-06-11  4:39     ` Linus Torvalds
  2019-06-11  7:10       ` Dave Chinner
  0 siblings, 1 reply; 42+ messages in thread
From: Linus Torvalds @ 2019-06-11  4:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kent Overstreet, Linux List Kernel Mailing, linux-fsdevel,
	linux-bcache, Dave Chinner, Darrick J . Wong, Zach Brown,
	Peter Zijlstra, Jens Axboe, Josef Bacik, Alexander Viro,
	Andrew Morton, Tejun Heo

On Mon, Jun 10, 2019 at 6:11 PM Dave Chinner <david@fromorbit.com> wrote:
>
> Please, no, let's not make the rwsems even more fragile than they
> already are. I'm tired of the ongoing XFS customer escalations that
> end up being root caused to yet another rwsem memory barrier bug.
>
> > Have you talked to Waiman Long about that?
>
> Unfortunately, Waiman has been unable to find/debug multiple rwsem
> exclusion violations we've seen in XFS bug reports over the past 2-3
> years.

Inside xfs you can do whatever you want.

But in generic code, no, we're not saying "we don't trust the generic
locking, so we cook our own random locking".

If there really are exclusion issues, they should be fairly easy to
try to find with a generic test-suite. Have a bunch of readers that
assert that some shared variable has a particular value, and a bunch of
writers that then modify the value and set it back. Add some random
timing and "yield" to them all, and show that the serialization is
wrong.

Some kind of "XFS load Y shows problems" is undebuggable, and not
necessarily due to locking.

Because if the locking issues are real (and we did fix one bug
recently in a9e9bcb45b15: "locking/rwsem: Prevent decrement of reader
count before increment") it needs to be fixed. Some kind of "let's do
something else entirely" is simply not acceptable.

                  Linus

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-11  1:17   ` Kent Overstreet
  2019-06-11  4:33     ` Dave Chinner
@ 2019-06-11  4:55     ` Linus Torvalds
  2019-06-11 14:26       ` Matthew Wilcox
  1 sibling, 1 reply; 42+ messages in thread
From: Linus Torvalds @ 2019-06-11  4:55 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Dave Chinner, Waiman Long, Peter Zijlstra,
	Linux List Kernel Mailing, linux-fsdevel, linux-bcache,
	Darrick J . Wong, Zach Brown, Jens Axboe, Josef Bacik,
	Alexander Viro, Andrew Morton, Tejun Heo

On Mon, Jun 10, 2019 at 3:17 PM Kent Overstreet
<kent.overstreet@gmail.com> wrote:
>
>
> > Why does the regular page lock (at a finer granularity) not suffice?
>
> Because the lock needs to prevent pages from being _added_ to the page cache -
> to do it with a page granularity lock it'd have to be part of the radix tree,

No, I understand that part, but I still think we should be able to do
the locking per-page rather than over the whole mapping.

When doing dio, you need to iterate over old existing pages anyway in
that range (otherwise the "no _new_ pages" part is kind of pointless
when there are old pages there), so my gut feel is that you might as
well at that point also "poison" the range you are doing dio on. With
the xarray changes, we might be better at handling ranges. That was
one of the arguments for the xarrays over the old radix tree model,
after all.

And I think the dio code would ideally want to have a range-based lock
anyway, rather than one global one. No?

Anyway, don't get me wrong. I'm not entirely against a "stop adding
pages" model per-mapping if it's just fundamentally simpler and nobody
wants anything fancier. So I'm certainly open to it, assuming it
doesn't add any real overhead to the normal case.

But I *am* against it when it has ad-hoc locking and random
anti-recursion things.

So I'm with Dave on the "I hope we can avoid the recursive hacks" by
making better rules. Even if I disagree with him on the locking thing
- I'd rather put _more_ stress on the standard locking and make sure it
really works, over having multiple similar locking models because they
don't trust each other.

               Linus

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-11  4:39     ` Linus Torvalds
@ 2019-06-11  7:10       ` Dave Chinner
  2019-06-12  2:07         ` Linus Torvalds
  0 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2019-06-11  7:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kent Overstreet, Linux List Kernel Mailing, linux-fsdevel,
	linux-bcache, Dave Chinner, Darrick J . Wong, Zach Brown,
	Peter Zijlstra, Jens Axboe, Josef Bacik, Alexander Viro,
	Andrew Morton, Tejun Heo

On Mon, Jun 10, 2019 at 06:39:00PM -1000, Linus Torvalds wrote:
> On Mon, Jun 10, 2019 at 6:11 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > Please, no, let's not make the rwsems even more fragile than they
> > already are. I'm tired of the ongoing XFS customer escalations that
> > end up being root caused to yet another rwsem memory barrier bug.
> >
> > > Have you talked to Waiman Long about that?
> >
> > Unfortunately, Waiman has been unable to find/debug multiple rwsem
> > exclusion violations we've seen in XFS bug reports over the past 2-3
> > years.
> 
> Inside xfs you can do whatever you want.
>
> But in generic code, no, we're not saying "we don't trust the generic
> locking, so we cook our own random locking".

We use the generic rwsems in XFS, too, and it's the generic
rwsems that have been the cause of the problems I'm talking about.

The same rwsem issues were seen on the mmap_sem, the shrinker rwsem,
in a couple of device drivers, and so on. i.e. This isn't an XFS
issue I'm raising here - I'm raising a concern about the lack of
validation of core infrastructure and its suitability for
functionality extensions.

> If there really are exclusion issues, they should be fairly easy to
> try to find with a generic test-suite.  Have a bunch of readers that
> assert that some shared variable has a particular value, and a bunch of
> writers that then modify the value and set it back. Add some random
> timing and "yield" to them all, and show that the serialization is
> wrong.

Writing such a test suite would be the responsibility of the rwsem
maintainers, yes?

> Some kind of "XFS load Y shows problems" is undebuggable, and not
> necessarily due to locking.

Sure, but this wasn't isolated to XFS, and it wasn't one workload.

We had a growing pile of kernel crash dumps all with the same
signatures across multiple subsystems. When this happens, it falls
to the maintainer of that common element to more deeply analyse the
issue. One of the rwsem maintainers was unable to reproduce or find
the root cause of the pile of rwsem state corruptions, and so we've
been left hanging telling people "we think it's rwsems because the
state is valid right up to the rwsem state going bad, but we can't
prove it's a rwsem problem because the debug we've added to the
rwsem code makes the problem go away". Sometime later, a bug has
been found in the upstream rwsem code....

This has played out several times over the past couple of years. No
locking bugs have been found in XFS, with the mmap_sem, the shrinker
rwsem, etc, but 4 or 5 bugs have been found in the rwsem code and
backports of those commits have been proven to solve _all_ the
issues that were reported.

That's the painful reality I'm telling you about here - that poor
upstream core infrastructure quality has had quite severe downstream
knock-on effects that cost a lot of time, resources, money and
stress to diagnose and rectify.  I don't want those same mistakes to
be made again for many reasons, not the least that the stress of
these situations has a direct and adverse impact on my mental
health....

> Because if the locking issues are real (and we did fix one bug
> recently in a9e9bcb45b15: "locking/rwsem: Prevent decrement of reader
> count before increment") it needs to be fixed.

That's just one of the bugs we've tripped over. There have been a
couple of missed wakeup bugs that caused rwsem state hangs (e.g.
readers waiting with no holder), there was a power arch specific
memory barrier bug that caused read/write exclusion bugs, the
optimistic spinning caused some severe performance degradations on
the mmap_sem with some highly threaded workloads, the rwsem bias
changed from read biased to write biased (might be the other way
around, can't remember) some time around 4.10 causing a complete
inversion in mixed read-write IO characteristics, there was a
botched RHEL7 backport that had memory barrier bugs in it that
upstream didn't have that occurred because of the complexity of the
code, etc.

But this is all off-topic for bcachefs review - all we need to do
here is keep the SIX locking in a separate module and everything
rwsem related will be just fine.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 11/12] closures: closure_wait_event()
  2019-06-10 19:14 ` [PATCH 11/12] closures: closure_wait_event() Kent Overstreet
@ 2019-06-11 10:25   ` Coly Li
  2019-06-12 17:17   ` Greg KH
  1 sibling, 0 replies; 42+ messages in thread
From: Coly Li @ 2019-06-11 10:25 UTC (permalink / raw)
  To: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcache

On 2019/6/11 3:14 AM, Kent Overstreet wrote:
> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

Acked-by: Coly Li <colyli@suse.de>

Thanks.

Coly Li

> ---
>  include/linux/closure.h | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/include/linux/closure.h b/include/linux/closure.h
> index 308e38028c..abacb91c35 100644
> --- a/include/linux/closure.h
> +++ b/include/linux/closure.h
> @@ -379,4 +379,26 @@ static inline void closure_call(struct closure *cl, closure_fn fn,
>  	continue_at_nobarrier(cl, fn, wq);
>  }
>  
> +#define __closure_wait_event(waitlist, _cond)				\
> +do {									\
> +	struct closure cl;						\
> +									\
> +	closure_init_stack(&cl);					\
> +									\
> +	while (1) {							\
> +		closure_wait(waitlist, &cl);				\
> +		if (_cond)						\
> +			break;						\
> +		closure_sync(&cl);					\
> +	}								\
> +	closure_wake_up(waitlist);					\
> +	closure_sync(&cl);						\
> +} while (0)
> +
> +#define closure_wait_event(waitlist, _cond)				\
> +do {									\
> +	if (!(_cond))							\
> +		__closure_wait_event(waitlist, _cond);			\
> +} while (0)
> +
>  #endif /* _LINUX_CLOSURE_H */
> 


-- 

Coly Li

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 10/12] bcache: move closures to lib/
  2019-06-10 19:14 ` [PATCH 10/12] bcache: move closures to lib/ Kent Overstreet
@ 2019-06-11 10:25   ` Coly Li
  2019-06-13  7:28   ` Christoph Hellwig
  1 sibling, 0 replies; 42+ messages in thread
From: Coly Li @ 2019-06-11 10:25 UTC (permalink / raw)
  To: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcache

On 2019/6/11 3:14 AM, Kent Overstreet wrote:
> Prep work for bcachefs - being a fork of bcache it also uses closures
> 
> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

Acked-by: Coly Li <colyli@suse.de>

Thanks.

Coly Li


> ---
>  drivers/md/bcache/Kconfig                     | 10 +------
>  drivers/md/bcache/Makefile                    |  6 ++--
>  drivers/md/bcache/bcache.h                    |  2 +-
>  drivers/md/bcache/super.c                     |  1 -
>  drivers/md/bcache/util.h                      |  3 +-
>  .../md/bcache => include/linux}/closure.h     | 17 ++++++-----
>  lib/Kconfig                                   |  3 ++
>  lib/Kconfig.debug                             |  9 ++++++
>  lib/Makefile                                  |  2 ++
>  {drivers/md/bcache => lib}/closure.c          | 28 ++++++-------------
>  10 files changed, 37 insertions(+), 44 deletions(-)
>  rename {drivers/md/bcache => include/linux}/closure.h (97%)
>  rename {drivers/md/bcache => lib}/closure.c (89%)
> 
> diff --git a/drivers/md/bcache/Kconfig b/drivers/md/bcache/Kconfig
> index f6e0a8b3a6..3dd1d48987 100644
> --- a/drivers/md/bcache/Kconfig
> +++ b/drivers/md/bcache/Kconfig
> @@ -2,6 +2,7 @@
>  config BCACHE
>  	tristate "Block device as cache"
>  	select CRC64
> +	select CLOSURES
>  	help
>  	Allows a block device to be used as cache for other devices; uses
>  	a btree for indexing and the layout is optimized for SSDs.
> @@ -16,12 +17,3 @@ config BCACHE_DEBUG
>  
>  	Enables extra debugging tools, allows expensive runtime checks to be
>  	turned on.
> -
> -config BCACHE_CLOSURES_DEBUG
> -	bool "Debug closures"
> -	depends on BCACHE
> -	select DEBUG_FS
> -	help
> -	Keeps all active closures in a linked list and provides a debugfs
> -	interface to list them, which makes it possible to see asynchronous
> -	operations that get stuck.
> diff --git a/drivers/md/bcache/Makefile b/drivers/md/bcache/Makefile
> index d26b351958..2b790fb813 100644
> --- a/drivers/md/bcache/Makefile
> +++ b/drivers/md/bcache/Makefile
> @@ -2,8 +2,8 @@
>  
>  obj-$(CONFIG_BCACHE)	+= bcache.o
>  
> -bcache-y		:= alloc.o bset.o btree.o closure.o debug.o extents.o\
> -	io.o journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o\
> -	util.o writeback.o
> +bcache-y		:= alloc.o bset.o btree.o debug.o extents.o io.o\
> +	journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o util.o\
> +	writeback.o
>  
>  CFLAGS_request.o	+= -Iblock
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index fdf75352e1..ced9f1526c 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -180,6 +180,7 @@
>  
>  #include <linux/bcache.h>
>  #include <linux/bio.h>
> +#include <linux/closure.h>
>  #include <linux/kobject.h>
>  #include <linux/list.h>
>  #include <linux/mutex.h>
> @@ -192,7 +193,6 @@
>  
>  #include "bset.h"
>  #include "util.h"
> -#include "closure.h"
>  
>  struct bucket {
>  	atomic_t	pin;
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index a697a3a923..da6803f280 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -2487,7 +2487,6 @@ static int __init bcache_init(void)
>  		goto err;
>  
>  	bch_debug_init();
> -	closure_debug_init();
>  
>  	return 0;
>  err:
> diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
> index 00aab6abcf..8a75100c0b 100644
> --- a/drivers/md/bcache/util.h
> +++ b/drivers/md/bcache/util.h
> @@ -4,6 +4,7 @@
>  #define _BCACHE_UTIL_H
>  
>  #include <linux/blkdev.h>
> +#include <linux/closure.h>
>  #include <linux/errno.h>
>  #include <linux/kernel.h>
>  #include <linux/sched/clock.h>
> @@ -13,8 +14,6 @@
>  #include <linux/workqueue.h>
>  #include <linux/crc64.h>
>  
> -#include "closure.h"
> -
>  #define PAGE_SECTORS		(PAGE_SIZE / 512)
>  
>  struct closure;
> diff --git a/drivers/md/bcache/closure.h b/include/linux/closure.h
> similarity index 97%
> rename from drivers/md/bcache/closure.h
> rename to include/linux/closure.h
> index 376c5e659c..308e38028c 100644
> --- a/drivers/md/bcache/closure.h
> +++ b/include/linux/closure.h
> @@ -155,7 +155,7 @@ struct closure {
>  
>  	atomic_t		remaining;
>  
> -#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
> +#ifdef CONFIG_DEBUG_CLOSURES
>  #define CLOSURE_MAGIC_DEAD	0xc054dead
>  #define CLOSURE_MAGIC_ALIVE	0xc054a11e
>  
> @@ -184,15 +184,13 @@ static inline void closure_sync(struct closure *cl)
>  		__closure_sync(cl);
>  }
>  
> -#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
> +#ifdef CONFIG_DEBUG_CLOSURES
>  
> -void closure_debug_init(void);
>  void closure_debug_create(struct closure *cl);
>  void closure_debug_destroy(struct closure *cl);
>  
>  #else
>  
> -static inline void closure_debug_init(void) {}
>  static inline void closure_debug_create(struct closure *cl) {}
>  static inline void closure_debug_destroy(struct closure *cl) {}
>  
> @@ -200,21 +198,21 @@ static inline void closure_debug_destroy(struct closure *cl) {}
>  
>  static inline void closure_set_ip(struct closure *cl)
>  {
> -#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
> +#ifdef CONFIG_DEBUG_CLOSURES
>  	cl->ip = _THIS_IP_;
>  #endif
>  }
>  
>  static inline void closure_set_ret_ip(struct closure *cl)
>  {
> -#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
> +#ifdef CONFIG_DEBUG_CLOSURES
>  	cl->ip = _RET_IP_;
>  #endif
>  }
>  
>  static inline void closure_set_waiting(struct closure *cl, unsigned long f)
>  {
> -#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
> +#ifdef CONFIG_DEBUG_CLOSURES
>  	cl->waiting_on = f;
>  #endif
>  }
> @@ -243,6 +241,7 @@ static inline void closure_queue(struct closure *cl)
>  	 */
>  	BUILD_BUG_ON(offsetof(struct closure, fn)
>  		     != offsetof(struct work_struct, func));
> +
>  	if (wq) {
>  		INIT_WORK(&cl->work, cl->work.func);
>  		queue_work(wq, &cl->work);
> @@ -255,7 +254,7 @@ static inline void closure_queue(struct closure *cl)
>   */
>  static inline void closure_get(struct closure *cl)
>  {
> -#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
> +#ifdef CONFIG_DEBUG_CLOSURES
>  	BUG_ON((atomic_inc_return(&cl->remaining) &
>  		CLOSURE_REMAINING_MASK) <= 1);
>  #else
> @@ -271,7 +270,7 @@ static inline void closure_get(struct closure *cl)
>   */
>  static inline void closure_init(struct closure *cl, struct closure *parent)
>  {
> -	memset(cl, 0, sizeof(struct closure));
> +	cl->fn = NULL;
>  	cl->parent = parent;
>  	if (parent)
>  		closure_get(parent);
> diff --git a/lib/Kconfig b/lib/Kconfig
> index a9e56539bd..09a25af0d0 100644
> --- a/lib/Kconfig
> +++ b/lib/Kconfig
> @@ -427,6 +427,9 @@ config ASSOCIATIVE_ARRAY
>  
>  	  for more information.
>  
> +config CLOSURES
> +	bool
> +
>  config HAS_IOMEM
>  	bool
>  	depends on !NO_IOMEM
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index d5a4a4036d..6d97985e7e 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1397,6 +1397,15 @@ config DEBUG_CREDENTIALS
>  
>  source "kernel/rcu/Kconfig.debug"
>  
> +config DEBUG_CLOSURES
> +	bool "Debug closures (bcache async widgits)"
> +	depends on CLOSURES
> +	select DEBUG_FS
> +	help
> +	Keeps all active closures in a linked list and provides a debugfs
> +	interface to list them, which makes it possible to see asynchronous
> +	operations that get stuck.
> +
>  config DEBUG_WQ_FORCE_RR_CPU
>  	bool "Force round-robin CPU selection for unbound work items"
>  	depends on DEBUG_KERNEL
> diff --git a/lib/Makefile b/lib/Makefile
> index 18c2be516a..2003eda127 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -193,6 +193,8 @@ obj-$(CONFIG_ATOMIC64_SELFTEST) += atomic64_test.o
>  
>  obj-$(CONFIG_CPU_RMAP) += cpu_rmap.o
>  
> +obj-$(CONFIG_CLOSURES) += closure.o
> +
>  obj-$(CONFIG_CORDIC) += cordic.o
>  
>  obj-$(CONFIG_DQL) += dynamic_queue_limits.o
> diff --git a/drivers/md/bcache/closure.c b/lib/closure.c
> similarity index 89%
> rename from drivers/md/bcache/closure.c
> rename to lib/closure.c
> index 73f5319295..46cfe4c382 100644
> --- a/drivers/md/bcache/closure.c
> +++ b/lib/closure.c
> @@ -6,13 +6,12 @@
>   * Copyright 2012 Google, Inc.
>   */
>  
> +#include <linux/closure.h>
>  #include <linux/debugfs.h>
> -#include <linux/module.h>
> +#include <linux/export.h>
>  #include <linux/seq_file.h>
>  #include <linux/sched/debug.h>
>  
> -#include "closure.h"
> -
>  static inline void closure_put_after_sub(struct closure *cl, int flags)
>  {
>  	int r = flags & CLOSURE_REMAINING_MASK;
> @@ -127,7 +126,7 @@ void __sched __closure_sync(struct closure *cl)
>  }
>  EXPORT_SYMBOL(__closure_sync);
>  
> -#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
> +#ifdef CONFIG_DEBUG_CLOSURES
>  
>  static LIST_HEAD(closure_list);
>  static DEFINE_SPINLOCK(closure_list_lock);
> @@ -158,8 +157,6 @@ void closure_debug_destroy(struct closure *cl)
>  }
>  EXPORT_SYMBOL(closure_debug_destroy);
>  
> -static struct dentry *closure_debug;
> -
>  static int debug_seq_show(struct seq_file *f, void *data)
>  {
>  	struct closure *cl;
> @@ -182,7 +179,7 @@ static int debug_seq_show(struct seq_file *f, void *data)
>  			seq_printf(f, " W %pS\n",
>  				   (void *) cl->waiting_on);
>  
> -		seq_printf(f, "\n");
> +		seq_puts(f, "\n");
>  	}
>  
>  	spin_unlock_irq(&closure_list_lock);
> @@ -201,18 +198,11 @@ static const struct file_operations debug_ops = {
>  	.release	= single_release
>  };
>  
> -void  __init closure_debug_init(void)
> +static int __init closure_debug_init(void)
>  {
> -	if (!IS_ERR_OR_NULL(bcache_debug))
> -		/*
> -		 * it is unnecessary to check return value of
> -		 * debugfs_create_file(), we should not care
> -		 * about this.
> -		 */
> -		closure_debug = debugfs_create_file(
> -			"closures", 0400, bcache_debug, NULL, &debug_ops);
> +	debugfs_create_file("closures", 0400, NULL, NULL, &debug_ops);
> +	return 0;
>  }
> -#endif
> +late_initcall(closure_debug_init)
>  
> -MODULE_AUTHOR("Kent Overstreet <koverstreet@google.com>");
> -MODULE_LICENSE("GPL");
> +#endif
> 


-- 

Coly Li

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-11  4:55     ` bcachefs status update (it's done cooking; let's get this sucker merged) Linus Torvalds
@ 2019-06-11 14:26       ` Matthew Wilcox
  0 siblings, 0 replies; 42+ messages in thread
From: Matthew Wilcox @ 2019-06-11 14:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kent Overstreet, Dave Chinner, Waiman Long, Peter Zijlstra,
	Linux List Kernel Mailing, linux-fsdevel, linux-bcache,
	Darrick J . Wong, Zach Brown, Jens Axboe, Josef Bacik,
	Alexander Viro, Andrew Morton, Tejun Heo

On Mon, Jun 10, 2019 at 06:55:15PM -1000, Linus Torvalds wrote:
> On Mon, Jun 10, 2019 at 3:17 PM Kent Overstreet
> <kent.overstreet@gmail.com> wrote:
> > > Why does the regular page lock (at a finer granularity) not suffice?
> >
> > Because the lock needs to prevent pages from being _added_ to the page cache -
> > to do it with a page granularity lock it'd have to be part of the radix tree,
> 
> No, I understand that part, but I still think we should be able to do
> the locking per-page rather than over the whole mapping.
> 
> When doing dio, you need to iterate over old existing pages anyway in
> that range (otherwise the "no _new_ pages" part is kind of pointless
> when there are old pages there), so my gut feel is that you might as
> well at that point also "poison" the range you are doing dio on. With
> the xarray changes, we might be better at handling ranges. That was
> one of the arguments for the xarrays over the old radix tree model,
> after all.

We could do that -- if there are pages (or shadow entries) in the XArray,
replace them with "lock entries".  I think we'd want the behaviour of
faults / buffered IO to be to wait on those entries being removed.  I think
the DAX code is just about ready to switch over to lock entries instead
of having a special DAX lock bit.

The question is what to do when there are _no_ pages in the tree for a
range that we're about to do DIO on.  This should be the normal case --
as you say, DIO users typically have their own schemes for caching in
userspace, and rather resent the other users starting to cache their
file in the kernel.

Adding lock entries in the page cache for every DIO starts to look pretty
painful in terms of allocating radix tree nodes.  And it gets worse when
you have sub-page-size DIOs -- do we embed a count in the lock entry?
Or delay DIOs which hit in the same page as an existing DIO?

And then I start to question the whole reasoning behind how we do mixed
DIO and buffered IO; if there's a page in the page cache, why are we
writing it out, then doing a direct IO instead of doing a memcpy to the
page first, then writing the page back?

IOW, move to a model where:

 - If i_dio_count is non-zero, buffered I/O waits for i_dio_count to
   drop to zero before bringing pages in.
 - If i_pages is empty, DIOs increment i_dio_count, do the IO and
   decrement i_dio_count.
 - If i_pages is not empty, DIO is implemented by doing buffered I/O
   and waiting for the pages to finish writeback.

(needs a slight tweak to ensure that new DIOs can't hold off a buffered
I/O indefinitely)
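
Roughly, in code - inode_dio_begin/end/wait are the existing i_dio_count
helpers, everything prefixed sketch_ is made up:

#include <linux/fs.h>

static ssize_t sketch_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
        struct inode *inode = file_inode(iocb->ki_filp);
        struct address_space *mapping = inode->i_mapping;
        ssize_t ret;

        if (!(iocb->ki_flags & IOCB_DIRECT)) {
                /* buffered I/O waits for in-flight DIO before adding pages */
                inode_dio_wait(inode);
                return sketch_buffered_write(iocb, from);
        }

        if (mapping->nrpages) {
                /* pages already cached: buffered write + wait for writeback */
                ret = sketch_buffered_write(iocb, from);
                if (ret > 0) {
                        int err = filemap_write_and_wait_range(mapping,
                                        iocb->ki_pos - ret, iocb->ki_pos - 1);
                        if (err)
                                ret = err;
                }
                return ret;
        }

        inode_dio_begin(inode);                 /* i_dio_count++ */
        ret = sketch_submit_dio(iocb, from);
        inode_dio_end(inode);                   /* i_dio_count--, wakes waiters */
        return ret;
}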

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-11  7:10       ` Dave Chinner
@ 2019-06-12  2:07         ` Linus Torvalds
  0 siblings, 0 replies; 42+ messages in thread
From: Linus Torvalds @ 2019-06-12  2:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kent Overstreet, Linux List Kernel Mailing, linux-fsdevel,
	linux-bcache, Dave Chinner, Darrick J . Wong, Zach Brown,
	Peter Zijlstra, Jens Axboe, Josef Bacik, Alexander Viro,
	Andrew Morton, Tejun Heo

On Mon, Jun 10, 2019 at 9:11 PM Dave Chinner <david@fromorbit.com> wrote:
>
> The same rwsem issues were seen on the mmap_sem, the shrinker rwsem,
> in a couple of device drivers, and so on. i.e. This isn't an XFS
> issue I'm raising here - I'm raising a concern about the lack of
> validation of core infrastructure and its suitability for
> functionality extensions.

I haven't actually seen the reports.

That said, I do think this should be improving. The random
architecture-specific code is largely going away, and we'll have a
unified rwsem.

It might obviously cause some pain initially, but I think long-term we
should be much better off, at least avoiding the "on particular
configurations" issue..

              Linus

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-11  4:33     ` Dave Chinner
@ 2019-06-12 16:21       ` Kent Overstreet
  2019-06-12 23:02         ` Dave Chinner
  0 siblings, 1 reply; 42+ messages in thread
From: Kent Overstreet @ 2019-06-12 16:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Dave Chinner, Waiman Long, Peter Zijlstra,
	Linux List Kernel Mailing, linux-fsdevel, linux-bcache,
	Darrick J . Wong, Zach Brown, Jens Axboe, Josef Bacik,
	Alexander Viro, Andrew Morton, Tejun Heo

On Tue, Jun 11, 2019 at 02:33:36PM +1000, Dave Chinner wrote:
> I just recently said this with reference to the range lock stuff I'm
> working on in the background:
> 
> 	FWIW, it's to avoid problems with stupid userspace stuff
> 	that nobody really should be doing that I want range locks
> 	for the XFS inode locks.  If userspace overlaps the ranges
> 	and deadlocks in that case, then they get to keep all the
> 	broken bits because, IMO, they are doing something
> 	monumentally stupid. I'd probably be making it return
> 	EDEADLOCK back out to userspace in the case rather than
> 	deadlocking but, fundamentally, I think it's broken
> 	behaviour that we should be rejecting with an error rather
> 	than adding complexity trying to handle it.
> 
> So I think this recursive locking across a page fault case should
> just fail, not add yet more complexity to try to handle a rare
> corner case that exists more in theory than in reality. i.e put the
> lock context in the current task, then if the page fault requires a
> conflicting lock context to be taken, we terminate the page fault,
> back out of the IO and return EDEADLOCK out to userspace. This works
> for all types of lock contexts - only the filesystem itself needs to
> know what the lock context pointer contains....

Ok, I'm totally on board with returning EDEADLOCK.

Question: Would we be ok with returning EDEADLOCK for any IO where the buffer is
in the same address space as the file being read/written to, even if the buffer
and the IO don't technically overlap?

This would simplify things a lot and eliminate a really nasty corner case - page
faults trigger readahead. Even if the buffer and the direct IO don't overlap,
readahead can pull in pages that do overlap with the dio.

And on getting EDEADLOCK we could fall back to buffered IO, so userspace would
never know...
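
i.e. something like this in the write path, with made-up helper names:

static ssize_t fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
        ssize_t ret;

        if (iocb->ki_flags & IOCB_DIRECT) {
                ret = fs_dio_write(iocb, from);         /* hypothetical */
                if (ret != -EDEADLK)
                        return ret;
                /*
                 * The user buffer points back into our own mapping and the
                 * page fault bailed out - quietly do it buffered instead.
                 */
                iocb->ki_flags &= ~IOCB_DIRECT;
        }

        return fs_buffered_write(iocb, from);           /* hypothetical */
}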

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 01/12] Compiler Attributes: add __flatten
  2019-06-10 19:14 ` [PATCH 01/12] Compiler Attributes: add __flatten Kent Overstreet
@ 2019-06-12 17:16   ` Greg KH
  0 siblings, 0 replies; 42+ messages in thread
From: Greg KH @ 2019-06-12 17:16 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-kernel, linux-fsdevel, linux-bcache

On Mon, Jun 10, 2019 at 03:14:09PM -0400, Kent Overstreet wrote:
> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
> ---
>  include/linux/compiler_attributes.h | 5 +++++
>  1 file changed, 5 insertions(+)

I know I reject patches with no changelog text at all.  You shouldn't
rely on other maintainers being more lax.

You need to describe why you are doing this at the very least, as I sure
do not know...

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 11/12] closures: closure_wait_event()
  2019-06-10 19:14 ` [PATCH 11/12] closures: closure_wait_event() Kent Overstreet
  2019-06-11 10:25   ` Coly Li
@ 2019-06-12 17:17   ` Greg KH
  1 sibling, 0 replies; 42+ messages in thread
From: Greg KH @ 2019-06-12 17:17 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-kernel, linux-fsdevel, linux-bcache

On Mon, Jun 10, 2019 at 03:14:19PM -0400, Kent Overstreet wrote:
> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
> ---
>  include/linux/closure.h | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)

Again, no changelog?  You are a daring developer... :)

greg k-h

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-12 16:21       ` Kent Overstreet
@ 2019-06-12 23:02         ` Dave Chinner
  2019-06-19  8:21           ` Jan Kara
  0 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2019-06-12 23:02 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Linus Torvalds, Dave Chinner, Waiman Long, Peter Zijlstra,
	Linux List Kernel Mailing, linux-fsdevel, linux-bcache,
	Darrick J . Wong, Zach Brown, Jens Axboe, Josef Bacik,
	Alexander Viro, Andrew Morton, Tejun Heo

On Wed, Jun 12, 2019 at 12:21:44PM -0400, Kent Overstreet wrote:
> On Tue, Jun 11, 2019 at 02:33:36PM +1000, Dave Chinner wrote:
> > I just recently said this with reference to the range lock stuff I'm
> > working on in the background:
> > 
> > 	FWIW, it's to avoid problems with stupid userspace stuff
> > 	that nobody really should be doing that I want range locks
> > 	for the XFS inode locks.  If userspace overlaps the ranges
> > 	and deadlocks in that case, then they get to keep all the
> > 	broken bits because, IMO, they are doing something
> > 	monumentally stupid. I'd probably be making it return
> > 	EDEADLOCK back out to userspace in the case rather than
> > 	deadlocking but, fundamentally, I think it's broken
> > 	behaviour that we should be rejecting with an error rather
> > 	than adding complexity trying to handle it.
> > 
> > So I think this recursive locking across a page fault case should
> > just fail, not add yet more complexity to try to handle a rare
> > corner case that exists more in theory than in reality. i.e put the
> > lock context in the current task, then if the page fault requires a
> > conflicting lock context to be taken, we terminate the page fault,
> > back out of the IO and return EDEADLOCK out to userspace. This works
> > for all types of lock contexts - only the filesystem itself needs to
> > know what the lock context pointer contains....
> 
> Ok, I'm totally on board with returning EDEADLOCK.
> 
> Question: Would we be ok with returning EDEADLOCK for any IO where the buffer is
> in the same address space as the file being read/written to, even if the buffer
> and the IO don't technically overlap?

I'd say that depends on the lock granularity. For a range lock,
we'd be able to do the IO for non-overlapping ranges. For a normal
mutex or rwsem, then we risk deadlock if the page fault triggers on
the same address space host as we already have locked for IO. That's
the case we currently handle with the second IO lock in XFS, ext4,
btrfs, etc (XFS_MMAPLOCK_* in XFS).
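
For reference, the fault side of that second lock is essentially this
pattern - simplified from memory, not the exact XFS code:

static vm_fault_t xfs_fault_sketch(struct vm_fault *vmf)
{
        struct xfs_inode        *ip = XFS_I(file_inode(vmf->vma->vm_file));
        vm_fault_t              ret;

        xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
        ret = filemap_fault(vmf);       /* page cache instantiation happens
                                           under the MMAPLOCK */
        xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);

        return ret;
}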

One of the reasons I'm looking at range locks for XFS is to get rid
of the need for this second mmap lock, as there is no reason for it
existing if we can lock ranges and EDEADLOCK inside page faults and
return errors.

> This would simplify things a lot and eliminate a really nasty corner case - page
> faults trigger readahead. Even if the buffer and the direct IO don't overlap,
> readahead can pull in pages that do overlap with the dio.

Page cache readahead needs to be moved under the filesystem IO
locks. There was a recent thread about how readahead can race with
hole punching and other fallocate() operations because page cache
readahead bypasses the filesystem IO locks used to serialise page
cache invalidation.

e.g. Readahead can be directed by userspace via fadvise, so we now
have file->f_op->fadvise() so that filesystems can lock the inode
before calling generic_fadvise() such that page cache instantiation
and readahead dispatch can be serialised against page cache
invalidation. I have a patch for XFS sitting around somewhere that
implements the ->fadvise method.
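
Something along these lines - a sketch from memory, not the actual patch:

static int
xfs_file_fadvise(
        struct file     *file,
        loff_t          offset,
        loff_t          len,
        int             advice)
{
        struct xfs_inode        *ip = XFS_I(file_inode(file));
        int                     ret;

        /*
         * WILLNEED instantiates page cache, so serialise it against hole
         * punching etc. by taking the IO lock around the generic helper.
         */
        xfs_ilock(ip, XFS_IOLOCK_SHARED);
        ret = generic_fadvise(file, offset, len, advice);
        xfs_iunlock(ip, XFS_IOLOCK_SHARED);

        return ret;
}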

I think there are some other patches floating around to address the
other readahead mechanisms to only be done under filesystem IO locks,
but I haven't had time to dig into it any further. Readahead from
page faults most definitely needs to be under the MMAPLOCK at
least so it serialises against fallocate()...

> And on getting EDEADLOCK we could fall back to buffered IO, so
> userspace would never know....

Yup, that's a choice that individual filesystems can make.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 10/12] bcache: move closures to lib/
  2019-06-10 19:14 ` [PATCH 10/12] bcache: move closures to lib/ Kent Overstreet
  2019-06-11 10:25   ` Coly Li
@ 2019-06-13  7:28   ` Christoph Hellwig
  2019-06-13 11:04     ` Kent Overstreet
  1 sibling, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2019-06-13  7:28 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-kernel, linux-fsdevel, linux-bcache

On Mon, Jun 10, 2019 at 03:14:18PM -0400, Kent Overstreet wrote:
> Prep work for bcachefs - being a fork of bcache it also uses closures

NAK.  This obfuscation needs to go away from bcache and not actually be
spread further, especially not as an API with multiple users which will
make it even harder to get rid of it.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 10/12] bcache: move closures to lib/
  2019-06-13  7:28   ` Christoph Hellwig
@ 2019-06-13 11:04     ` Kent Overstreet
  0 siblings, 0 replies; 42+ messages in thread
From: Kent Overstreet @ 2019-06-13 11:04 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, linux-fsdevel, linux-bcache

On Thu, Jun 13, 2019 at 12:28:41AM -0700, Christoph Hellwig wrote:
> On Mon, Jun 10, 2019 at 03:14:18PM -0400, Kent Overstreet wrote:
> > Prep work for bcachefs - being a fork of bcache it also uses closures
> 
> NAK.  This obfuscation needs to go away from bcache and not actually be
> spread further, especially not as an API with multiple users which will
> make it even harder to get rid of it.

Christoph, you've made it plenty clear how much you dislike closures in the past
but "I don't like it" is not remotely the kind of objection that is appropriate
or useful for technical discussions, and that's pretty much all you've ever
given.

If you really think that code should be gotten rid of, then maybe you should
actually _look at what they do that's not covered by other kernel
infrastructure_ and figure out something better. Otherwise... no one else seems
to care all that much about closures, and you're not actually giving any
technical feedback, so I'm not sure what you expect.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-12 23:02         ` Dave Chinner
@ 2019-06-19  8:21           ` Jan Kara
  2019-07-03  1:04             ` [PATCH] mm: Support madvise_willneed override by Filesystems Boaz Harrosh
  0 siblings, 1 reply; 42+ messages in thread
From: Jan Kara @ 2019-06-19  8:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kent Overstreet, Linus Torvalds, Dave Chinner, Waiman Long,
	Peter Zijlstra, Linux List Kernel Mailing, linux-fsdevel,
	linux-bcache, Darrick J . Wong, Zach Brown, Jens Axboe,
	Josef Bacik, Alexander Viro, Andrew Morton, Tejun Heo

On Thu 13-06-19 09:02:24, Dave Chinner wrote:
> On Wed, Jun 12, 2019 at 12:21:44PM -0400, Kent Overstreet wrote:
> > This would simplify things a lot and eliminate a really nasty corner case - page
> > faults trigger readahead. Even if the buffer and the direct IO don't overlap,
> > readahead can pull in pages that do overlap with the dio.
> 
> Page cache readahead needs to be moved under the filesystem IO
> locks. There was a recent thread about how readahead can race with
> hole punching and other fallocate() operations because page cache
> readahead bypasses the filesystem IO locks used to serialise page
> cache invalidation.
> 
> e.g. Readahead can be directed by userspace via fadvise, so we now
> have file->f_op->fadvise() so that filesystems can lock the inode
> before calling generic_fadvise() such that page cache instantiation
> and readahead dispatch can be serialised against page cache
> invalidation. I have a patch for XFS sitting around somewhere that
> implements the ->fadvise method.
> 
> I think there are some other patches floating around to address the
> other readahead mechanisms to only be done under filesystem IO locks,
> but I haven't had time to dig into it any further. Readahead from
> page faults most definitely needs to be under the MMAPLOCK at
> least so it serialises against fallocate()...

Yes, I have a patch to make madvise(MADV_WILLNEED) go through ->fadvise() as
well. I'll post it soon since the rest of the series isn't really dependent
on it.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH] mm: Support madvise_willneed override by Filesystems
  2019-06-19  8:21           ` Jan Kara
@ 2019-07-03  1:04             ` Boaz Harrosh
  2019-07-03 17:21               ` Jan Kara
  0 siblings, 1 reply; 42+ messages in thread
From: Boaz Harrosh @ 2019-07-03  1:04 UTC (permalink / raw)
  To: Jan Kara, Dave Chinner
  Cc: Kent Overstreet, Linus Torvalds, Dave Chinner, Waiman Long,
	Peter Zijlstra, Linux List Kernel Mailing, linux-fsdevel,
	linux-bcache, Darrick J . Wong, Zach Brown, Jens Axboe,
	Josef Bacik, Alexander Viro, Andrew Morton, Tejun Heo,
	Amir Goldstein

On 19/06/2019 11:21, Jan Kara wrote:
<>
> Yes, I have a patch to make madvise(MADV_WILLNEED) go through ->fadvise() as
> well. I'll post it soon since the rest of the series isn't really dependent
> on it.
> 
> 								Honza
> 

Hi Jan

Funny, I've been sitting on the same patch since last LSF. I need it too for
other reasons. I have not seen it - have you pushed your patch yet?
(It is based on old v4.20.)

~~~~~~~~~
From fddb38169e33d23060ddd444ba6f2319f76edc89 Mon Sep 17 00:00:00 2001
From: Boaz Harrosh <boazh@netapp.com>
Date: Thu, 16 May 2019 20:02:14 +0300
Subject: [PATCH] mm: Support madvise_willneed override by Filesystems

In the patchset:
	[b833a3660394] ovl: add ovl_fadvise()
	[3d8f7615319b] vfs: implement readahead(2) using POSIX_FADV_WILLNEED
	[45cd0faae371] vfs: add the fadvise() file operation

Amir Goldstein introduced a way for filesystems to override fadvise.
Well madvise_willneed is exactly as fadvise_willneed except it always
returns 0.

In this patch we call the FS vector if it exists.

NOTE: I called vfs_fadvise(..,POSIX_FADV_WILLNEED);
      (Which is my artistic preference)

I could also selectively call
	if (file->f_op->fadvise)
		return file->f_op->fadvise(..., POSIX_FADV_WILLNEED);
If we fear theoretical side effects. I don't mind either way.

CC: Amir Goldstein <amir73il@gmail.com>
CC: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 mm/madvise.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 6cb1ca93e290..6b84ddcaaaf2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -24,6 +24,7 @@
 #include <linux/swapops.h>
 #include <linux/shmem_fs.h>
 #include <linux/mmu_notifier.h>
+#include <linux/fadvise.h>
 
 #include <asm/tlb.h>
 
@@ -303,7 +304,8 @@ static long madvise_willneed(struct vm_area_struct *vma,
 		end = vma->vm_end;
 	end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
 
-	force_page_cache_readahead(file->f_mapping, file, start, end - start);
+	vfs_fadvise(file, start << PAGE_SHIFT, (end - start) << PAGE_SHIFT,
+		    POSIX_FADV_WILLNEED);
 	return 0;
 }
 
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
                   ` (12 preceding siblings ...)
  2019-06-10 20:46 ` bcachefs status update (it's done cooking; let's get this sucker merged) Linus Torvalds
@ 2019-07-03  5:59 ` Stefan K
  2020-06-10 11:02 ` Stefan K
  14 siblings, 0 replies; 42+ messages in thread
From: Stefan K @ 2019-07-03  5:59 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcache, Dave Chinner,
	Darrick J . Wong , Zach Brown, Peter Zijlstra, Jens Axboe,
	Josef Bacik, Alexander Viro, Andrew Morton, Linus Torvalds,
	Tejun Heo

Hello,

is there a chance to get this in Kernel 5.3?
And thanks for this fs!


On Monday, June 10, 2019 9:14:08 PM CEST Kent Overstreet wrote:
> Last status update: https://lkml.org/lkml/2018/12/2/46
>
> Current status - I'm pretty much running out of things to polish and excuses to
> keep tinkering. The core featureset is _done_ and the list of known outstanding
> bugs is getting to be short and unexciting. The next big things on my todo list
> are finishing erasure coding and reflink, but there's no reason for merging to
> wait on those.
>
> So. Here's my bcachefs-for-review branch - this has the minimal set of patches
> outside of fs/bcachefs/. My master branch has some performance optimizations for
> the core buffered IO paths, but those are fairly tricky and invasive so I want
> to hold off on those for now - this branch is intended to be more or less
> suitable for merging as is.
>
> https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-for-review
>
> The list of non bcachefs patches is:
>
> closures: fix a race on wakeup from closure_sync
> closures: closure_wait_event()
> bcache: move closures to lib/
> bcache: optimize continue_at_nobarrier()
> block: Add some exports for bcachefs
> Propagate gfp_t when allocating pte entries from __vmalloc
> fs: factor out d_mark_tmpfile()
> fs: insert_inode_locked2()
> mm: export find_get_pages()
> mm: pagecache add lock
> locking: SIX locks (shared/intent/exclusive)
> Compiler Attributes: add __flatten
>
> Most of the patches are pretty small, of the ones that aren't:
>
>  - SIX locks have already been discussed, and seem to be pretty uncontroversial.
>
>  - pagecache add lock: it's kind of ugly, but necessary to rigorously prevent
>    page cache inconsistencies with dio and other operations, in particular
>    racing vs. page faults - honestly, it's criminal that we still don't have a
>    mechanism in the kernel to address this, other filesystems are susceptible to
>    these kinds of bugs too.
>
>    My patch is intentionally ugly in the hopes that someone else will come up
>    with a magical elegant solution, but in the meantime it's an "it's ugly but
>    it works" sort of thing, and I suspect in real world scenarios it's going to
>    beat any kind of range locking performance wise, which is the only
>    alternative I've heard discussed.
>
>  - Propaget gfp_t from __vmalloc() - bcachefs needs __vmalloc() to respect
>    GFP_NOFS, that's all that is.
>
>  - and, moving closures out of drivers/md/bcache to lib/.
>
> The rest of the tree is 62k lines of code in fs/bcachefs. So, I obviously won't
> be mailing out all of that as patches, but if any code reviewers have
> suggestions on what would make that go easier go ahead and speak up. The last
> time I was mailing things out for review the main thing that came up was ioctls,
> but the ioctl interface hasn't really changed since then. I'm pretty confident
> in the on disk format stuff, which was the other thing that was mentioned.
>
> ----------
>
> This has been a monumental effort over a lot of years, and I'm _really_ happy
> with how it's turned out. I'm excited to finally unleash this upon the world.
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] mm: Support madvise_willneed override by Filesystems
  2019-07-03  1:04             ` [PATCH] mm: Support madvise_willneed override by Filesystems Boaz Harrosh
@ 2019-07-03 17:21               ` Jan Kara
  2019-07-03 18:03                 ` Boaz Harrosh
  0 siblings, 1 reply; 42+ messages in thread
From: Jan Kara @ 2019-07-03 17:21 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Jan Kara, Dave Chinner, Kent Overstreet, Linus Torvalds,
	Dave Chinner, Waiman Long, Peter Zijlstra,
	Linux List Kernel Mailing, linux-fsdevel, linux-bcache,
	Darrick J . Wong, Zach Brown, Jens Axboe, Josef Bacik,
	Alexander Viro, Andrew Morton, Tejun Heo, Amir Goldstein

On Wed 03-07-19 04:04:57, Boaz Harrosh wrote:
> On 19/06/2019 11:21, Jan Kara wrote:
> <>
> > Yes, I have a patch to make madvise(MADV_WILLNEED) go through ->fadvise() as
> > well. I'll post it soon since the rest of the series isn't really dependent
> > on it.
> > 
> > 								Honza
> > 
> 
> Hi Jan
> 
> Funny, I've been sitting on the same patch since last LSF. I need it too for
> other reasons. I have not seen it - have you pushed your patch yet?
> (It is based on old v4.20.)

Your patch is wrong due to lock ordering. You should not call vfs_fadvise()
under mmap_sem. So we need to do a similar dance to madvise_remove(). I
have to get to writing at least the XFS fix so that the madvise change gets
used, and post the madvise patch with it... Sorry it takes me so long.
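
Roughly, mirroring what madvise_remove() does - a sketch, not the final patch:

static long madvise_willneed(struct vm_area_struct *vma,
                             struct vm_area_struct **prev,
                             unsigned long start, unsigned long end)
{
        struct file *file = vma->vm_file;
        loff_t offset, len;

        /* ... the existing !file / shmem / DAX special cases go here ... */

        offset = (loff_t)(start - vma->vm_start)
                        + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
        len = (loff_t)(end - start);

        /*
         * The vma (and its reference to the file) can go away as soon as we
         * drop mmap_sem, so pin the file and tell madvise we dropped the lock.
         */
        *prev = NULL;
        get_file(file);
        up_read(&current->mm->mmap_sem);
        vfs_fadvise(file, offset, len, POSIX_FADV_WILLNEED);
        fput(file);
        down_read(&current->mm->mmap_sem);

        return 0;
}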

								Honza
> 
> ~~~~~~~~~
> From fddb38169e33d23060ddd444ba6f2319f76edc89 Mon Sep 17 00:00:00 2001
> From: Boaz Harrosh <boazh@netapp.com>
> Date: Thu, 16 May 2019 20:02:14 +0300
> Subject: [PATCH] mm: Support madvise_willneed override by Filesystems
> 
> In the patchset:
> 	[b833a3660394] ovl: add ovl_fadvise()
> 	[3d8f7615319b] vfs: implement readahead(2) using POSIX_FADV_WILLNEED
> 	[45cd0faae371] vfs: add the fadvise() file operation
> 
> Amir Goldstein introduced a way for filesystems to override fadvise.
> Well madvise_willneed is exactly as fadvise_willneed except it always
> returns 0.
> 
> In this patch we call the FS vector if it exists.
> 
> NOTE: I called vfs_fadvise(..,POSIX_FADV_WILLNEED);
>       (Which is my artistic preference)
> 
> I could also selectively call
> 	if (file->f_op->fadvise)
> 		return file->f_op->fadvise(..., POSIX_FADV_WILLNEED);
> If we fear theoretical side effects. I don't mind either way.
> 
> CC: Amir Goldstein <amir73il@gmail.com>
> CC: Miklos Szeredi <mszeredi@redhat.com>
> Signed-off-by: Boaz Harrosh <boazh@netapp.com>
> ---
>  mm/madvise.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6cb1ca93e290..6b84ddcaaaf2 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -24,6 +24,7 @@
>  #include <linux/swapops.h>
>  #include <linux/shmem_fs.h>
>  #include <linux/mmu_notifier.h>
> +#include <linux/fadvise.h>
>  
>  #include <asm/tlb.h>
>  
> @@ -303,7 +304,8 @@ static long madvise_willneed(struct vm_area_struct *vma,
>  		end = vma->vm_end;
>  	end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
>  
> -	force_page_cache_readahead(file->f_mapping, file, start, end - start);
> +	vfs_fadvise(file, start << PAGE_SHIFT, (end - start) << PAGE_SHIFT,
> +		    POSIX_FADV_WILLNEED);
>  	return 0;
>  }
>  
> -- 
> 2.20.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] mm: Support madvise_willneed override by Filesystems
  2019-07-03 17:21               ` Jan Kara
@ 2019-07-03 18:03                 ` Boaz Harrosh
  0 siblings, 0 replies; 42+ messages in thread
From: Boaz Harrosh @ 2019-07-03 18:03 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Kent Overstreet, Linus Torvalds, Dave Chinner,
	Waiman Long, Peter Zijlstra, Linux List Kernel Mailing,
	linux-fsdevel, linux-bcache, Darrick J . Wong, Zach Brown,
	Jens Axboe, Josef Bacik, Alexander Viro, Andrew Morton, Tejun Heo,
	Amir Goldstein

On 03/07/2019 20:21, Jan Kara wrote:
> On Wed 03-07-19 04:04:57, Boaz Harrosh wrote:
>> On 19/06/2019 11:21, Jan Kara wrote:
>> <>
<>
>> Hi Jan
>>
>> Funny, I've been sitting on the same patch since last LSF. I need it too for
>> other reasons. I have not seen it - have you pushed your patch yet?
>> (It is based on old v4.20.)
> 
> Your patch is wrong due to lock ordering. You should not call vfs_fadvise()
> under mmap_sem. So we need to do a similar dance to madvise_remove(). I
> have to get to writing at least the XFS fix so that the madvise change gets
> used, and post the madvise patch with it... Sorry it takes me so long.
> 
> 								Honza

Ha, sorry, I was not aware of this. Lockdep did not catch it on my setup
because my setup does not have any locking conflicts with mmap_sem on the
WILL_NEED path.

But surely you are right, because the whole effort here is to fix the locking problems.

I will also try, in a day or two, to do as you suggest and look at madvise_remove()
once I have a bit of time. Whoever gets to be less busy first ...

Thank you for your help
Boaz

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 12/12] closures: fix a race on wakeup from closure_sync
  2019-06-10 19:14 ` [PATCH 12/12] closures: fix a race on wakeup from closure_sync Kent Overstreet
@ 2019-07-16 10:47   ` Coly Li
  2019-07-18  7:46     ` Coly Li
  0 siblings, 1 reply; 42+ messages in thread
From: Coly Li @ 2019-07-16 10:47 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-kernel, linux-fsdevel, linux-bcache

Hi Kent,

On 2019/6/11 3:14 AM, Kent Overstreet wrote:
> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Acked-by: Coly Li <colyli@suse.de>

I have also received a report of a suspicious closure race condition in
bcache, and people are asking for this patch to go into Linux v5.3.

So before this patch gets merged upstream, I plan to rebase it onto
drivers/md/bcache/closure.c for now. Of course the author is still you.

Once lib/closure.c is merged upstream, I will rebase all closure usage
in bcache to use lib/closure.{c,h}.

Thanks in advance.

Coly Li

> ---
>  lib/closure.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/closure.c b/lib/closure.c
> index 46cfe4c382..3e6366c262 100644
> --- a/lib/closure.c
> +++ b/lib/closure.c
> @@ -104,8 +104,14 @@ struct closure_syncer {
>  
>  static void closure_sync_fn(struct closure *cl)
>  {
> -	cl->s->done = 1;
> -	wake_up_process(cl->s->task);
> +	struct closure_syncer *s = cl->s;
> +	struct task_struct *p;
> +
> +	rcu_read_lock();
> +	p = READ_ONCE(s->task);
> +	s->done = 1;
> +	wake_up_process(p);
> +	rcu_read_unlock();
>  }
>  
>  void __sched __closure_sync(struct closure *cl)
> 


-- 

Coly Li

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 12/12] closures: fix a race on wakeup from closure_sync
  2019-07-16 10:47   ` Coly Li
@ 2019-07-18  7:46     ` Coly Li
  2019-07-22 17:22       ` Kent Overstreet
  0 siblings, 1 reply; 42+ messages in thread
From: Coly Li @ 2019-07-18  7:46 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-kernel, linux-fsdevel, linux-bcache

On 2019/7/16 6:47 PM, Coly Li wrote:
> Hi Kent,
> 
> On 2019/6/11 3:14 上午, Kent Overstreet wrote:
>> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
> Acked-by: Coly Li <colyli@suse.de>
> 
> And also I receive report for suspicious closure race condition in
> bcache, and people ask for having this patch into Linux v5.3.
> 
> So before this patch gets merged into upstream, I plan to rebase it to
> drivers/md/bcache/closure.c at this moment. Of cause the author is you.
> 
> When lib/closure.c merged into upstream, I will rebase all closure usage
> from bcache to use lib/closure.{c,h}.

Hi Kent,

The reporter of the race bug replied to me that the closure race is very
rare to reproduce; after applying the patch and testing, they are not sure
whether their closure race problem is fixed or not.

I also notice that rcu_read_lock()/rcu_read_unlock() is used here, but it
is not clear to me what the RCU read lock is protecting in
closure_sync_fn(). I believe you have a reason to use RCU here; could you
please provide some hints to help me understand the change better?

Thanks in advance.

Coly Li

>> ---
>>  lib/closure.c | 10 ++++++++--
>>  1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/lib/closure.c b/lib/closure.c
>> index 46cfe4c382..3e6366c262 100644
>> --- a/lib/closure.c
>> +++ b/lib/closure.c
>> @@ -104,8 +104,14 @@ struct closure_syncer {
>>  
>>  static void closure_sync_fn(struct closure *cl)
>>  {
>> -	cl->s->done = 1;
>> -	wake_up_process(cl->s->task);
>> +	struct closure_syncer *s = cl->s;
>> +	struct task_struct *p;
>> +
>> +	rcu_read_lock();
>> +	p = READ_ONCE(s->task);
>> +	s->done = 1;
>> +	wake_up_process(p);
>> +	rcu_read_unlock();
>>  }
>>  
>>  void __sched __closure_sync(struct closure *cl)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 12/12] closures: fix a race on wakeup from closure_sync
  2019-07-18  7:46     ` Coly Li
@ 2019-07-22 17:22       ` Kent Overstreet
  2020-11-24 23:07         ` Marc Smith
  0 siblings, 1 reply; 42+ messages in thread
From: Kent Overstreet @ 2019-07-22 17:22 UTC (permalink / raw)
  To: Coly Li; +Cc: linux-kernel, linux-fsdevel, linux-bcache

On Thu, Jul 18, 2019 at 03:46:46PM +0800, Coly Li wrote:
> On 2019/7/16 6:47 下午, Coly Li wrote:
> > Hi Kent,
> > 
> > On 2019/6/11 3:14 上午, Kent Overstreet wrote:
> >> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
> > Acked-by: Coly Li <colyli@suse.de>
> > 
> > And also I receive report for suspicious closure race condition in
> > bcache, and people ask for having this patch into Linux v5.3.
> > 
> > So before this patch gets merged into upstream, I plan to rebase it to
> > drivers/md/bcache/closure.c at this moment. Of cause the author is you.
> > 
> > When lib/closure.c merged into upstream, I will rebase all closure usage
> > from bcache to use lib/closure.{c,h}.
> 
> Hi Kent,
> 
> The race bug reporter replies me that the closure race bug is very rare
> to reproduce, after applying the patch and testing, they are not sure
> whether their closure race problem is fixed or not.
> 
> And I notice rcu_read_lock()/rcu_read_unlock() is used here, but it is
> not clear to me what is the functionality of the rcu read lock in
> closure_sync_fn(). I believe you have reason to use the rcu stuffs here,
> could you please provide some hints to help me to understand the change
> better ?

The race was when a thread using closure_sync() notices cl->s->done == 1 before
the thread calling closure_put() calls wake_up_process(). Then, it's possible
for that thread to return and exit just before wake_up_process() is called - so
we're trying to wake up a process that no longer exists.

rcu_read_lock() is sufficient to protect against this, as there's an rcu barrier
somewhere in the process teardown path.
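
To spell the ordering out, the problematic interleaving with the old code
was roughly:

        __closure_sync() (waiter)                  closure_sync_fn() (via closure_put())
        -------------------------                  -------------------------------------
        set_current_state(TASK_UNINTERRUPTIBLE)
                                                   cl->s->done = 1
        sees s.done == 1, breaks out of the loop
        returns; the task exits
                                                   wake_up_process(cl->s->task)  <- task gone

With the patch, s->task is read and the wakeup is issued inside
rcu_read_lock()/rcu_read_unlock(), so even if the waiter exits right after
seeing done == 1, the task_struct we're about to wake can't be freed out
from under us before we're done with it.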

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
                   ` (13 preceding siblings ...)
  2019-07-03  5:59 ` Stefan K
@ 2020-06-10 11:02 ` Stefan K
  2020-06-10 11:32   ` Kent Overstreet
  14 siblings, 1 reply; 42+ messages in thread
From: Stefan K @ 2020-06-10 11:02 UTC (permalink / raw)
  To: linux-bcache; +Cc: Kent Overstreet

... one year later ...

What is the status now? The latest update on Patreon is a few months old.
What are the blockers to merging bcachefs into the kernel?

I hope that the bcachefs thing will be merged soon.

best regards
Stefan


(I'm shortening the cc-list)

On Monday, June 10, 2019 9:14:08 PM CEST Kent Overstreet wrote:
> Last status update: https://lkml.org/lkml/2018/12/2/46
>
> Current status - I'm pretty much running out of things to polish and excuses to
> keep tinkering. The core featureset is _done_ and the list of known outstanding
> bugs is getting to be short and unexciting. The next big things on my todo list
> are finishing erasure coding and reflink, but there's no reason for merging to
> wait on those.
>
> So. Here's my bcachefs-for-review branch - this has the minimal set of patches
> outside of fs/bcachefs/. My master branch has some performance optimizations for
> the core buffered IO paths, but those are fairly tricky and invasive so I want
> to hold off on those for now - this branch is intended to be more or less
> suitable for merging as is.
>
> https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-for-review
>
> The list of non bcachefs patches is:
>
> closures: fix a race on wakeup from closure_sync
> closures: closure_wait_event()
> bcache: move closures to lib/
> bcache: optimize continue_at_nobarrier()
> block: Add some exports for bcachefs
> Propagate gfp_t when allocating pte entries from __vmalloc
> fs: factor out d_mark_tmpfile()
> fs: insert_inode_locked2()
> mm: export find_get_pages()
> mm: pagecache add lock
> locking: SIX locks (shared/intent/exclusive)
> Compiler Attributes: add __flatten
>
> Most of the patches are pretty small, of the ones that aren't:
>
>  - SIX locks have already been discussed, and seem to be pretty uncontroversial.
>
>  - pagecache add lock: it's kind of ugly, but necessary to rigorously prevent
>    page cache inconsistencies with dio and other operations, in particular
>    racing vs. page faults - honestly, it's criminal that we still don't have a
>    mechanism in the kernel to address this, other filesystems are susceptible to
>    these kinds of bugs too.
>
>    My patch is intentionally ugly in the hopes that someone else will come up
>    with a magical elegant solution, but in the meantime it's an "it's ugly but
>    it works" sort of thing, and I suspect in real world scenarios it's going to
>    beat any kind of range locking performance wise, which is the only
>    alternative I've heard discussed.
>
>  - Propaget gfp_t from __vmalloc() - bcachefs needs __vmalloc() to respect
>    GFP_NOFS, that's all that is.
>
>  - and, moving closures out of drivers/md/bcache to lib/.
>
> The rest of the tree is 62k lines of code in fs/bcachefs. So, I obviously won't
> be mailing out all of that as patches, but if any code reviewers have
> suggestions on what would make that go easier go ahead and speak up. The last
> time I was mailing things out for review the main thing that came up was ioctls,
> but the ioctl interface hasn't really changed since then. I'm pretty confident
> in the on disk format stuff, which was the other thing that was mentioned.
>
> ----------
>
> This has been a monumental effort over a lot of years, and I'm _really_ happy
> with how it's turned out. I'm excited to finally unleash this upon the world.
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2020-06-10 11:02 ` Stefan K
@ 2020-06-10 11:32   ` Kent Overstreet
  0 siblings, 0 replies; 42+ messages in thread
From: Kent Overstreet @ 2020-06-10 11:32 UTC (permalink / raw)
  To: Stefan K; +Cc: linux-bcache

On Wed, Jun 10, 2020 at 01:02:10PM +0200, Stefan K wrote:
> ... one year later ...
> 
> what is the status now? the latest update on patreon is a few month old.
> What are the blockers to merge bcachefs into the kernel?
> 
> I hope that the bcachefs thing will be merged soon.

Status update coming. I did journalling of updates to interior btree nodes a few
months ago, to get rid of FUA writes and for other reasons, and that turned out to
be more work than expected; I'm still debugging the new btree key cache code, which
is needed to fully fix recovery from unclean shutdown.

Main blocker for merging is the dio cache coherency thing. That and working
through all known bugs. There's an outstanding data corruption bug someone in
the IRC channel is helping track down. Bleh. Debugging is endless.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 12/12] closures: fix a race on wakeup from closure_sync
  2019-07-22 17:22       ` Kent Overstreet
@ 2020-11-24 23:07         ` Marc Smith
  2020-11-25 18:10           ` Marc Smith
  0 siblings, 1 reply; 42+ messages in thread
From: Marc Smith @ 2020-11-24 23:07 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: Coly Li, linux-bcache

Hi,


On Mon, Jul 22, 2019 at 1:51 PM Kent Overstreet
<kent.overstreet@gmail.com> wrote:
>
> On Thu, Jul 18, 2019 at 03:46:46PM +0800, Coly Li wrote:
> > On 2019/7/16 6:47 下午, Coly Li wrote:
> > > Hi Kent,
> > >
> > > On 2019/6/11 3:14 上午, Kent Overstreet wrote:
> > >> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
> > > Acked-by: Coly Li <colyli@suse.de>
> > >
> > > And also I receive report for suspicious closure race condition in
> > > bcache, and people ask for having this patch into Linux v5.3.
> > >
> > > So before this patch gets merged into upstream, I plan to rebase it to
> > > drivers/md/bcache/closure.c at this moment. Of cause the author is you.
> > >
> > > When lib/closure.c merged into upstream, I will rebase all closure usage
> > > from bcache to use lib/closure.{c,h}.
> >
> > Hi Kent,
> >
> > The race bug reporter replies me that the closure race bug is very rare
> > to reproduce, after applying the patch and testing, they are not sure
> > whether their closure race problem is fixed or not.
> >
> > And I notice rcu_read_lock()/rcu_read_unlock() is used here, but it is
> > not clear to me what is the functionality of the rcu read lock in
> > closure_sync_fn(). I believe you have reason to use the rcu stuffs here,
> > could you please provide some hints to help me to understand the change
> > better ?
>
> The race was when a thread using closure_sync() notices cl->s->done == 1 before
> the thread calling closure_put() calls wake_up_process(). Then, it's possible
> for that thread to return and exit just before wake_up_process() is called - so
> we're trying to wake up a process that no longer exists.
>
> rcu_read_lock() is sufficient to protect against this, as there's an rcu barrier
> somewhere in the process teardown path.

Is rcu_read_lock() sufficient even in a kernel that uses
"CONFIG_PREEMPT_NONE=y"? I ask because I hit the following panic quite
frequently, and it sounds like exactly what you described, which this
patch is supposed to fix/prevent:
crash> bt
PID: 0      TASK: ffff88862461ab00  CPU: 2   COMMAND: "swapper/2"
 #0 [ffffc90000100a88] machine_kexec at ffffffff8103d6b5
 #1 [ffffc90000100ad0] __crash_kexec at ffffffff8110d37b
 #2 [ffffc90000100b98] crash_kexec at ffffffff8110e07d
 #3 [ffffc90000100ba8] oops_end at ffffffff8101a9de
 #4 [ffffc90000100bc8] no_context at ffffffff81045e99
 #5 [ffffc90000100c30] async_page_fault at ffffffff81e010cf
    [exception RIP: atomic_try_cmpxchg+2]
    RIP: ffffffff810d3e3b  RSP: ffffc90000100ce8  RFLAGS: 00010046
    RAX: 0000000000000000  RBX: 0000000000000003  RCX: 0000000080080007
    RDX: 0000000000000001  RSI: ffffc90000100cf4  RDI: 0000000100000a5b
    RBP: 00000000ffffffff   R8: 0000000000000001   R9: ffffffffa0422d4e
    R10: ffff888532779000  R11: ffff888532779000  R12: 0000000000000046
    R13: 0000000000000000  R14: ffff88858a062000  R15: 0000000100000a5b
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ffffc90000100ce8] _raw_spin_lock_irqsave at ffffffff81cf7d7d
 #7 [ffffc90000100d08] try_to_wake_up at ffffffff810c1624
 #8 [ffffc90000100d58] closure_sync_fn at ffffffffa0419b07 [bcache]
 #9 [ffffc90000100d60] clone_endio at ffffffff81aac48c
#10 [ffffc90000100d90] call_bio_endio at ffffffff81a78e20
#11 [ffffc90000100da8] raid_end_bio_io at ffffffff81a78e69
#12 [ffffc90000100dd8] raid1_end_write_request at ffffffff81a79ad9
#13 [ffffc90000100e48] blk_update_request at ffffffff814c3ab1
#14 [ffffc90000100e88] blk_mq_end_request at ffffffff814caaf2
#15 [ffffc90000100ea0] blk_mq_complete_request at ffffffff814c91c1
#16 [ffffc90000100ec8] nvme_complete_cqes at ffffffffa002fb03 [nvme]
#17 [ffffc90000100f08] nvme_irq at ffffffffa002fb7f [nvme]
#18 [ffffc90000100f30] __handle_irq_event_percpu at ffffffff810e0d60
#19 [ffffc90000100f70] handle_irq_event_percpu at ffffffff810e0e65
#20 [ffffc90000100f98] handle_irq_event at ffffffff810e0ecb
#21 [ffffc90000100fb0] handle_edge_irq at ffffffff810e494d
#22 [ffffc90000100fc8] do_IRQ at ffffffff81e01900
--- <IRQ stack> ---
#23 [ffffc90000077e38] ret_from_intr at ffffffff81e00a0f
    [exception RIP: default_idle+24]
    RIP: ffffffff81cf7938  RSP: ffffc90000077ee0  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: ffff88862461ab00  RCX: ffff888627a96000
    RDX: 0000000000000001  RSI: 0000000000000002  RDI: 0000000000000001
    RBP: 0000000000000000   R8: 000000cd42e4dffb   R9: 00023d75261d1b20
    R10: 00000000000001f0  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000002  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffdb  CS: 0010  SS: 0018
#24 [ffffc90000077ee0] do_idle at ffffffff810c4ee4
#25 [ffffc90000077f20] cpu_startup_entry at ffffffff810c514f
#26 [ffffc90000077f30] start_secondary at ffffffff81037045
#27 [ffffc90000077f50] secondary_startup_64 at ffffffff810000d4
crash> dis closure_sync_fn
0xffffffffa0419af4 <closure_sync_fn>:   mov    0x8(%rdi),%rax
0xffffffffa0419af8 <closure_sync_fn+4>: movl   $0x1,0x8(%rax)
0xffffffffa0419aff <closure_sync_fn+11>:        mov    (%rax),%rdi
0xffffffffa0419b02 <closure_sync_fn+14>:        callq
0xffffffff810c185f <wake_up_process>
0xffffffffa0419b07 <closure_sync_fn+19>:        retq
crash>

I'm using Linux 5.4.69.

--Marc

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 12/12] closures: fix a race on wakeup from closure_sync
  2020-11-24 23:07         ` Marc Smith
@ 2020-11-25 18:10           ` Marc Smith
  0 siblings, 0 replies; 42+ messages in thread
From: Marc Smith @ 2020-11-25 18:10 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: Coly Li, linux-bcache

On Tue, Nov 24, 2020 at 6:07 PM Marc Smith <msmith626@gmail.com> wrote:
>
> Hi,
>
>
> On Mon, Jul 22, 2019 at 1:51 PM Kent Overstreet
> <kent.overstreet@gmail.com> wrote:
> >
> > On Thu, Jul 18, 2019 at 03:46:46PM +0800, Coly Li wrote:
> > > On 2019/7/16 6:47 下午, Coly Li wrote:
> > > > Hi Kent,
> > > >
> > > > On 2019/6/11 3:14 上午, Kent Overstreet wrote:
> > > >> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
> > > > Acked-by: Coly Li <colyli@suse.de>
> > > >
> > > > And also I receive report for suspicious closure race condition in
> > > > bcache, and people ask for having this patch into Linux v5.3.
> > > >
> > > > So before this patch gets merged into upstream, I plan to rebase it to
> > > > drivers/md/bcache/closure.c at this moment. Of cause the author is you.
> > > >
> > > > When lib/closure.c merged into upstream, I will rebase all closure usage
> > > > from bcache to use lib/closure.{c,h}.
> > >
> > > Hi Kent,
> > >
> > > The race bug reporter replies me that the closure race bug is very rare
> > > to reproduce, after applying the patch and testing, they are not sure
> > > whether their closure race problem is fixed or not.
> > >
> > > And I notice rcu_read_lock()/rcu_read_unlock() is used here, but it is
> > > not clear to me what is the functionality of the rcu read lock in
> > > closure_sync_fn(). I believe you have reason to use the rcu stuffs here,
> > > could you please provide some hints to help me to understand the change
> > > better ?
> >
> > The race was when a thread using closure_sync() notices cl->s->done == 1 before
> > the thread calling closure_put() calls wake_up_process(). Then, it's possible
> > for that thread to return and exit just before wake_up_process() is called - so
> > we're trying to wake up a process that no longer exists.
> >
> > rcu_read_lock() is sufficient to protect against this, as there's an rcu barrier
> > somewhere in the process teardown path.
>
> Is rcu_read_lock() sufficient even in a kernel that uses
> "CONFIG_PREEMPT_NONE=y"? I ask because I hit this panic quite
> frequently and it sounds like what you described as this patch is
> supposed to fix/prevent:
> crash> bt
> PID: 0      TASK: ffff88862461ab00  CPU: 2   COMMAND: "swapper/2"
>  #0 [ffffc90000100a88] machine_kexec at ffffffff8103d6b5
>  #1 [ffffc90000100ad0] __crash_kexec at ffffffff8110d37b
>  #2 [ffffc90000100b98] crash_kexec at ffffffff8110e07d
>  #3 [ffffc90000100ba8] oops_end at ffffffff8101a9de
>  #4 [ffffc90000100bc8] no_context at ffffffff81045e99
>  #5 [ffffc90000100c30] async_page_fault at ffffffff81e010cf
>     [exception RIP: atomic_try_cmpxchg+2]
>     RIP: ffffffff810d3e3b  RSP: ffffc90000100ce8  RFLAGS: 00010046
>     RAX: 0000000000000000  RBX: 0000000000000003  RCX: 0000000080080007
>     RDX: 0000000000000001  RSI: ffffc90000100cf4  RDI: 0000000100000a5b
>     RBP: 00000000ffffffff   R8: 0000000000000001   R9: ffffffffa0422d4e
>     R10: ffff888532779000  R11: ffff888532779000  R12: 0000000000000046
>     R13: 0000000000000000  R14: ffff88858a062000  R15: 0000000100000a5b
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>  #6 [ffffc90000100ce8] _raw_spin_lock_irqsave at ffffffff81cf7d7d
>  #7 [ffffc90000100d08] try_to_wake_up at ffffffff810c1624
>  #8 [ffffc90000100d58] closure_sync_fn at ffffffffa0419b07 [bcache]
>  #9 [ffffc90000100d60] clone_endio at ffffffff81aac48c
> #10 [ffffc90000100d90] call_bio_endio at ffffffff81a78e20
> #11 [ffffc90000100da8] raid_end_bio_io at ffffffff81a78e69
> #12 [ffffc90000100dd8] raid1_end_write_request at ffffffff81a79ad9
> #13 [ffffc90000100e48] blk_update_request at ffffffff814c3ab1
> #14 [ffffc90000100e88] blk_mq_end_request at ffffffff814caaf2
> #15 [ffffc90000100ea0] blk_mq_complete_request at ffffffff814c91c1
> #16 [ffffc90000100ec8] nvme_complete_cqes at ffffffffa002fb03 [nvme]
> #17 [ffffc90000100f08] nvme_irq at ffffffffa002fb7f [nvme]
> #18 [ffffc90000100f30] __handle_irq_event_percpu at ffffffff810e0d60
> #19 [ffffc90000100f70] handle_irq_event_percpu at ffffffff810e0e65
> #20 [ffffc90000100f98] handle_irq_event at ffffffff810e0ecb
> #21 [ffffc90000100fb0] handle_edge_irq at ffffffff810e494d
> #22 [ffffc90000100fc8] do_IRQ at ffffffff81e01900
> --- <IRQ stack> ---
> #23 [ffffc90000077e38] ret_from_intr at ffffffff81e00a0f
>     [exception RIP: default_idle+24]
>     RIP: ffffffff81cf7938  RSP: ffffc90000077ee0  RFLAGS: 00000246
>     RAX: 0000000000000000  RBX: ffff88862461ab00  RCX: ffff888627a96000
>     RDX: 0000000000000001  RSI: 0000000000000002  RDI: 0000000000000001
>     RBP: 0000000000000000   R8: 000000cd42e4dffb   R9: 00023d75261d1b20
>     R10: 00000000000001f0  R11: 0000000000000000  R12: 0000000000000000
>     R13: 0000000000000002  R14: 0000000000000000  R15: 0000000000000000
>     ORIG_RAX: ffffffffffffffdb  CS: 0010  SS: 0018
> #24 [ffffc90000077ee0] do_idle at ffffffff810c4ee4
> #25 [ffffc90000077f20] cpu_startup_entry at ffffffff810c514f
> #26 [ffffc90000077f30] start_secondary at ffffffff81037045
> #27 [ffffc90000077f50] secondary_startup_64 at ffffffff810000d4
> crash> dis closure_sync_fn
> 0xffffffffa0419af4 <closure_sync_fn>:   mov    0x8(%rdi),%rax
> 0xffffffffa0419af8 <closure_sync_fn+4>: movl   $0x1,0x8(%rax)
> 0xffffffffa0419aff <closure_sync_fn+11>:        mov    (%rax),%rdi
> 0xffffffffa0419b02 <closure_sync_fn+14>:        callq
> 0xffffffff810c185f <wake_up_process>
> 0xffffffffa0419b07 <closure_sync_fn+19>:        retq
> crash>
>
> I'm using Linux 5.4.69.

Alright, I was a bit overzealous with that statement... it has nothing
to do with CONFIG_PREEMPT_NONE / RCU. The problem is that GCC is
reordering the code, so the generated assembly is not what you would
expect from reading the source:

I'm using GCC 9.1.0; when examining closure_sync_fn() it looks like this:
0000000000008af4 <closure_sync_fn>:
    8af4:       48 8b 47 08             mov    0x8(%rdi),%rax
    8af8:       c7 40 08 01 00 00 00    movl   $0x1,0x8(%rax)
    8aff:       48 8b 38                mov    (%rax),%rdi
    8b02:       e8 00 00 00 00          callq  8b07 <closure_sync_fn+0x13>
    8b07:       c3                      retq

This is what the function looks like in bcache:
static void closure_sync_fn(struct closure *cl)
{
        struct closure_syncer *s = cl->s;
        struct task_struct *p;

        rcu_read_lock();
        p = READ_ONCE(s->task);
        s->done = 1;
        wake_up_process(p);
        rcu_read_unlock();
}

In the assembly shown above, "p = READ_ONCE(s->task);" and "s->done = 1;"
are swapped, so they effectively execute in this order:
s->done = 1;
p = READ_ONCE(s->task);

(READ_ONCE() only makes that one load volatile; the plain store to s->done
has no ordering against it, so the compiler is free to move the store ahead
of the load.)

I believe the crash is then caused because in some cases p is invalid: the
waiting thread sees done == 1 and finishes before s->task is read, so we
end up calling wake_up_process() with a bad p value. Please let me know if
I'm misunderstanding that.

I applied the following patch to bcache in 5.4.69:
--- closure.c.orig 2020-11-25 01:03:03.899962395 -0500
+++ closure.c 2020-11-24 23:59:29.555465288 -0500
@@ -110,7 +110,7 @@
 
         rcu_read_lock();
         p = READ_ONCE(s->task);
-        s->done = 1;
+        WRITE_ONCE(s->done, 1);
         wake_up_process(p);
         rcu_read_unlock();
 }
@@ -123,8 +123,10 @@
         continue_at(cl, closure_sync_fn, NULL);
 
         while (1) {
+                int done;
                 set_current_state(TASK_UNINTERRUPTIBLE);
-                if (s.done)
+                done = READ_ONCE(s.done);
+                if (done)
                         break;
                 schedule();
         }


(I modified __closure_sync() as well, to be safe, to use READ_ONCE(),
though that may not be needed.)


Then, after applying that change and building with GCC 9.1.0, the
closure_sync_fn() function looks like this (correct):
0000000000008af4 <closure_sync_fn>:
    8af4:       48 8b 47 08             mov    0x8(%rdi),%rax
    8af8:       48 8b 38                mov    (%rax),%rdi
    8afb:       c7 40 08 01 00 00 00    movl   $0x1,0x8(%rax)
    8b02:       e8 00 00 00 00          callq  8b07 <closure_sync_fn+0x13>
    8b07:       c3                      retq


I also built with an older GCC version and do not see this reordering in
the binary it produces.


--Marc


>
> --Marc

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2020-11-25 18:11 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
2019-06-10 19:14 ` [PATCH 01/12] Compiler Attributes: add __flatten Kent Overstreet
2019-06-12 17:16   ` Greg KH
2019-06-10 19:14 ` [PATCH 02/12] locking: SIX locks (shared/intent/exclusive) Kent Overstreet
2019-06-10 19:14 ` [PATCH 03/12] mm: pagecache add lock Kent Overstreet
2019-06-10 19:14 ` [PATCH 04/12] mm: export find_get_pages() Kent Overstreet
2019-06-10 19:14 ` [PATCH 05/12] fs: insert_inode_locked2() Kent Overstreet
2019-06-10 19:14 ` [PATCH 06/12] fs: factor out d_mark_tmpfile() Kent Overstreet
2019-06-10 19:14 ` [PATCH 07/12] Propagate gfp_t when allocating pte entries from __vmalloc Kent Overstreet
2019-06-10 19:14 ` [PATCH 08/12] block: Add some exports for bcachefs Kent Overstreet
2019-06-10 19:14 ` [PATCH 09/12] bcache: optimize continue_at_nobarrier() Kent Overstreet
2019-06-10 19:14 ` [PATCH 10/12] bcache: move closures to lib/ Kent Overstreet
2019-06-11 10:25   ` Coly Li
2019-06-13  7:28   ` Christoph Hellwig
2019-06-13 11:04     ` Kent Overstreet
2019-06-10 19:14 ` [PATCH 11/12] closures: closure_wait_event() Kent Overstreet
2019-06-11 10:25   ` Coly Li
2019-06-12 17:17   ` Greg KH
2019-06-10 19:14 ` [PATCH 12/12] closures: fix a race on wakeup from closure_sync Kent Overstreet
2019-07-16 10:47   ` Coly Li
2019-07-18  7:46     ` Coly Li
2019-07-22 17:22       ` Kent Overstreet
2020-11-24 23:07         ` Marc Smith
2020-11-25 18:10           ` Marc Smith
2019-06-10 20:46 ` bcachefs status update (it's done cooking; let's get this sucker merged) Linus Torvalds
2019-06-11  1:17   ` Kent Overstreet
2019-06-11  4:33     ` Dave Chinner
2019-06-12 16:21       ` Kent Overstreet
2019-06-12 23:02         ` Dave Chinner
2019-06-19  8:21           ` Jan Kara
2019-07-03  1:04             ` [PATCH] mm: Support madvise_willneed override by Filesystems Boaz Harrosh
2019-07-03 17:21               ` Jan Kara
2019-07-03 18:03                 ` Boaz Harrosh
2019-06-11  4:55     ` bcachefs status update (it's done cooking; let's get this sucker merged) Linus Torvalds
2019-06-11 14:26       ` Matthew Wilcox
2019-06-11  4:10   ` Dave Chinner
2019-06-11  4:39     ` Linus Torvalds
2019-06-11  7:10       ` Dave Chinner
2019-06-12  2:07         ` Linus Torvalds
2019-07-03  5:59 ` Stefan K
2020-06-10 11:02 ` Stefan K
2020-06-10 11:32   ` Kent Overstreet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).