Updated scalable urandom patchkit

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Updated scalable urandom patchkit
@ 2015-09-24 17:19 Andi Kleen
  2015-09-24 17:19 ` [PATCH 1/3] Make /dev/urandom scalable Andi Kleen
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Andi Kleen @ 2015-09-24 17:19 UTC (permalink / raw)
  To: tytso; +Cc: linux-kernel, linux

Updated patchkit to make /dev/urandom scale on larger systesm.
I addressed all the review comments. See ChangeLogs in the individual
patches.

-Andi


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/3] Make /dev/urandom scalable
  2015-09-24 17:19 Updated scalable urandom patchkit Andi Kleen
@ 2015-09-24 17:19 ` Andi Kleen
  2015-09-30 14:40   ` Rasmus Villemoes
  2015-09-24 17:19 ` [PATCH 2/3] random: Make input to output pool balancing per cpu Andi Kleen
  2015-09-24 17:19 ` [PATCH 3/3] random: Add pool name to urandom_read trace point Andi Kleen
  2 siblings, 1 reply; 26+ messages in thread
From: Andi Kleen @ 2015-09-24 17:19 UTC (permalink / raw)
  To: tytso; +Cc: linux-kernel, linux, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

We had a case where a 4 socket system spent >80% of its total CPU time
contending on the global urandom nonblocking pool spinlock. While the
application could probably have used an own PRNG, it may have valid
reasons to use the best possible key for different session keys.

The application still ran acceptable under 2S, but just fell over
the locking cliff on 4S.

Implementation
==============

The non blocking pool is used widely these days, from every execve() (to
set up AT_RANDOM for ld.so randomization), to getrandom(3) and to frequent
/dev/urandom users in user space. Clearly having such a popular resource
under a global lock is bad thing.

This patch changes the random driver to use distributed per NUMA node
nonblocking pools. The basic structure is not changed: entropy is
first fed into the input pool and later from there distributed
round-robin into the blocking and non blocking pools. This patch extends
this to use an dedicated non blocking pool for each node, and distribute
evenly from the input pool into these distributed pools, in
addition to the blocking pool.

Then every urandom/getrandom user fetches data from its node local
pool. At boot time when users may be still waiting for the non
blocking pool initialization we use the node 0 non blocking pool,
to avoid the need for different wake up queues.

For single node systems (like the vast majority of non server systems)
nothing changes. There is still only a single non blocking pool.

For other systems the original nonblocking pool is used until this
pool has 128 bits worth of entropy. After that it is used
to initialize the other pools. This already gives them
different init states, so they don't run in lock-step
to avoid "replay" attacks.

Since we still have a global input pool there are no problems
with load balancing entropy data between nodes. Any node that never
runs any interrupts would still get the same amount of entropy as
other nodes.

Entropy is fed preferably to nodes that need it more using
the existing 75% threshold.

For saving/restoring /dev/urandom, there is currently no mechanism
to access the non local node pool (short of setting task affinity).
This implies that currently the standard init/exit random save/restore
scripts would only save node 0. On restore all pools are updates.
So the entropy of non 0 gets lost over reboot. That seems acceptable
to me for now (fixing this would need a new separate save/restore interface)

Scalability
===========

I tested the patch with a simple will-it-scale test banging
on get_random() in parallel on more and more CPUs. Of course
that is not a realistic scenario, as real programs should
do some work between getting random numbers. But it's a worst
case for the random scalability.

On a 4S Xeon v3 system _without_ the patchkit the benchmark
maxes out when using all the threads on one node. After
that it quickly settles to about half the throughput of
one node with 2-4 nodes.

(all throughput factors, bigger is better)
Without patchkit:

1 node:  1x
2 nodes: 0.75x
3 nodes: 0.55x
4 nodes: 0.42x

With the patchkit applied:

1 node:  1x
2 nodes: 2x
3 nodes: 2.4x
4 nodes: 3x

So it's not quite linear scalability, but 3x maximum throughput
is already a lot better.

A node can still have a large number of CPUs: on my test system 36
logical software threads (18C * 2T). In principle it may make
sense to split it up further. Per logical CPU would be clearly
overkill. But that would also add more pressure on the input
pools. For now per node seems like a acceptable compromise.

/dev/random still uses a single global lock. For now that seems
acceptable as it normally cannot be used for real high volume
accesses anyways.

The input pool also still uses a global lock. The existing per CPU
fast pool and "give up when busy" mechanism seems to scale well enough
even on larger systems.

v2: Fix name of pool 0. Fix race with interrupts. Make
iteration loops slightly more efficient. Add ifdefs to avoid
any extra code on non-NUMA. Delay other pool use to when
the original pool initialized and initialize the pools from
pool 0. Add comments on memory allocation.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 drivers/char/random.c | 185 ++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 158 insertions(+), 27 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d0da5d8..333a70c 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -156,6 +156,17 @@
  * particular randomness source.  They do this by keeping track of the
  * first and second order deltas of the event timings.
  *
+ * Distributed pools
+ * =================
+ *
+ * On larger systems the locking on the single non blocking pool can
+ * become a bottleneck. To avoid this, we use an own non blocking pool
+ * for each NUMA node. The distributed pools are fed round robin from
+ * the input pool. Each user then only reads entropy from their local
+ * pool.
+ *
+ * For the blocking pool there is still only a single instance.
+ *
  * Ensuring unpredictability at system startup
  * ============================================
  *
@@ -467,7 +478,7 @@ static struct entropy_store blocking_pool = {
 
 static struct entropy_store nonblocking_pool = {
 	.poolinfo = &poolinfo_table[1],
-	.name = "nonblocking",
+	.name = "nonblocking pool 0",
 	.pull = &input_pool,
 	.lock = __SPIN_LOCK_UNLOCKED(nonblocking_pool.lock),
 	.pool = nonblocking_pool_data,
@@ -475,6 +486,41 @@ static struct entropy_store nonblocking_pool = {
 					push_to_pool),
 };
 
+/*
+ * Per NUMA node nonblocking pool. This avoids lock contention
+ * when many processes extract data from /dev/urandom in parallel.
+ * /dev/random stays single instance for now.
+ *
+ * This variable is only set up once all non node 0 pools
+ * are initialized.
+ */
+#ifdef CONFIG_NUMA
+static struct entropy_store **nonblocking_node_pool __read_mostly;
+static struct entropy_store **early_nonblocking_node_pool __read_mostly;
+#else
+#define nonblocking_node_pool ((struct entropy_store **)NULL)
+#endif
+
+/* User must check for zero nonblocking_node_pool */
+#define for_each_nb_pool(i, pool) \
+	{ int num_nodes = num_possible_nodes(); \
+		for (i = 0; i < num_nodes; i++) { \
+			pool = nonblocking_node_pool[i];
+#define end_for_each_nb() } }
+
+static inline struct entropy_store *get_nonblocking_pool(void)
+{
+	struct entropy_store *pool = &nonblocking_pool;
+
+	/*
+	 * Non node 0 pools may take longer to initialize. Keep using
+	 * the boot nonblocking pool while this happens.
+	 */
+	if (nonblocking_node_pool)
+		pool = nonblocking_node_pool[numa_node_id()];
+	return pool;
+}
+
 static __u32 const twist_table[8] = {
 	0x00000000, 0x3b6e20c8, 0x76dc4190, 0x4db26158,
 	0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278 };
@@ -608,6 +654,25 @@ static void process_random_ready_list(void)
 	spin_unlock_irqrestore(&random_ready_list_lock, flags);
 }
 
+#define POOL_INIT_BYTES (128 / 8)
+
+void init_node_pools(void)
+{
+#ifdef CONFIG_NUMA
+	int i;
+	for (i = 0; i < num_possible_nodes(); i++) {
+		struct entropy_store *pool = early_nonblocking_node_pool[i];
+		char buf[POOL_INIT_BYTES];
+
+		get_random_bytes(buf, sizeof buf);
+		mix_pool_bytes(pool, buf, sizeof buf);
+	}
+	/* Initialize first before setting variable */
+	mb();
+	nonblocking_node_pool = early_nonblocking_node_pool;
+#endif
+}
+
 /*
  * Credit (or debit) the entropy store with n bits of entropy.
  * Use credit_entropy_bits_safe() if the value comes from userspace
@@ -681,6 +746,7 @@ retry:
 			prandom_reseed_late();
 			process_random_ready_list();
 			wake_up_all(&urandom_init_wait);
+			init_node_pools();
 			pr_notice("random: %s pool is initialized\n", r->name);
 		}
 	}
@@ -698,18 +764,23 @@ retry:
 			kill_fasync(&fasync, SIGIO, POLL_IN);
 		}
 		/* If the input pool is getting full, send some
-		 * entropy to the two output pools, flipping back and
-		 * forth between them, until the output pools are 75%
+		 * entropy to the other output pools, cycling
+		 * between them, until the output pools are 75%
 		 * full.
+		 * Feed into all pools round robin.
 		 */
 		if (entropy_bits > random_write_wakeup_bits &&
 		    r->initialized &&
 		    r->entropy_total >= 2*random_read_wakeup_bits) {
 			static struct entropy_store *last = &blocking_pool;
+			static int next_pool = -1;
 			struct entropy_store *other = &blocking_pool;
 
-			if (last == &blocking_pool)
-				other = &nonblocking_pool;
+			/* -1: use blocking pool, 0<=max_node: node nb pool */
+			if (next_pool > -1)
+				other = nonblocking_node_pool[next_pool];
+			if (++next_pool >= num_possible_nodes())
+				next_pool = -1;
 			if (other->entropy_count <=
 			    3 * other->poolinfo->poolfracbits / 4)
 				last = other;
@@ -748,6 +819,17 @@ struct timer_rand_state {
 
 #define INIT_TIMER_RAND_STATE { INITIAL_JIFFIES, };
 
+static void pool_add_buf(struct entropy_store *pool,
+			 const void *buf, unsigned int size)
+{
+	unsigned long flags;
+	unsigned long time = random_get_entropy() ^ jiffies;
+	spin_lock_irqsave(&pool->lock, flags);
+	_mix_pool_bytes(pool, buf, size);
+	_mix_pool_bytes(pool, &time, sizeof(time));
+	spin_unlock_irqrestore(&pool->lock, flags);
+}
+
 /*
  * Add device- or boot-specific data to the input and nonblocking
  * pools to help initialize them to unique values.
@@ -758,19 +840,18 @@ struct timer_rand_state {
  */
 void add_device_randomness(const void *buf, unsigned int size)
 {
-	unsigned long time = random_get_entropy() ^ jiffies;
-	unsigned long flags;
+	struct entropy_store *pool;
+	int i;
 
 	trace_add_device_randomness(size, _RET_IP_);
-	spin_lock_irqsave(&input_pool.lock, flags);
-	_mix_pool_bytes(&input_pool, buf, size);
-	_mix_pool_bytes(&input_pool, &time, sizeof(time));
-	spin_unlock_irqrestore(&input_pool.lock, flags);
-
-	spin_lock_irqsave(&nonblocking_pool.lock, flags);
-	_mix_pool_bytes(&nonblocking_pool, buf, size);
-	_mix_pool_bytes(&nonblocking_pool, &time, sizeof(time));
-	spin_unlock_irqrestore(&nonblocking_pool.lock, flags);
+	pool_add_buf(&input_pool, buf, size);
+	if (!nonblocking_node_pool) {
+		pool_add_buf(&nonblocking_pool, buf, size);
+	} else {
+		for_each_nb_pool (i, pool) {
+			pool_add_buf(pool, buf, size);
+		} end_for_each_nb()
+	}
 }
 EXPORT_SYMBOL(add_device_randomness);
 
@@ -1252,15 +1333,16 @@ static ssize_t extract_entropy_user(struct entropy_store *r, void __user *buf,
  */
 void get_random_bytes(void *buf, int nbytes)
 {
+	struct entropy_store *pool = get_nonblocking_pool();
 #if DEBUG_RANDOM_BOOT > 0
-	if (unlikely(nonblocking_pool.initialized == 0))
+	if (unlikely(pool->initialized == 0))
 		printk(KERN_NOTICE "random: %pF get_random_bytes called "
 		       "with %d bits of entropy available\n",
 		       (void *) _RET_IP_,
-		       nonblocking_pool.entropy_total);
+		       pool->entropy_total);
 #endif
 	trace_get_random_bytes(nbytes, _RET_IP_);
-	extract_entropy(&nonblocking_pool, buf, nbytes, 0, 0);
+	extract_entropy(pool, buf, nbytes, 0, 0);
 }
 EXPORT_SYMBOL(get_random_bytes);
 
@@ -1350,7 +1432,7 @@ void get_random_bytes_arch(void *buf, int nbytes)
 	}
 
 	if (nbytes)
-		extract_entropy(&nonblocking_pool, p, nbytes, 0, 0);
+		extract_entropy(get_nonblocking_pool(), p, nbytes, 0, 0);
 }
 EXPORT_SYMBOL(get_random_bytes_arch);
 
@@ -1390,12 +1472,43 @@ static void init_std_data(struct entropy_store *r)
  * initializations complete at compile time. We should also
  * take care not to overwrite the precious per platform data
  * we were given.
+ *
+ * This is early initialization, if memory allocations of a few
+ * bytes fails the kernel is doomed anyways. So use __GFP_NOFAIL.
  */
 static int rand_initialize(void)
 {
+#ifdef CONFIG_NUMA
+	int i;
+	int num_nodes = num_possible_nodes();
+	struct entropy_store **pool_list;
+#endif
+
 	init_std_data(&input_pool);
 	init_std_data(&blocking_pool);
 	init_std_data(&nonblocking_pool);
+#ifdef CONFIG_NUMA
+	pool_list = kmalloc(num_nodes * sizeof(void *),
+					GFP_KERNEL|__GFP_NOFAIL);
+	pool_list[0] = &nonblocking_pool;
+	for (i = 1; i < num_nodes; i++) {
+		struct entropy_store *pool;
+
+		pool = kzalloc(sizeof(struct entropy_store),
+				GFP_KERNEL|__GFP_NOFAIL);
+		pool_list[i] = pool;
+		pool->poolinfo = &poolinfo_table[1];
+		pool->pull = &input_pool;
+		spin_lock_init(&pool->lock);
+		/* pool data not cleared intentionally */
+		pool->pool = kmalloc(sizeof(nonblocking_pool_data),
+				     GFP_KERNEL|__GFP_NOFAIL);
+		INIT_WORK(&pool->push_work, push_to_pool);
+		pool->name = kasprintf(GFP_KERNEL|__GFP_NOFAIL,
+				"nonblocking pool %d", i);
+	}
+	early_nonblocking_node_pool = pool_list;
+#endif
 	return 0;
 }
 early_initcall(rand_initialize);
@@ -1458,17 +1571,19 @@ static ssize_t
 urandom_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 {
 	int ret;
+	struct entropy_store *pool;
 
-	if (unlikely(nonblocking_pool.initialized == 0))
+	pool = get_nonblocking_pool();
+	if (unlikely(pool->initialized == 0))
 		printk_once(KERN_NOTICE "random: %s urandom read "
 			    "with %d bits of entropy available\n",
 			    current->comm, nonblocking_pool.entropy_total);
 
 	nbytes = min_t(size_t, nbytes, INT_MAX >> (ENTROPY_SHIFT + 3));
-	ret = extract_entropy_user(&nonblocking_pool, buf, nbytes);
+	ret = extract_entropy_user(pool, buf, nbytes);
 
-	trace_urandom_read(8 * nbytes, ENTROPY_BITS(&nonblocking_pool),
-			   ENTROPY_BITS(&input_pool));
+	trace_urandom_read(8 * nbytes, ENTROPY_BITS(pool),
+			   ENTROPY_BITS(pool));
 	return ret;
 }
 
@@ -1513,22 +1628,34 @@ static ssize_t random_write(struct file *file, const char __user *buffer,
 			    size_t count, loff_t *ppos)
 {
 	size_t ret;
+	struct entropy_store *pool;
+	int i;
 
 	ret = write_pool(&blocking_pool, buffer, count);
 	if (ret)
 		return ret;
-	ret = write_pool(&nonblocking_pool, buffer, count);
-	if (ret)
-		return ret;
+	if (nonblocking_node_pool) {
+		for_each_nb_pool (i, pool) {
+			ret = write_pool(pool, buffer, count);
+			if (ret)
+				return ret;
+		} end_for_each_nb()
+	} else {
+		ret = write_pool(&nonblocking_pool, buffer, count);
+		if (ret)
+			return ret;
+	}
 
 	return (ssize_t)count;
 }
 
 static long random_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
 {
+	int i;
 	int size, ent_count;
 	int __user *p = (int __user *)arg;
 	int retval;
+	struct entropy_store *pool;
 
 	switch (cmd) {
 	case RNDGETENTCNT:
@@ -1569,6 +1696,10 @@ static long random_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
 			return -EPERM;
 		input_pool.entropy_count = 0;
 		nonblocking_pool.entropy_count = 0;
+		if (nonblocking_node_pool)
+			for_each_nb_pool (i, pool) {
+				pool->entropy_count = 0;
+			} end_for_each_nb();
 		blocking_pool.entropy_count = 0;
 		return 0;
 	default:
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/3] random: Make input to output pool balancing per cpu
  2015-09-24 17:19 Updated scalable urandom patchkit Andi Kleen
  2015-09-24 17:19 ` [PATCH 1/3] Make /dev/urandom scalable Andi Kleen
@ 2015-09-24 17:19 ` Andi Kleen
  2015-09-24 17:19 ` [PATCH 3/3] random: Add pool name to urandom_read trace point Andi Kleen
  2 siblings, 0 replies; 26+ messages in thread
From: Andi Kleen @ 2015-09-24 17:19 UTC (permalink / raw)
  To: tytso; +Cc: linux-kernel, linux, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

The load balancing from input pool to output pools was
essentially unlocked. Before it didn't matter much because
there were only two choices (blocking and non blocking).

But now with the distributed non blocking pools we have
a lot more pools, and unlocked access of the counters
may systematically deprive some nodes from their deserved
entropy.

Turn the round-robin state into per CPU variables
to avoid any possibility of races. This code already
runs with preemption disabled.

v2: Check for non initialized pools.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 drivers/char/random.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 333a70c..31141b4 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -772,15 +772,20 @@ retry:
 		if (entropy_bits > random_write_wakeup_bits &&
 		    r->initialized &&
 		    r->entropy_total >= 2*random_read_wakeup_bits) {
-			static struct entropy_store *last = &blocking_pool;
-			static int next_pool = -1;
-			struct entropy_store *other = &blocking_pool;
+			static DEFINE_PER_CPU(struct entropy_store *, lastp) =
+				&blocking_pool;
+			static DEFINE_PER_CPU(int, next_pool);
+			struct entropy_store *other = &blocking_pool, *last;
+			int np;
 
 			/* -1: use blocking pool, 0<=max_node: node nb pool */
-			if (next_pool > -1)
-				other = nonblocking_node_pool[next_pool];
-			if (++next_pool >= num_possible_nodes())
-				next_pool = -1;
+			np = __this_cpu_read(next_pool);
+			if (np > -1 && nonblocking_node_pool)
+				other = nonblocking_node_pool[np];
+			if (++np >= num_possible_nodes())
+				np = -1;
+			__this_cpu_write(next_pool, np);
+			last = __this_cpu_read(lastp);
 			if (other->entropy_count <=
 			    3 * other->poolinfo->poolfracbits / 4)
 				last = other;
@@ -789,6 +794,7 @@ retry:
 				schedule_work(&last->push_work);
 				r->entropy_total = 0;
 			}
+			__this_cpu_write(lastp, last);
 		}
 	}
 }
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/3] random: Add pool name to urandom_read trace point
  2015-09-24 17:19 Updated scalable urandom patchkit Andi Kleen
  2015-09-24 17:19 ` [PATCH 1/3] Make /dev/urandom scalable Andi Kleen
  2015-09-24 17:19 ` [PATCH 2/3] random: Make input to output pool balancing per cpu Andi Kleen
@ 2015-09-24 17:19 ` Andi Kleen
  2 siblings, 0 replies; 26+ messages in thread
From: Andi Kleen @ 2015-09-24 17:19 UTC (permalink / raw)
  To: tytso; +Cc: linux-kernel, linux, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Now that we have multiple nonblocking_pools it makes sense to report
the name of the pool in the urandom_read trace point. Extend
the trace point to report the name too.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 drivers/char/random.c         |  2 +-
 include/trace/events/random.h | 10 ++++++----
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 31141b4..c44831d 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1588,7 +1588,7 @@ urandom_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 	nbytes = min_t(size_t, nbytes, INT_MAX >> (ENTROPY_SHIFT + 3));
 	ret = extract_entropy_user(pool, buf, nbytes);
 
-	trace_urandom_read(8 * nbytes, ENTROPY_BITS(pool),
+	trace_urandom_read(pool->name, 8 * nbytes, ENTROPY_BITS(pool),
 			   ENTROPY_BITS(pool));
 	return ret;
 }
diff --git a/include/trace/events/random.h b/include/trace/events/random.h
index 4684de3..3b5ab5a 100644
--- a/include/trace/events/random.h
+++ b/include/trace/events/random.h
@@ -288,25 +288,27 @@ TRACE_EVENT(random_read,
 );
 
 TRACE_EVENT(urandom_read,
-	TP_PROTO(int got_bits, int pool_left, int input_left),
+	TP_PROTO(const char *pool, int got_bits, int pool_left, int input_left),
 
-	TP_ARGS(got_bits, pool_left, input_left),
+	TP_ARGS(pool, got_bits, pool_left, input_left),
 
 	TP_STRUCT__entry(
+		__field( const char *,  pool			)
 		__field(	  int,	got_bits		)
 		__field(	  int,	pool_left		)
 		__field(	  int,	input_left		)
 	),
 
 	TP_fast_assign(
+		__entry->pool		= pool;
 		__entry->got_bits	= got_bits;
 		__entry->pool_left	= pool_left;
 		__entry->input_left	= input_left;
 	),
 
 	TP_printk("got_bits %d nonblocking_pool_entropy_left %d "
-		  "input_entropy_left %d", __entry->got_bits,
-		  __entry->pool_left, __entry->input_left)
+		  "input_entropy_left %d from %s", __entry->got_bits,
+		  __entry->pool_left, __entry->input_left, __entry->pool)
 );
 
 #endif /* _TRACE_RANDOM_H */
-- 
2.4.3


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/3] Make /dev/urandom scalable
  2015-09-24 17:19 ` [PATCH 1/3] Make /dev/urandom scalable Andi Kleen
@ 2015-09-30 14:40   ` Rasmus Villemoes
  0 siblings, 0 replies; 26+ messages in thread
From: Rasmus Villemoes @ 2015-09-30 14:40 UTC (permalink / raw)
  To: Andi Kleen; +Cc: tytso, linux-kernel, Andi Kleen

On Thu, Sep 24 2015, Andi Kleen <andi@firstfloor.org> wrote:

>
> v2: Fix name of pool 0. Fix race with interrupts. Make
> iteration loops slightly more efficient. Add ifdefs to avoid
> any extra code on non-NUMA. Delay other pool use to when
> the original pool initialized and initialize the pools from
> pool 0. Add comments on memory allocation.

More bikeshedding. Please ignore unless you do a v3 anyway.

>  static struct entropy_store nonblocking_pool = {
>  	.poolinfo = &poolinfo_table[1],
> -	.name = "nonblocking",
> +	.name = "nonblocking pool 0",

Hm, yeah, but then you should probably also update the blocking pool's
name (or use the format "nonblocking %d").

> +#define POOL_INIT_BYTES (128 / 8)
> +
> +void init_node_pools(void)
> +{
> +#ifdef CONFIG_NUMA
> +	int i;
> +	for (i = 0; i < num_possible_nodes(); i++) {

int num_nodes = num_possible_nodes();
for (i = 0; i < num_nodes; i++) {

>  static long random_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
>  {
> +	int i;
>  	int size, ent_count;
>  	int __user *p = (int __user *)arg;
>  	int retval;
> +	struct entropy_store *pool;
>  
>  	switch (cmd) {
>  	case RNDGETENTCNT:
> @@ -1569,6 +1696,10 @@ static long random_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
>  			return -EPERM;
>  		input_pool.entropy_count = 0;
>  		nonblocking_pool.entropy_count = 0;
> +		if (nonblocking_node_pool)
> +			for_each_nb_pool (i, pool) {
> +				pool->entropy_count = 0;
> +			} end_for_each_nb();

extra ;. harmless in this case, but could cause slightly hard-to-understand build
failures if that if grows an else.

Rasmus

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Updated scalable urandom patchkit
@ 2015-10-10 18:45 George Spelvin
  2015-10-11  2:31 ` Theodore Ts'o
  0 siblings, 1 reply; 26+ messages in thread
From: George Spelvin @ 2015-10-10 18:45 UTC (permalink / raw)
  To: andi; +Cc: ahferroin7, jepler, linux, linux-kernel, linux, tytso

I'm very very sorry to be so late to the party; I didn't see this thread
until the week-delayed LWN article on the subject drew my attention to it.

There's a fundamental design issue with this patch set that seems
unnecessary to me: multiple nonblocking pools.

I know, I know, that seems like the whole point of it, but hear me out.

Entropy is a holographic property of the pool, not located in any
particular bit position.

And the most basic operation, of reading from the pool, can easily be
done by multiple readers at the same time from the same bit pattern.
They just need to be salted with distinguishing nonces (CPU IDs will do
nicely) to ensure they all get different results.  So a reader doesn't
just get hash(pool[]), but hash(unique_nonce, pool[]).

This in turn lets the spin_lock_irqsave() in extract_buf be moved to
after the sha_transform() calls.

I think the whole thing can be done, more elegantly, using a single pool
and judiciously relaxed locking rules.  You don't even need reader-writer
locks; it's actually beneficial rather than harmful if there's a race
between a pool reader and writer.

In general, fewer larger pools is better for entropy storage.  The only
reason for the current three-pool design is to achieve strict isolation
of the /dev/random pool.

The two operations that require locking are:

1. Entropy accounting.  However, that only has to be strict
   for /dev/random.  For /dev/urandom, it can be late in various ways.
   One possibility is to have separate entropy accounding per NUMA
   node; only the pool proper has to be shared.

2. Add-back of the output to the pool.  This involves writing to pool,
   but for non-invertibility, it only has to be do ne byone of a number of
   concurrent readers.  I think significant cleverness can be applied
   here.

There are a lot of things that can be done about this second point:

2a. The obvious one is to batch the add-back.  For backtracking
    protection, we only need to do one add-back per read, not one per
    10 bytes of output.  Combined with the above change that locks only
    around the add-back and not the pool hashing, this greatly reduces
    both the lock window and the number of times it's taken.

2b. Another idea would be to optimize for 16 bytes.  I like the basic
    concept of having a larger hash internal state than the output, but
    would it be possible to output 16 of the 20 bytes of SHA-1 output
    rather than the current 10?  This is obviously a Ted question (the
    current folding is his idea), but even 32 bits is enough to protect
    against internal collisions in the hash state.

2c. Another possibility would be to add small separate "salt piles"
    which are initialized to the CPU id, and stirred by reads.  They only
    need to be large enough to not experience birthsay collisions even
    assuming a stupid number of reads between stirs of the main pool.
    (So 128 bits would be more than ample.)  A read would consist of:

	hash[5] = { initial values }
	sha1(hash, my_salt)
	sha1(hash, main_pool)
	if (spin_trylock(main_pool)) {
		add_back(hash, main_pool);
		spin_unlock(main_pool);
	} else {
		add_back(hash, my_salt);
	}

2d. A more complete solution involves noting that, if there are multiple
    concurrent readers, only one has to add back its output to prevent
    backtracking, because all of the concurrent reads are equivalent
    in that sense.  (Even though, because they're salted with separate
    nonces like the CPU id, they're not identical.)

I'm thinking in terms of a "generation counter" for the pool.  (Perhaps
implemented as a seqlock.)  The readers of the generation-n pool must
ensure that *someone* is responsible for bumping the generation to n+1
to prevent backtracking of the generation-n state, but only one.

A key point is that the add-back that bumps the generation to n+2 may
actaully be a hash of the generation-n pool state, the generation-n+1
state, or some transient intermedicate state due to races with the
generation-n+1 add-back.  It is not necessary for readers to wait for
the pool to be stable, only that it's a non-invertible transformation
based on some earlier generation.

Now, to actually implement that, a "spin lock while checking extra
condition" would be ideal, but I do not want to add Yet Another Locking
Primitive to the kernel.

One possibility would be to accept the possibility of a race condition
breaking anti-backtracking.  Another reader will come along soom enough
and fix it.

But if that's not okay, consider the following:

We add a second lock, the "responsibility lock", the holder of which
is responsible for doing anti-backtracking.

After hashing the pool, each reader does the following:
1. (Optional) trylock() the pool lock.
   This is a low-load optimization.  If it succeeds, go
   directly to step 5.
2. Trylock() the responsibility lock.
   If this fails, we're done; return.
3. Wait for the pool lock.
4. Drop the responsibility lock.  This must be done before
   updating the pool proper.
5. Add our anti-backtracking bytes to the pool.
6. Release the pool lock.

(I originally thought I'd need to ping-pong between two responsibility
locks based on the generation counter % 2, but now I don't think so.)

Under high load, you have one processor adding back, a second waiting
to add back, and everyone else just sails right through without
waiting at all.

Details to be filled in:
- Each reader needs some sort of nonce to distinguish multiple reads
  if there's no main pool add-back between them.  RDTSC, per-cpu
  counter, or whatever.
- Entropy accounting.  As mentioned, some sloppiness is okay, so I
  think it's solvable, but I haven't looked at it yet.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-10 18:45 Updated scalable urandom patchkit George Spelvin
@ 2015-10-11  2:31 ` Theodore Ts'o
  2015-10-11  2:53   ` Theodore Ts'o
  2015-10-11  4:35   ` George Spelvin
  0 siblings, 2 replies; 26+ messages in thread
From: Theodore Ts'o @ 2015-10-11  2:31 UTC (permalink / raw)
  To: George Spelvin; +Cc: andi, ahferroin7, jepler, linux-kernel, linux

On Sat, Oct 10, 2015 at 02:45:46PM -0400, George Spelvin wrote:
> In general, fewer larger pools is better for entropy storage.  The only
> reason for the current three-pool design is to achieve strict isolation
> of the /dev/random pool.

You're absolutely right, and I would like to see if we can get away
with keeping with a single pool design.  Your idea of using the CPU id
as a nonce is a good one, and I think we can do something similar that
should be good enough for mixing the hash back into the pool.  We can
simply use a hash of the CPU id to change the offset where we do the
mixing, and this should be good enough to avoid collisions when we do
the add-back.

HOWEVER.  One downside of this approach is that, especially on a NUMA
system, the costs of the cache coherency across a single pool which is
constantly being modified and shared by all of the NUMA nodes could be
a killer, even if we go completely lockless.

To that end, Andi, can you try benchmarking the scalability of the
patch below?  I really hope it will be good enough, since besides
using less memory, there are security advantages in not spreading the
entropy across N pools.

If it isn't, we might be able to play a trick where we sample the
r->add_ptr value before we start hashing the pool, and then check to
see if it's changed afterwards.  If it has, we could skip doing the
hash back, which could reduce the cache coherency traffic, since as
you point out:

> 2d. A more complete solution involves noting that, if there are multiple
>     concurrent readers, only one has to add back its output to prevent
>     backtracking, because all of the concurrent reads are equivalent
>     in that sense.  (Even though, because they're salted with separate
>     nonces like the CPU id, they're not identical.)

However, even if we put in that optimization, the primary question is
how good is Intel's cache coherency protocols on their NUMA systems?
I'm pretty sure this would be a disaster on, say, Sequent's old NUMA
machines, but I'm quite confident all of those servers are dead and
buried by now.  :-)

						- Ted

commit 3cb51896deab45bddc1b8f571b1103eae8f50e0e
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Sat Oct 10 22:03:53 2015 -0400

    random: make the reading from the non-blocking pool more scalable
    
    Andi Kleen reported a case where a 4 socket system spent >80% of its
    total CPU time contending on the global urandom nonblocking pool
    spinlock. While the application could probably have used an own PRNG,
    it may have valid reasons to use the best possible key for different
    session keys.
    
    Instead of creating separate multiple per-NUMA node non-blocking
    pools, use a trick suggested by George Spelvin:
    
       Entropy is a holographic property of the pool, not located in any
       particular bit position.
    
       And the most basic operation, of reading from the pool, can easily be
       done by multiple readers at the same time from the same bit pattern.
       They just need to be salted with distinguishing nonces (CPU IDs will do
       nicely) to ensure they all get different results....
    
    We use this trick, and in addition use a hash of the cpu id to change
    where we mix the hash back into the pool to avoid collisions.
    
    Since we are already using a lockless technique (cmpxchg) to update
    the entropy accounting, we don't need to change this around.
    
    This technique won't be quite as scalable since on a NUMA node we will
    still be forcing cache lines to bounce around, but from the
    perspective of entropy storage we're much better using a single pool
    rather than spreading it across multiple pools.
    
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d0da5d8..be6b315 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -260,6 +260,7 @@
 #include <linux/irq.h>
 #include <linux/syscalls.h>
 #include <linux/completion.h>
+#include <linux/hash.h>
 
 #include <asm/processor.h>
 #include <asm/uaccess.h>
@@ -494,7 +495,9 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
 {
 	unsigned long i, tap1, tap2, tap3, tap4, tap5;
 	int input_rotate;
+	int is_nonblock = (r == &nonblocking_pool);
 	int wordmask = r->poolinfo->poolwords - 1;
+	int poolbits = r->poolinfo->poolbitshift - 5;
 	const char *bytes = in;
 	__u32 w;
 
@@ -506,6 +509,8 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
 
 	input_rotate = r->input_rotate;
 	i = r->add_ptr;
+	if (is_nonblock)
+		i += hash_32(get_cpu(), poolbits);
 
 	/* mix one byte at a time to simplify size handling and churn faster */
 	while (nbytes--) {
@@ -534,6 +539,8 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
 
 	r->input_rotate = input_rotate;
 	r->add_ptr = i;
+	if (is_nonblock)
+		put_cpu();
 }
 
 static void __mix_pool_bytes(struct entropy_store *r, const void *in,
@@ -1090,6 +1097,7 @@ retry:
 static void extract_buf(struct entropy_store *r, __u8 *out)
 {
 	int i;
+	int is_nonblock = (r == &nonblocking_pool);
 	union {
 		__u32 w[5];
 		unsigned long l[LONGS(20)];
@@ -1108,9 +1116,12 @@ static void extract_buf(struct entropy_store *r, __u8 *out)
 			break;
 		hash.l[i] = v;
 	}
+	if (is_nonblock)
+		hash.w[0] ^= hash_32(get_cpu(), 32);
+	else
+		spin_lock_irqsave(&r->lock, flags);
 
 	/* Generate a hash across the pool, 16 words (512 bits) at a time */
-	spin_lock_irqsave(&r->lock, flags);
 	for (i = 0; i < r->poolinfo->poolwords; i += 16)
 		sha_transform(hash.w, (__u8 *)(r->pool + i), workspace);
 
@@ -1124,7 +1135,10 @@ static void extract_buf(struct entropy_store *r, __u8 *out)
 	 * hash.
 	 */
 	__mix_pool_bytes(r, hash.w, sizeof(hash.w));
-	spin_unlock_irqrestore(&r->lock, flags);
+	if (is_nonblock)
+		put_cpu();
+	else
+		spin_unlock_irqrestore(&r->lock, flags);
 
 	memzero_explicit(workspace, sizeof(workspace));
 

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-11  2:31 ` Theodore Ts'o
@ 2015-10-11  2:53   ` Theodore Ts'o
  2015-10-11  4:35   ` George Spelvin
  1 sibling, 0 replies; 26+ messages in thread
From: Theodore Ts'o @ 2015-10-11  2:53 UTC (permalink / raw)
  To: George Spelvin, andi, ahferroin7, jepler, linux-kernel, linux

On Sat, Oct 10, 2015 at 10:31:46PM -0400, Theodore Ts'o wrote:
> To that end, Andi, can you try benchmarking the scalability of the
> patch below?  I really hope it will be good enough, since besides
> using less memory, there are security advantages in not spreading the
> entropy across N pools.

Andi, I forgot to menion --- if you could benchmark using both on your
"run get_entropy() in a tight loop" and the actual real-life
application --- since if is CPU cache behavior is going to be a large
factor in the results, a real application is going to be more
representative that a micro-benchmark, that would be much appreciated.

Thanks!

					- Ted

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-11  2:31 ` Theodore Ts'o
  2015-10-11  2:53   ` Theodore Ts'o
@ 2015-10-11  4:35   ` George Spelvin
  2015-10-11 22:25     ` Theodore Ts'o
  1 sibling, 1 reply; 26+ messages in thread
From: George Spelvin @ 2015-10-11  4:35 UTC (permalink / raw)
  To: linux, tytso; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

Damn, I bow before the master.  That is a much neater solution
than mine; I had assumed a lock was required for writing.

While it's good enough for benchmarking, there are a few leftover
problems I mention so they don't get missed.

One is the final write back of add_ptr on the last line of
_mix_pool_bytes.  It actually writes i, which includes the per-cpu nonce,
and will have it jumping all over without the steady progression that
the mixing polynomial assumes.

(There's a similar, lesser problem with input_rotate.)

The second, less obvious, problem is that by calling _mix_pool_bytes
completely lockless, there's the risk that it will race with and overwrite
the addition of new seed material to the pool.

The add-back is not critical, and races between two writers don't really
do any harm.  But seed entropy is valuable.

And unfortunately, transferring 256 bits (32 bytes) to the output pool
will try to write every word, so *any* concurrent add-back is risky; there's
no "safe" part of the pool that can be accessed lockless.

(The first crude hack that comes to mind is to double the size
of the output pool, without increasing its nominal capacity, and
do seeding and add-back to different halves.  Hopefully there's
something more elegant.)

Two minor suggestions about is_nonblock:
1) Rather than using (r == &nonblocking_pool), how about !r->limit?

2) Would you mind making is_nonblock bool?

   I know back before the sacred text of the prophets Kernighan and
   Ritchie was corrupted by modern heresies, we used "int" for everything,
   and we liked it.  5 miles in the snow, uphill both ways, yadda yadda.

   But I like to document range restrictions as much as possible, and "bool"
   makes it clearer to both the compiler and the reader of the code.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-11  4:35   ` George Spelvin
@ 2015-10-11 22:25     ` Theodore Ts'o
  2015-10-12  0:16       ` George Spelvin
  0 siblings, 1 reply; 26+ messages in thread
From: Theodore Ts'o @ 2015-10-11 22:25 UTC (permalink / raw)
  To: George Spelvin; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

On Sun, Oct 11, 2015 at 12:35:46AM -0400, George Spelvin wrote:
> 
> One is the final write back of add_ptr on the last line of
> _mix_pool_bytes.  It actually writes i, which includes the per-cpu nonce,
> and will have it jumping all over without the steady progression that
> the mixing polynomial assumes.

Yep, good catch; I need to subtract that off.

> (There's a similar, lesser problem with input_rotate.)

I didn't bother messing with input_rotate at all, and I don't think
that will be a problem.

> The second, less obvious, problem is that by calling _mix_pool_bytes
> completely lockless, there's the risk that it will race with and overwrite
> the addition of new seed material to the pool.
> 
> The add-back is not critical, and races between two writers don't really
> do any harm.  But seed entropy is valuable.

That's true, but I've been going back and forth about how much it's
worth to fix this.  After all, the the non-blocking pool is guaranteed
to have cryptographic randomness, and so it's a question of how much
effort we want to avoid the possibility of losing some seed entropy.

One approach might be to use take a reader lock when mixing back
entropy, but take an exclusive lock in the case when seeding the
non-random pool.

Another approach is to have an alternate mechanism for the
non-blocking pool which uses the LOCK prefix when XOR'ing into the
pool.  This would mean that we would have to drop the twisted LFSR and
replace it with something simpler but so long as the mixing algothm
eventually involves all of the bits in the pool, that will probably be
OK.

Of course, using the LOCK prefix is CPU architecture specific, and in
particular, there is no equivalent on the ARM architecture.

The final thing we could do is to just throw in a smp_mb() before the
write into the pool, and just call it day, and accept that while we
might lose a small amount of entropy due to a race, it will only a
byte's worth each time we lose a race (instead of the whole 32 byte
cache line).

None of the alternatives are all that satisfying, and they will all be
a bit more expensive than the first patch.  The question is how to
weigh these problems against just simply using a separate pool for
each NUMA node, and how much scalability are we really need for
"realistic" workloads?

						- Ted

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-11 22:25     ` Theodore Ts'o
@ 2015-10-12  0:16       ` George Spelvin
  2015-10-12  4:05         ` Theodore Ts'o
  0 siblings, 1 reply; 26+ messages in thread
From: George Spelvin @ 2015-10-12  0:16 UTC (permalink / raw)
  To: linux, tytso; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

TedTs'o wrote:
> Yep, good catch; I need to subtract that off.

I'm not thrilled with incrementing the pointer from i to len, but mixing
at positions i+k to i+k+len.  The whole LFSR scheme relies on a regular
pass structure.

How about this instead: drop the hashed offset, and instead let each
writer do an atomic_add_return on the index, then iterate over the
"reserved" slots.  Concurrent additions will at least do non-overlapping
writes until the numer equals the pool size.

(Admittedly, if the pool is 32 words and we're adding back 20 bytes
per reader, overlap is hard to avoid.  Could we do seeding a byte at a
time with input rotate, but add-back 32 bits at a time for simplicity
and speed?)

The other one would be to have separate indexes for add-back and seed
addition, with the latter protected by a lock.

>> (There's a similar, lesser problem with input_rotate.)

> I didn't bother messing with input_rotate at all, and I don't think
> that will be a problem.

It was more that it's going to do strange non-deterministic things.

Personally, I hate the input_rotate.  It's not that it's harmful, just
that it doesn't do much good compared to the cost; for the number of cycles
and context space required there are more efficient mixing operations.

But if you want, you can easily compute it on demand.
If you avoid reducing r->add_words mod poolwords, then
	input_rotate = 7*(r->add_words + r->add_words/poolwords)

>> The add-back is not critical, and races between two writers don't really
>> do any harm.  But seed entropy is valuable.

> That's true, but I've been going back and forth about how much it's
> worth to fix this.  After all, the the non-blocking pool is guaranteed
> to have cryptographic randomness, and so it's a question of how much
> effort we want to avoid the possibility of losing some seed entropy.

I worry that with only 32 words of pool, on a large (1K CPUs) machine
that's execv()ing and geerating AT_RANDOM vectors a lot, the traffic
could be pretty heavy, and the odds of stomping a byte of seed material
cound get substantial.

Or a small machine with a couple of concurrent /dev/urandom abusers.
Remember, it's globally readable, so it has to be resistance to malicious
abuse.

> One approach might be to use take a reader lock when mixing back
> entropy, but take an exclusive lock in the case when seeding the
> non-random pool.

Even a reader lock is at least one atomic operation.

> Another approach is to have an alternate mechanism for the
> non-blocking pool which uses the LOCK prefix when XOR'ing into the
> pool.  This would mean that we would have to drop the twisted LFSR and
> replace it with something simpler but so long as the mixing algothm
> eventually involves all of the bits in the pool, that will probably be
> OK.
>
> Of course, using the LOCK prefix is CPU architecture specific, and in
> particular, there is no equivalent on the ARM architecture.

You can add rather than XOR, and we have atomic add primitives.

(You could even, if you wanted to preserve the period proof of the
mixing function, compute the value which, when added to the original
word, would make the XOR change, and then atomically add that.)

> The final thing we could do is to just throw in a smp_mb() before the
> write into the pool, and just call it day, and accept that while we
> might lose a small amount of entropy due to a race, it will only a
> byte's worth each time we lose a race (instead of the whole 32 byte
> cache line).

If you're willing to accept "probably" solutions, just add the seed
material twice.  If one byte is lost, the other copy will probably
survive.

Or, if you want to be cleverer, any form of error-correcting code
(duplication is just a simple for of it) will add enough redundant
information that a few erasures won't lose entropy.

But the atomic add seems safer.  I don't have a good feel for the fraction
of lost entropy a deliberate attack on a large machine could cause and
/dev/urandom is not an area where I'm comfortable hoping.

> None of the alternatives are all that satisfying, and they will all be
> a bit more expensive than the first patch.  The question is how to
> weigh these problems against just simply using a separate pool for
> each NUMA node, and how much scalability are we really need for
> "realistic" workloads?

There are several possible solutions that don't need separate pools
(including separate add-back pools, with a shared seeded pool that
is never touched by add-back), so I don't think it's necessary to
give up yet.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-12  0:16       ` George Spelvin
@ 2015-10-12  4:05         ` Theodore Ts'o
  2015-10-12  7:49           ` George Spelvin
  0 siblings, 1 reply; 26+ messages in thread
From: Theodore Ts'o @ 2015-10-12  4:05 UTC (permalink / raw)
  To: George Spelvin; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

On Sun, Oct 11, 2015 at 08:16:01PM -0400, George Spelvin wrote:
> 
> I'm not thrilled with incrementing the pointer from i to len, but mixing
> at positions i+k to i+k+len.  The whole LFSR scheme relies on a regular
> pass structure.

That part I'm not worried about.  We still have a regular pass
structure --- since for each CPU, we are still iterating over the pool
in a regular fashion.

> How about this instead: drop the hashed offset, and instead let each
> writer do an atomic_add_return on the index, then iterate over the
> "reserved" slots.  Concurrent additions will at least do non-overlapping
> writes until the numer equals the pool size.

One atomic operation per byte that we're mixing in?  That's quite
expensive.

> Personally, I hate the input_rotate.  It's not that it's harmful, just
> that it doesn't do much good compared to the cost; for the number of cycles
> and context space required there are more efficient mixing operations.

The input_rotate is useful for the input pool, for things like
keyboard code so we can make sure they enter the pool at more points
than just the low bits of each word.  But for the output pools, it
really doesn't make any sense.  And we are getting to the point where
we may end up having different mixing algorithms for the nonblocking
pool, and in that case I have absolutely no trouble dropping the
input_rotate part of the mixing algorithm for the non-blocking pool.

> Or a small machine with a couple of concurrent /dev/urandom abusers.
> Remember, it's globally readable, so it has to be resistance to malicious
> abuse.

One of the other ways we could solve this is by hanging a struct off
the task structure, and if we detect that we have a /dev/urandom
abuser, we give that process its own urandom pool, so any pissing that
it does will be in its own pool.  (So to speak.)

Most processes don't actually use that much randomness, and I'm not
that worried about in-kernel users of the nonblocking pool.  Even with
the most exec-heavy workload, setting up a new exec image is
heavyweight enough that you're really not going to be contending on
the lock.  I also have trouble with someone spending $$$$ on a system
with 1K cpu cores and wasting all of their CPU power with running
shell scripts that fork and exec a lot.  :-)

The reality is that most processes don't use /dev/urandom or
getrandom(2) at all, and those that do, many of them only use it
sparingly.  So maybe the right answer is to do something simple which
takes care of the abusers.

> You can add rather than XOR, and we have atomic add primitives.

Atomic-add primitives aren't portable either.  The representation
isn't guaranteed to be 32-bits, and some platforms an atomic int is
only 24-bits wide (the top 8 bits being used for locking purposes).

> There are several possible solutions that don't need separate pools
> (including separate add-back pools, with a shared seeded pool that
> is never touched by add-back), so I don't think it's necessary to
> give up yet.

Hmm, maybe.  I'm a bit worried about the amount of complexity that
this entails, and the reality is that the urandom pool or pools don't
provide anything other than cryptogaphic randomness.

At this point, I wonder if it might not be simpler to restrict the
current nonblocking pool to kernel users, and for userspace users, the
first time a process reads from /dev/urandom or calls getrandom(2), we
create for them a ChaCha20 CRNG, which hangs off of the task
structure.  This would require about 72 bytes of state per process,
but normally very few processes are reading from /dev/urandom or
calling getrandom(2) from userspace.

The CRNG would be initialized from the non-blocking pool, and is
reseeded after, say, 2**24 cranks or five minutes.  It's essentially
an OpenBSD-style arc4random in the kernel.  (Arguably the right answer
is to put arc4random in libc, where it can automatically handle
forks/clones/pthread automatically, but it seems pretty clear *that*
train has left a long time ago.)

I have a feeling this may be less code and complexity, and it nicely
handles the case where we have a /dev/urandom abuser who feels that
they want to call /dev/urandom in a tight loop, even on a 4 socket
Xeon system.  :-)

							- Ted

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-12  4:05         ` Theodore Ts'o
@ 2015-10-12  7:49           ` George Spelvin
  2015-10-12 13:54             ` Theodore Ts'o
  0 siblings, 1 reply; 26+ messages in thread
From: George Spelvin @ 2015-10-12  7:49 UTC (permalink / raw)
  To: linux, tytso; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

>> I'm not thrilled with incrementing the pointer from i to len, but mixing
>> at positions i+k to i+k+len.  The whole LFSR scheme relies on a regular
>> pass structure.

> That part I'm not worried about.  We still have a regular pass
> structure --- since for each CPU, we are still iterating over the pool
> in a regular fashion.

Not if I understand what you are doing.

Suppose CPU 0 has an offset of 0, and CPU 1 has an offset of 12.
(I'll also use a count-up index even though I know it's actually
count-down in the code.)

CPU0 writes 5 words at 0..4, leaving add_ptr = 5.
Then CPU 1 writes 5 words at 17..21, leaving add_ptr = 10
The CPU 0 writes its next word at 10.

Words 5..9 never got mixed at all.

>> How about this instead: drop the hashed offset, and instead let each
>> writer do an atomic_add_return on the index, then iterate over the
>> "reserved" slots.  Concurrent additions will at least do non-overlapping
>> writes until the number equals the pool size.

> One atomic operation per byte that we're mixing in?  That's quite
> expensive.

Well, for this part, it's one atomic operation per _mix_pool_bytes:
	i = (unsigned)atomic_sub_return(nbytes, &r->add_ptr) + nbytes;

However, I *also* have to use atomic_add to write r->pool[i],
which is indeed expensive.  PoC diff attached, but I'm not sure if
you'll like it.

> The input_rotate is useful for the input pool, for things like
> keyboard code so we can make sure they enter the pool at more points
> than just the low bits of each word.  But for the output pools, it
> really doesn't make any sense.  And we are getting to the point where
> we may end up having different mixing algorithms for the nonblocking
> pool, and in that case I have absolutely no trouble dropping the
> input_rotate part of the mixing algorithm for the non-blocking pool.

You're forgetting: keyboard codes don't go into the input pool.
They go into the per-CPU fast pool, which does some quite thorough
mixing and delivers 128 bits with the entropy effectively distributed
across it.

These days, the only thing that goes into the large pools are
the output from other pools and /dev/random writes.

But also, that's what the twist_table achieves.  By the time the
pool wraps around, the previous value at a given array posistion
has been read and twisted many times, and is now smeared across
all 32 bits.

If you want larger shifts *and* greater efficiency by removing a table
lookup from the critical path, I'll happily replace it with an Xorshift
structure (will take me a little while; I've generated primitive
polynomials in the past but need to rewrite the code).

>> Or a small machine with a couple of concurrent /dev/urandom abusers.
>> Remember, it's globally readable, so it has to be resistance to malicious
>> abuse.

> One of the other ways we could solve this is by hanging a struct off
> the task structure, and if we detect that we have a /dev/urandom
> abuser, we give that process its own urandom pool, so any pissing that
> it does will be in its own pool.  (So to speak.)

Let me make sure we're talking about the same thing here.

Segregating abusers so they don't cause *performance* problems
for other threads is a reasonable optimization.

But the quoted conversation was talking about a *security*
problem that could be created by /dev/urandom abusers if some
of your lock-relaxing ideas were followed.

I was observing that the failure case you were hoping was rare could
be made common by a malicious user.

And for that, I don't think a cacheing wrapper is the way to solve
the problem.

For security, either the underlying pool is robust or fragile.  If it's
robust, we don't need this extra layer; abusers will get shitty
performance but won't hurt the quality of the output.

If we need the layer, then we have to ask if we understand the
vulnerability to abuse and have successfully segregated all damaging
forms of abuse.  That's more analysis and design effort than simply
eliminating the problem in the first place.

> Most processes don't actually use that much randomness, and I'm not
> that worried about in-kernel users of the nonblocking pool.  Even with
> the most exec-heavy workload, setting up a new exec image is
> heavyweight enough that you're really not going to be contending on
> the lock.  I also have trouble with someone spending $$$$ on a system
> with 1K cpu cores and wasting all of their CPU power with running
> shell scripts that fork and exec a lot.  :-)

But I just re-read Andi's original trouble report, and although he mentions
AT_RANDOM, it's an applicatio reading from /dev/urandom a lot that's
causing the problem *not* execve.  My memory may have gotten confused.

I can imagine that one /dev/urandom abuser can minopolize the locck and
cause problems for a shell script.

But yes, for this performance problem, a segregate-the-abusers solution
is quite attractive.

> Atomic-add primitives aren't portable either.  The representation
> isn't guaranteed to be 32-bits, and some platforms an atomic int is
> only 24-bits wide (the top 8 bits being used for locking purposes).

No longer true; see Documentation/atomic_ops.txt.

But yes, atomic ops are *ridiculously* expensive on SMP 32-bit SPARC.

Do we care?  Is anyone using multiprocessor V7 SPARC in anger these days?

>> There are several possible solutions that don't need separate pools
>> (including separate add-back pools, with a shared seeded pool that
>> is never touched by add-back), so I don't think it's necessary to
>> give up yet.

> Hmm, maybe.  I'm a bit worried about the amount of complexity that
> this entails, and the reality is that the urandom pool or pools don't
> provide anything other than cryptogaphic randomness.

Well, *one* add-back pool would also do, and be a quite simple addition
to the code.

> the urandom pool or pools don't
> provide anything other than cryptogaphic randomness.

Well, there's cryptographic and there's cryptographic.
/dev/urandom is deliberately *ridiculosuly* overengineered;
It's not dependent on the security of a single primitive.

I'd rather not downgrade it to depending on the security of AES, or
ChaCha, or Keccak, or any other performance-optimized primitive.

For my own code, sure.  But /dev/random is sold as "you *really* don't
have to worry about it".

> At this point, I wonder if it might not be simpler to restrict the
> current nonblocking pool to kernel users, and for userspace users, the
> first time a process reads from /dev/urandom or calls getrandom(2), we
> create for them a ChaCha20 CRNG, which hangs off of the task
> structure.  This would require about 72 bytes of state per process,
> but normally very few processes are reading from /dev/urandom or
> calling getrandom(2) from userspace.

Interesting idea, although the loss of backtracking protection for
long-running servers is a significant semantic change.

Remember that making antibacktracking work is the entire problem we're
struggling to solve.  If discarding it is on the table, I can solve
the scalability problems extremely easily.  Even without giving it up
completely, just put the add-back inside if(trylock()) and the problem
goes away.

I'm also not sure where you get 72 bytes from.  You need 32 bytes of
key, 8 bytes of nonce, and 4 or 8 bytes of stream position to form the
ChaCha input block.

Then optionally add 64 bytes of output buffer if you don't want to
discard generated bytes after an unaligned read.

It adds up to 44/48, or 108/112 bytes.  Where does 72 come from?

But why bother with even that much?  If you're willing to discard
antibacktracking, just have *one* key for the kernel, reseeded more
often, and a per-thread nonce and stream position.

> (Arguably the right answer is to put arc4random in libc, where it can
> automatically handle forks/clones/pthread automatically, but it seems
> pretty clear *that* train has left a long time ago.)

I'm not quite which viatical incident you're alluding to, but yes that
would be the preferred solution.

> I have a feeling this may be less code and complexity, and it nicely
> handles the case where we have a /dev/urandom abuser who feels that
> they want to call /dev/urandom in a tight loop, even on a 4 socket
> Xeon system.  :-)

I'm not sanguine about its simplicity; it's a whole other layer.

Might be a good idea anyway.  I have a similar patch set in work, but
it's called /dev/frandom because I wasn't willing to suggest such a
semantic change.

But without interfering with legitimate users of /dev/urandom at all,
I'd be quite willing to say that as soon as you read more than 32 bytes
(current request plus recent history) from /dev/urandom, you get a
private ChaCha20 structure to read from.

(This is just enforcing the usage rules from the random(4) man page.
Since I wrote that text, I'm hardly going to object!)

Here's that atomic_t patch I mentioned above.

diff --git a/drivers/char/random.c b/drivers/char/random.c
index e62b30ba..e65357c4 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -260,6 +260,7 @@
 #include <linux/irq.h>
 #include <linux/syscalls.h>
 #include <linux/completion.h>
+#include <linux/hash.h>

 #include <asm/processor.h>
 #include <asm/uaccess.h>
@@ -423,7 +424,7 @@ struct entropy_store;
 struct entropy_store {
 	/* read-only data: */
 	const struct poolinfo *poolinfo;
-	__u32 *pool;
+	atomic_t *pool;
 	const char *name;
 	struct entropy_store *pull;
 	struct work_struct push_work;
@@ -431,20 +432,20 @@ struct entropy_store {
 	/* read-write data: */
 	unsigned long last_pulled;
 	spinlock_t lock;
-	unsigned short add_ptr;
-	unsigned short input_rotate;
+	atomic_t add_ptr;
 	int entropy_count;
 	int entropy_total;
 	unsigned int initialized:1;
 	unsigned int limit:1;
 	unsigned int last_data_init:1;
+	unsigned char input_rotate;
 	__u8 last_data[EXTRACT_SIZE];
 };

 static void push_to_pool(struct work_struct *work);
-static __u32 input_pool_data[INPUT_POOL_WORDS];
-static __u32 blocking_pool_data[OUTPUT_POOL_WORDS];
-static __u32 nonblocking_pool_data[OUTPUT_POOL_WORDS];
+static atomic_t input_pool_data[INPUT_POOL_WORDS];
+static atomic_t blocking_pool_data[OUTPUT_POOL_WORDS];
+static atomic_t nonblocking_pool_data[OUTPUT_POOL_WORDS];

 static struct entropy_store input_pool = {
 	.poolinfo = &poolinfo_table[0],
@@ -488,6 +489,10 @@ static __u32 const twist_table[8] = {
  * degree, and then twisted.  We twist by three bits at a time because
  * it's cheap to do so and helps slightly in the expected case where
  * the entropy is concentrated in the low-order bits.
+ *
+ * This function is designed to be safe to call without locking the
+ * entropy_store.  Race conditions might do strange things, but will
+ * leave the pool valid and not lose entropy.
  */
 static void _mix_pool_bytes(struct entropy_store *r, const void *in,
 			    int nbytes)
@@ -496,7 +501,8 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
 	int input_rotate;
 	int wordmask = r->poolinfo->poolwords - 1;
 	const char *bytes = in;
-	__u32 w;
+
+	BUILD_BUG_ON(sizeof(int) != sizeof(__u32));

 	tap1 = r->poolinfo->tap1;
 	tap2 = r->poolinfo->tap2;
@@ -505,23 +511,31 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
 	tap5 = r->poolinfo->tap5;

 	input_rotate = r->input_rotate;
-	i = r->add_ptr;
+	i = (unsigned)atomic_sub_return(nbytes, &r->add_ptr) + nbytes;

 	/* mix one byte at a time to simplify size handling and churn faster */
 	while (nbytes--) {
-		w = rol32(*bytes++, input_rotate);
-		i = (i - 1) & wordmask;
+		__u32 v, w = rol32(*bytes++, input_rotate);

 		/* XOR in the various taps */
-		w ^= r->pool[i];
-		w ^= r->pool[(i + tap1) & wordmask];
-		w ^= r->pool[(i + tap2) & wordmask];
-		w ^= r->pool[(i + tap3) & wordmask];
-		w ^= r->pool[(i + tap4) & wordmask];
-		w ^= r->pool[(i + tap5) & wordmask];
+		i = (i - 1) & wordmask;
+		w ^= atomic_read(r->pool + ((i + tap1) & wordmask));
+		w ^= atomic_read(r->pool + ((i + tap2) & wordmask));
+		w ^= atomic_read(r->pool + ((i + tap3) & wordmask));
+		w ^= atomic_read(r->pool + ((i + tap4) & wordmask));
+		w ^= atomic_read(r->pool + ((i + tap5) & wordmask));
+		w ^= v = atomic_read(r->pool + i);
+		/* Twist the result to mix bits within words */
+		w = (w >> 3) ^ twist_table[w & 7];

-		/* Mix the result back in with a twist */
-		r->pool[i] = (w >> 3) ^ twist_table[w & 7];
+		/*
+		 * Now, to put the result back, we don't have an
+		 * atomic_xor operation, but we do have an atomic_add.
+		 * Atomically add the value which would cause the desired
+		 * change.  If there's a race, something strange will happen,
+		 * but we won't lose entropy.
+		 */
+		atomic_add(w - v, r->pool + i);

 		/*
 		 * Normally, we add 7 bits of rotation to the pool.
@@ -532,8 +546,13 @@ static void _mix_pool_bytes(struct entropy_store *r, const void *in,
 		input_rotate = (input_rotate + (i ? 7 : 14)) & 31;
 	}

-	r->input_rotate = input_rotate;
-	r->add_ptr = i;
+	/*
+	 * Access to input_rotate is not thread-safe, because the
+	 * actual value doesn't matter to the security guarantees,
+	 * and the fact that it changes is enough for its purpose.
+	 * So if there's a race condition, it doesn't matter who wins.
+	 */
+	r->input_rotate = (unsigned char)input_rotate;
 }

 static void __mix_pool_bytes(struct entropy_store *r, const void *in,
@@ -1109,17 +1128,26 @@ retry:
  * This function does the actual extraction for extract_entropy and
  * extract_entropy_user.
  *
+ * For the blocking pools, we lock the pool until we feed back the
+ * extracted entropy, so the next extract_buf will return different
+ * results.
+ *
+ * For the non-blocking pool, concurrent access to the pool is allowed,
+ * and the CPU id is used as a salt to ensure that concurrent readers
+ * will get different results.
+ *
  * Note: we assume that .poolwords is a multiple of 16 words.
  */
 static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE])
 {
 	int i;
+	bool is_blocking = r->limit;
 	union {
 		__u32 w[5];
 		unsigned long l[LONGS(20)];
 	} hash;
 	__u32 workspace[SHA_WORKSPACE_WORDS];
-	unsigned long flags;
+	unsigned long flags = flags;	/* "initialization" shuts up GCC */

 	/*
 	 * If we have an architectural hardware random number
@@ -1133,8 +1161,11 @@ static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE])
 		hash.l[i] = v;
 	}

+	if (is_blocking)
+		spin_lock_irqsave(&r->lock, flags);
+	else
+		hash.w[0] ^= hash_32(get_cpu(), 32);
 	/* Generate a hash across the pool, 16 words (512 bits) at a time */
-	spin_lock_irqsave(&r->lock, flags);
 	for (i = 0; i < r->poolinfo->poolwords; i += 16)
 		sha_transform(hash.w, (__u8 *)(r->pool + i), workspace);

@@ -1148,7 +1179,10 @@ static void extract_buf(struct entropy_store *r, __u8 out[EXTRACT_SIZE])
 	 * hash.
 	 */
 	__mix_pool_bytes(r, hash.w, sizeof(hash.w));
-	spin_unlock_irqrestore(&r->lock, flags);
+	if (is_blocking)
+		spin_unlock_irqrestore(&r->lock, flags);
+	else
+		put_cpu();

 	memzero_explicit(workspace, sizeof(workspace));

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-12  7:49           ` George Spelvin
@ 2015-10-12 13:54             ` Theodore Ts'o
  2015-10-12 20:30               ` George Spelvin
  0 siblings, 1 reply; 26+ messages in thread
From: Theodore Ts'o @ 2015-10-12 13:54 UTC (permalink / raw)
  To: George Spelvin; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

On Mon, Oct 12, 2015 at 03:49:08AM -0400, George Spelvin wrote:
> 
> Suppose CPU 0 has an offset of 0, and CPU 1 has an offset of 12.
> (I'll also use a count-up index even though I know it's actually
> count-down in the code.)
> 
> CPU0 writes 5 words at 0..4, leaving add_ptr = 5.
> Then CPU 1 writes 5 words at 17..21, leaving add_ptr = 10
> The CPU 0 writes its next word at 10.
> 
> Words 5..9 never got mixed at all.

Hmm, good point.  It *shouldn't* matter if the hash is secure, but the
potentially uneven mixing would be a concern.

> However, I *also* have to use atomic_add to write r->pool[i],
> which is indeed expensive.  PoC diff attached, but I'm not sure if
> you'll like it.

Yeah, that was the part that worried me.  Whether we use an atomic add
or an atomic xor, it's going to be very expensive. Using an atomic add
means that it will work with arm, but there's an explicit loop with
arm, so on an arm system, it effectively turns into a more efficient
spinlock, except you're spinning on each byte, so instead of bouncing
cache lines with the locking granularity of the extract, we would be
bouncing cache lines for every single byte of the mixback operation.

> You're forgetting: keyboard codes don't go into the input pool.
> They go into the per-CPU fast pool, which does some quite thorough
> mixing and delivers 128 bits with the entropy effectively distributed
> across it.

They used to go through the input pool, and when we added the per-CPU
fast pool, we didn't revisit the question of whether the input_rotate
was still needed.  So yes, at this point it might be worthwhile to
remove the input_rotate part of the mixing scheme.

> Segregating abusers so they don't cause *performance* problems
> for other threads is a reasonable optimization.
> 
> But the quoted conversation was talking about a *security*
> problem that could be created by /dev/urandom abusers if some
> of your lock-relaxing ideas were followed.

Segregating abusers solves both problems.  If we do this then we don't
need to drop the locks from the nonblocking pool, which solves the
security problem.  And it also means that each process has its own
CRNG, which is way faster than /dev/urandom even if you don't have the
locking problem, and it provides per-process isolation, which means
that process A can't carry out a backtracking attack on process B.

> > Most processes don't actually use that much randomness, and I'm not
> > that worried about in-kernel users of the nonblocking pool.  Even with
> > the most exec-heavy workload, setting up a new exec image is
> > heavyweight enough that you're really not going to be contending on
> > the lock.  I also have trouble with someone spending $$$$ on a system
> > with 1K cpu cores and wasting all of their CPU power with running
> > shell scripts that fork and exec a lot.  :-)
> 
> But I just re-read Andi's original trouble report, and although he mentions
> AT_RANDOM, it's an applicatio reading from /dev/urandom a lot that's
> causing the problem *not* execve.  My memory may have gotten confused.
> 
> I can imagine that one /dev/urandom abuser can minopolize the locck and
> cause problems for a shell script.

Well, that's my point.  If we restrict the nonblocking pool to kernel
users, then a /dev/urandom abuser won't be able to cause problems for
the shell script.  And even the most inefficient shell script which is
firing off hundreds of processes per second is unlikely to be able to
swamp the spin loop, *and* I would hope that someone who spent $$$ on
a super-expensive 1K cpu system would also have the wit to actually
recode the shell script in a more efficient compiled language, instead
of wasting all of that expensive hardware.

> Remember that making antibacktracking work is the entire problem we're
> struggling to solve.  If discarding it is on the table, I can solve
> the scalability problems extremely easily.  Even without giving it up
> completely, just put the add-back inside if(trylock()) and the problem
> goes away.

The trylock() scheme is a bit dangerous, since we would need to make
sure that *all* times when other callers take the lock, the contents
of the pool would have changed.  Or else, a subseqent call to extract
entropy from that CPU will return the same value.  And even if that
were true today (or we could make it be true) if that ever changed, it
would break the code.  So it makes it pretty fragile.

> I'm also not sure where you get 72 bytes from.  You need 32 bytes of
> key, 8 bytes of nonce, and 4 or 8 bytes of stream position to form the
> ChaCha input block.

The ChaCha 20 state block is 64 bytes, and then I was assuming two 4
byte control variables, for the number of cranks and the time since
last reseed.  We could actually store those control variables into the
IV portion of the ChaCha state structure, which would make it be 64
bytes --- or we could generate a new state structure each time and not
bother storing the 16 bytes of constants in the state structure, which
trades a tiny amount of time and stack space for memory.  These are
all implementation details, though....

> But why bother with even that much?  If you're willing to discard
> antibacktracking, just have *one* key for the kernel, reseeded more
> often, and a per-thread nonce and stream position.

Well, we're not completely discarding backtracking protection.  We are
discarding backtracking protection between successive reads from a
single process, and even there we would be reseeding every five
minutes (and this could be tuned), so there is *some*
anti-backtracking protection.

I'm going to make the claim that if a particular process is reading
from /dev/urandom in a tight loop, you probably don't *need*
backtracking.  You're probably doing something silly such as using
/dev/urandom to overwrite a disk, or some such.

On the flip side, the time when you might care about anti-backtracing
protection is say, when you're generating a new session key for a new
connection.  So perhaps one approach is to use some kind of
ratelimiter algorithm so that if you're using /dev/urandom "carefully"
(say, no more than ten times a second), we'll use the non-blocking
pool.  But once a process exceeds that limit, it will switch over the
the CRNG, and then the only performance that abuser process will hurt
is its own (because it would be even faster if they were running
something like arc4random in userspace).

> But without interfering with legitimate users of /dev/urandom at all,
> I'd be quite willing to say that as soon as you read more than 32 bytes
> (current request plus recent history) from /dev/urandom, you get a
> private ChaCha20 structure to read from.

Yeah, that's what I was thinking, although I was willing to be a bit
more generous about when we switch them over to using the ChaCha pool.
I think doing this automatically is going to be much better than
creating a new /dev/frandom, since it's going to be hard to get
applications to switch over.  We're still several years away from
swtiching people over to the new getrandom(2) system call, after all.

	  	      	     	 	      - Ted

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-12 13:54             ` Theodore Ts'o
@ 2015-10-12 20:30               ` George Spelvin
  2015-10-12 20:34                 ` George Spelvin
  2015-10-13  2:46                 ` Theodore Ts'o
  0 siblings, 2 replies; 26+ messages in thread
From: George Spelvin @ 2015-10-12 20:30 UTC (permalink / raw)
  To: linux, tytso; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

Theodore Ts'o wrote:
> On Mon, Oct 12, 2015 at 03:49:08AM -0400, George Spelvin wrote:
>> Words 5..9 never got mixed at all.

> Hmm, good point.  It *shouldn't* matter if the hash is secure, but the
> potentially uneven mixing would be a concern.

I don't quite follow the "shouldn't matter" part.  The problem is that
insufficient input mixing can cause collisions, which are loss of entropy.
The output hash(the secure one) can't fix that.

The whole reason for the 5-tap LFSR is to ensure that each input word
affects a significant number of *other* words before getting stored into
by fresh input.

I just realized a worse problem: suppose CPU 1 has an offset of -1.
It will proceed to *undo* the mixing just done by CPU 0.  I don't just
mean that it'll store its input over top, but it will also undo the
LFSR XOR.  Ack!

(Maybe I'll retract that "bow before the master" remark. :-))

> They used to go through the input pool, and when we added the per-CPU
> fast pool, we didn't revisit the question of whether the input_rotate
> was still needed.  So yes, at this point it might be worthwhile to
> remove the input_rotate part of the mixing scheme.

The fast pools actually just extended an issue which the three-pool
design created: when spilling from one pool to the other, the whole
"expensive output, cheap input" optimization makes no sense.

For raw data and interrupt overhead, yes obviously.  But when it's
8 sha_transform() calls to generate 10 bytes, the design of the
subsequent input hash needs some rethinking.

It may make sense to have a separate path for "known high-quality"
input that puts more effort into diffusion.

Idea 1: Given that we do have an entropy estimate with every write to a
        pool, what if we did an input hash that did work proportional
        to the *entropy* of the input.	With some lower bound for
        zero-entropy writes like plain /dev/random writes and mixback.)

Idea 2: For input mixing not in interrupt context, which includes all
        input to the output pools, it would be smaller and simpler code
        to just XOR into the pool, and defer mixing until one whole-pool
        mixing pass after the XOR position reched the end of the pool.

        Concurrent writers could do a single atomic_add_return to
        find an add position, and if that extended past the end of the
        buffer, they'd obtain the lock, stir the pool, decrement the add
        position (which would permit other concurrent writers) and then
        do their XORs.

        There's still the risk of races between delayed non-locking
        inputters and stirrers, but maybe if non-locking inputters
        were limited to non-critical mixbacks, that would be manageable.

>> However, I *also* have to use atomic_add to write r->pool[i],
>> which is indeed expensive.  PoC diff attached, but I'm not sure if
>> you'll like it.

> Yeah, that was the part that worried me.  Whether we use an atomic add
> or an atomic xor, it's going to be very expensive. Using an atomic add
> means that it will work with arm, but there's an explicit loop with
> arm, so on an arm system, it effectively turns into a more efficient
> spinlock, except you're spinning on each byte, so instead of bouncing
> cache lines with the locking granularity of the extract, we would be
> bouncing cache lines for every single byte of the mixback operation.

Yes.  The current ticket spinlocks are little more than a single
atomic_add_return, and the dangers of putting multiple locks in the same
cache line is well known.

> Segregating abusers solves both problems.  If we do this then we don't
> need to drop the locks from the nonblocking pool, which solves the
> security problem.

Er, sort of.  I still think my points were valid, but they're
about a particular optimization suggestion you had.  By avoiding
the need for the optimization, the entire issue is mooted.

I still say a separate firewall is a bad way to *solve* security problems,
but it can avoid preformance pressure to *create* them in the first place.

>> I can imagine that one /dev/urandom abuser can minopolize the locck and
>> cause problems for a shell script.

> Well, that's my point.  If we restrict the nonblocking pool to kernel
> users, then a /dev/urandom abuser won't be able to cause problems for
> the shell script.

And I quite agree with that.  Sorry, sometimes a discussion goes in
multiple directions and it's hard to keep track of the point of any
particular sentence.

I get fixated on one point and intrepret comments about something else
as if they pertained to the issue I'm thinking about.

> The trylock() scheme is a bit dangerous, since we would need to make
> sure that *all* times when other callers take the lock, the contents
> of the pool would have changed.  Or else, a subseqent call to extract
> entropy from that CPU will return the same value.  And even if that
> were true today (or we could make it be true) if that ever changed, it
> would break the code.  So it makes it pretty fragile.

Sorry, yes, I'm prefectly aware of this, I just glossed over it.
Each extraction has to have a unique nonce; that's non-negotiable, but
not hard to achieve.  We'd need to add a per-cpu counter to go along
with the CPU id to make the extraction nonce.

Doing this would be valuable *anyway*, because it would let us
reduce the mixback to once per read() rather than once per 10 bytes.

(Also, what would you say to reducing arch_get_random_long() to 16
bytes, and reserving hash.w[5] for the nonce?  That would ensure that
even a malicious arch_get_random_long() couldn't force a collision.)

> The ChaCha 20 state block is 64 bytes, and then I was assuming two 4
> byte control variables, for the number of cranks and the time since
> last reseed.  We could actually store those control variables into the
> IV portion of the ChaCha state structure, which would make it be 64
> bytes --- or we could generate a new state structure each time and not
> bother storing the 16 bytes of constants in the state structure, which
> trades a tiny amount of time and stack space for memory.  These are
> all implementation details, though....

You have to copy the state *anyway* because you don't want it overwritten
by the ChaCha output, so there's really no point storing the constants.
(Also, ChaCha has a simpler input block structure than Salsa20; the
constants are all adjacent.)

And ChaCha includes a stream position counter itself, so you don't need
a separate one.

(Note: one problem with ChaCha specifically is that is needs 16x32 bits
of registers, and Arm32 doesn't quite have enough.  We may want to provide
an arch CPRNG hook so people can plug in other algorithms with good
platform support, like x86 AES instructions.)

>> But why bother with even that much?  If you're willing to discard
>> antibacktracking, just have *one* key for the kernel, reseeded more
>> often, and a per-thread nonce and stream position.

> Well, we're not completely discarding backtracking protection.  We are
> discarding backtracking protection between successive reads from a single
> process, and even there we would be reseeding every five minutes (and
> this could be tuned), so there is *some* anti-backtracking protection.

Yeah, I realized this a few hours after sending that e-mail.
Anti-backtracking between processes is more important than within.

> I'm going to make the claim that if a particular process is reading
> from /dev/urandom in a tight loop, you probably don't *need*
> backtracking.  You're probably doing something silly such as using
> /dev/urandom to overwrite a disk, or some such.

I fully agree.

> On the flip side, the time when you might care about anti-backtracing
> protection is say, when you're generating a new session key for a new
> connection.  So perhaps one approach is to use some kind of
> ratelimiter algorithm so that if you're using /dev/urandom "carefully"
> (say, no more than ten times a second), we'll use the non-blocking
> pool.  But once a process exceeds that limit, it will switch over the
> the CRNG, and then the only performance that abuser process will hurt
> is its own (because it would be even faster if they were running
> something like arc4random in userspace).

I consider the size of the read at least as much as the rate.
As mentioned in the random(4) man page, I don't think more than 256
bits of computational security even exists.  Human Bitcoin mining is
(per http://bitcoincharts.com/) doing 440051 Thash/s, which is 80 bits
per month.

We can extrapolate computational feasibility beyond what's currently
been done.  I think 128-bit claims are defensible.  But once we get beyond
the square of anything ever attempted (170 bits or so), security claims
really don't know what they're talking about.  256 bits is the *cube*.
That's just crazy talk.

This in turn means that someone asking for more than 256 bits of seed
material is doing something stupid.  (In the more diplomatic language of
the man page, "should be taken as a sign that its cryptography is _not_
skillfully implemented."  I didn't want to totally condmen quick one-time
hacks.)

>> But without interfering with legitimate users of /dev/urandom at all,
>> I'd be quite willing to say that as soon as you read more than 32 bytes
>> (current request plus recent history) from /dev/urandom, you get a
>> private ChaCha20 structure to read from.

> Yeah, that's what I was thinking, although I was willing to be a bit
> more generous about when we switch them over to using the ChaCha pool.
> I think doing this automatically is going to be much better than
> creating a new /dev/frandom, since it's going to be hard to get
> applications to switch over.  We're still several years away from
> swtiching people over to the new getrandom(2) system call, after all.

I worry this might attract the same sort of shitstorm as the SHA-3
proposal to tweak Keccak's preimage resistance (which, for the benefit of
the peanut gallery, was solid cryptographic engineering that belatedly
fixed a stupidity in the original SHA-3 contest rules), but if you're
up for it, I'm not going to object.

I'm not too hung up on the thresholds, as long as we catch bulk
readers on the first read, rather than wait until *after* they've
drained /dev/urandom.  I'd prefer a single 64-byte read request would
move to mitigation mode, but I could be talked into 128.  (Actually,
it would probably make more sense to have the threshold be a multiple
of EXTRACT_SIZE.)

History is probably best kept in a leaky bucket of some sort.  That has a
current level and last update jiffies, and each read, we subtract off
the allowed maximum rate from the level.

(Or would an exponential average be better?  That would make the maximum
read rate more strongly dependent on the evenness of the spacing.)

Then add that (scaled somehow?) to the current read size and compare
with the threshold.

The same variables can be used (with different parameters) to decide if
we want to get out of mitigation mode.  The one thing to watch out for
is that "cat </dev/urandom >/dev/sdX" may have some huge pauses once
the buffer cache fills.  We don't want to forgive after too small a
fixed interval.

Finally, we have the issue of where to attach this rate-limiting structure
and crypto context.  My idea was to use the struct file.  But now that
we have getrandom(2), it's harder.  mm, task_struct, signal_struct, what?

(Post-finally, do we want this feature to be configurable under
CONFIG_EMBEDDED?  I know keeping the /dev/random code size small is
a speficic design goal, and abuse mitigation is optional.)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-12 20:30               ` George Spelvin
@ 2015-10-12 20:34                 ` George Spelvin
  2015-10-13  2:46                 ` Theodore Ts'o
  1 sibling, 0 replies; 26+ messages in thread
From: George Spelvin @ 2015-10-12 20:34 UTC (permalink / raw)
  To: linux, tytso; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

(BTW, my previous e-mail went out early due to a mis-click.  It was almost
done, but please excuse any unfinished sentences.)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-12 20:30               ` George Spelvin
  2015-10-12 20:34                 ` George Spelvin
@ 2015-10-13  2:46                 ` Theodore Ts'o
  2015-10-13  3:50                   ` Raymond Jennings
                                     ` (2 more replies)
  1 sibling, 3 replies; 26+ messages in thread
From: Theodore Ts'o @ 2015-10-13  2:46 UTC (permalink / raw)
  To: George Spelvin; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

On Mon, Oct 12, 2015 at 04:30:59PM -0400, George Spelvin wrote:
> > Segregating abusers solves both problems.  If we do this then we don't
> > need to drop the locks from the nonblocking pool, which solves the
> > security problem.
> 
> Er, sort of.  I still think my points were valid, but they're
> about a particular optimization suggestion you had.  By avoiding
> the need for the optimization, the entire issue is mooted.

Sure, I'm not in love with anyone's particular optimization, whether
it's mine, yours, or Andi's.  I'm just trying to solve the scalability
problem while also trying to keep the code maintainable and easy to
understand (and over the years we've actually made things worse, to
the extent that having a single mixing for the input and output pools
is starting to be more of problem than a feature, since we're coding
in a bunch of exceptions when it's the output pool, etc.).

So if we can solve a problem by routing around it, that's fine in my
book.

> You have to copy the state *anyway* because you don't want it overwritten
> by the ChaCha output, so there's really no point storing the constants.
> (Also, ChaCha has a simpler input block structure than Salsa20; the
> constants are all adjacent.)

We're really getting into low-level implementations here, and I think
it's best to worry about these sorts of things when we have a patch to
review.....

> (Note: one problem with ChaCha specifically is that is needs 16x32 bits
> of registers, and Arm32 doesn't quite have enough.  We may want to provide
> an arch CPRNG hook so people can plug in other algorithms with good
> platform support, like x86 AES instructions.)

So while a ChaCha20-based CRNG should be faster than a SHA-1 based
CRNG, and I consider this a good thing, for me speed is **not** more
important than keeping the underlying code maintainable and simple.
This is one of the reasons why I looked at, and then discarded, to use
x86 accelerated AES as the basis for a CRNG.  Setting up AES so that
it can be used easily with or without hardware acceleration looks very
complicated to do in a cross-architectural way, and I don't want to
drag in all of the crypto layer for /dev/random.

> The same variables can be used (with different parameters) to decide if
> we want to get out of mitigation mode.  The one thing to watch out for
> is that "cat </dev/urandom >/dev/sdX" may have some huge pauses once
> the buffer cache fills.  We don't want to forgive after too small a
> fixed interval.

At least initially, once we go into mitigation mode for a particular
process, it's probably safer to simply not exit it.

> Finally, we have the issue of where to attach this rate-limiting structure
> and crypto context.  My idea was to use the struct file.  But now that
> we have getrandom(2), it's harder.  mm, task_struct, signal_struct, what?

I'm personally more inclined to keep it with the task struct, so that
different threads will use different crypto contexts, just from
simplicity point of view since we won't need to worry about locking.

Since many processes don't use /dev/urandom or getrandom(2) at all,
the first time they do, we'd allocate a structure and hang it off the
task_struct.  When the process exits, we would explicitly memzero it
and then release the memory.

> (Post-finally, do we want this feature to be configurable under
> CONFIG_EMBEDDED?  I know keeping the /dev/random code size small is
> a speficic design goal, and abuse mitigation is optional.)

Once we code it up we can see how many bytes this takes, we can have
this discussion.  I'll note that ChaCha20 is much more compact than SHA1:

   text	   data	    bss	    dec	    hex	filename
   4230	      0	      0	   4230	   1086	/build/ext4-64/lib/sha1.o
   1152	    304	      0	   1456	    5b0	/build/ext4-64/crypto/chacha20_generic.o

... and I've thought about this as being the first step towards
potentially replacing SHA1 with something ChaCha20 based, in light of
the SHAppening attack.  Unfortunately, BLAKE2s is similar to ChaCha
only from design perspective, not an implementation perspective.
Still, I suspect the just looking at the crypto primitives, even if we
need to include two independent copies of the ChaCha20 core crypto and
the Blake2s core crypto, it still should be about half the size of the
SHA-1 crypto primitive.

And from the non-plumbing side of things, Andi's patchset increases
the size of /dev/random by a bit over 6%, or 974 bytes from a starting
base of 15719 bytes.  It ought to be possible to implement a ChaCha20
based CRNG (ignoring the crypto primitives) in less than 974 bytes of
x86_64 assembly.  :-)

						- Ted

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-13  2:46                 ` Theodore Ts'o
@ 2015-10-13  3:50                   ` Raymond Jennings
  2015-10-13  7:50                     ` George Spelvin
  2015-10-13  6:24                   ` George Spelvin
  2015-10-13 16:20                   ` Andi Kleen
  2 siblings, 1 reply; 26+ messages in thread
From: Raymond Jennings @ 2015-10-13  3:50 UTC (permalink / raw)
  To: Theodore Ts'o, George Spelvin, ahferroin7, andi, jepler,
	linux-kernel, linux



On Mon, Oct 12, 2015 at 7:46 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Mon, Oct 12, 2015 at 04:30:59PM -0400, George Spelvin wrote:
>>  > Segregating abusers solves both problems.  If we do this then we 
>> don't
>>  > need to drop the locks from the nonblocking pool, which solves the
>>  > security problem.
>> 
>>  Er, sort of.  I still think my points were valid, but they're
>>  about a particular optimization suggestion you had.  By avoiding
>>  the need for the optimization, the entire issue is mooted.
> 
> Sure, I'm not in love with anyone's particular optimization, whether
> it's mine, yours, or Andi's.  I'm just trying to solve the scalability
> problem while also trying to keep the code maintainable and easy to
> understand (and over the years we've actually made things worse, to
> the extent that having a single mixing for the input and output pools
> is starting to be more of problem than a feature, since we're coding
> in a bunch of exceptions when it's the output pool, etc.).
> 
> So if we can solve a problem by routing around it, that's fine in my
> book.
> 
>>  You have to copy the state *anyway* because you don't want it 
>> overwritten
>>  by the ChaCha output, so there's really no point storing the 
>> constants.
>>  (Also, ChaCha has a simpler input block structure than Salsa20; the
>>  constants are all adjacent.)
> 
> We're really getting into low-level implementations here, and I think
> it's best to worry about these sorts of things when we have a patch to
> review.....
> 
>>  (Note: one problem with ChaCha specifically is that is needs 16x32 
>> bits
>>  of registers, and Arm32 doesn't quite have enough.  We may want to 
>> provide
>>  an arch CPRNG hook so people can plug in other algorithms with good
>>  platform support, like x86 AES instructions.)
> 
> So while a ChaCha20-based CRNG should be faster than a SHA-1 based
> CRNG, and I consider this a good thing, for me speed is **not** more
> important than keeping the underlying code maintainable and simple.
> This is one of the reasons why I looked at, and then discarded, to use
> x86 accelerated AES as the basis for a CRNG.  Setting up AES so that
> it can be used easily with or without hardware acceleration looks very
> complicated to do in a cross-architectural way, and I don't want to
> drag in all of the crypto layer for /dev/random.
> 
>>  The same variables can be used (with different parameters) to 
>> decide if
>>  we want to get out of mitigation mode.  The one thing to watch out 
>> for
>>  is that "cat </dev/urandom >/dev/sdX" may have some huge pauses once
>>  the buffer cache fills.  We don't want to forgive after too small a
>>  fixed interval.
> 
> At least initially, once we go into mitigation mode for a particular
> process, it's probably safer to simply not exit it.
> 
>>  Finally, we have the issue of where to attach this rate-limiting 
>> structure
>>  and crypto context.  My idea was to use the struct file.  But now 
>> that
>>  we have getrandom(2), it's harder.  mm, task_struct, signal_struct, 
>> what?
> 
> I'm personally more inclined to keep it with the task struct, so that
> different threads will use different crypto contexts, just from
> simplicity point of view since we won't need to worry about locking.
> 
> Since many processes don't use /dev/urandom or getrandom(2) at all,
> the first time they do, we'd allocate a structure and hang it off the
> task_struct.  When the process exits, we would explicitly memzero it
> and then release the memory.
> 
>>  (Post-finally, do we want this feature to be configurable under
>>  CONFIG_EMBEDDED?  I know keeping the /dev/random code size small is
>>  a speficic design goal, and abuse mitigation is optional.)
> 
> Once we code it up we can see how many bytes this takes, we can have
> this discussion.  I'll note that ChaCha20 is much more compact than 
> SHA1:
> 
>    text	   data	    bss	    dec	    hex	filename
>    4230	      0	      0	   4230	   1086	/build/ext4-64/lib/sha1.o
>    1152	    304	      0	   1456	    
> 5b0	/build/ext4-64/crypto/chacha20_generic.o
> 
> ... and I've thought about this as being the first step towards
> potentially replacing SHA1 with something ChaCha20 based, in light of
> the SHAppening attack.  Unfortunately, BLAKE2s is similar to ChaCha
> only from design perspective, not an implementation perspective.
> Still, I suspect the just looking at the crypto primitives, even if we
> need to include two independent copies of the ChaCha20 core crypto and
> the Blake2s core crypto, it still should be about half the size of the
> SHA-1 crypto primitive.
> 
> And from the non-plumbing side of things, Andi's patchset increases
> the size of /dev/random by a bit over 6%, or 974 bytes from a starting
> base of 15719 bytes.  It ought to be possible to implement a ChaCha20
> based CRNG (ignoring the crypto primitives) in less than 974 bytes of
> x86_64 assembly.  :-)
> 
> 						- Ted
> 
> --
> To unsubscribe from this list: send the line "unsubscribe 
> linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

This might be stupid, but could something asynchronous work?  Perhaps 
have the entropy generators dump their entropy into a central pool via 
a cycbuf, and have a background kthread manage the per-cpu or 
per-process entropy pools?




^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-13  2:46                 ` Theodore Ts'o
  2015-10-13  3:50                   ` Raymond Jennings
@ 2015-10-13  6:24                   ` George Spelvin
  2015-10-13 16:20                   ` Andi Kleen
  2 siblings, 0 replies; 26+ messages in thread
From: George Spelvin @ 2015-10-13  6:24 UTC (permalink / raw)
  To: linux, tytso; +Cc: ahferroin7, andi, jepler, linux-kernel, linux

> We're really getting into low-level implementations here, and I think
> it's best to worry about these sorts of things when we have a patch to
> review.....

> it's probably safer to simply not exit it.

> I'm personally more inclined to keep it with the task struct, so that
> different threads will use different crypto contexts, just from
> simplicity point of view since we won't need to worry about locking.

> Once we code it up we can see how many bytes this takes, we can have
> this discussion.

I'm fine with all of these; thank you.

> This is one of the reasons why I looked at, and then discarded, to use
> x86 accelerated AES as the basis for a CRNG.  Setting up AES so that
> it can be used easily with or without hardware acceleration looks very
> complicated to do in a cross-architectural way, and 

I haven't looked as deeply, but it didn't look too hard.  Is it possible
to briefly explain the problem?

I assumed you'd have an arch-specific capabilities probe function that
would set up an operations structure.  That would provide the various
buffer sizes required, and setup (kernel_fpu_begin() and key scheduling)
CPRNG core, and teardown (kernel_fpu_end()) functions.

It there some huge gotcha I'm overlooking?

> I don't want to drag in all of the crypto layer for /dev/random.

Oh, gods, no; the crypto layer drives me nuts.  Truthfully, the main
hair of the crypto layer is all the modular cipher modes on top of block
ciphers, and the scatterlist stuff to handle arbitrarily fragmented
input and output buffers for the benefit of the network layer, but the
code is horrible reading.

Every time I look, I find something I want to fix (the CTS mode
implementation uses 6 blocks worth of stack buffer; I have a patch to
reduce that to 3) but then I get lost is the morass of structures and
wrappers trying to make the code fit in with the rest.

There's a struct crypto_tfm, crypto_alg, crypto_instance, cipher_desc,
crypto_type, crypto_template, crypto_spawn... I've been trying to read it
and I still have no idea what half of them are for.

And I *still* haven't figured out how to get the self-test code to tell
me that test X was performed and passed.  I ended up writing my own test,
which seems wrong.

> ... and I've thought about this as being the first step towards
> potentially replacing SHA1 with something ChaCha20 based, in light of
> the SHAppening attack.  Unfortunately, BLAKE2s is similar to ChaCha
> only from design perspective, not an implementation perspective.
> Still, I suspect the just looking at the crypto primitives, even if we
> need to include two independent copies of the ChaCha20 core crypto and
> the Blake2s core crypto, it still should be about half the size of the
> SHA-1 crypto primitive.

Well, the SHAppening doesn't really change much except a slight schedule
tweak, but yeah, it's one of those things that would be nice to get
around to.

I'm not sure what you expect to do with ChaCha, though; it's really
an expansion function, not compression, and not easily adapted to be one.

BLAKE2 is a bit ugly.  I'm generally not liking MD5/SHA-like designs
that dribble the message and some round constants in; I'm much preferring
the large-state "add a bunch of input all at once" designs like Keccak,
SipHash and DJB's SHA-3 entry, CubeHash.

Have you seen it?  It's quite similar to Keccak, just using a 32x32 = 1024
bit state rather than Keccak's 25*64=1600.

The reason it got dumped is because, like Keccak, to get n bits of
preimage resistance, it requires 2n bits of "capacity" bits unused each
round.  When you ask for 512 bits of preimage resistance, you can
only import a few bits of message each block.

Keccak has the same problem, but it has a big enough block that it can handle
it.

In Dan's submission, you'll see his usual "this is a stupid request
which I'm only paying lip service to", and his "SHA-3-512-formal"
proposal was dog-slow.  To quote:

	The "SHA-3-512-formal" proposal is aimed at users who are
	(1) concerned with attacks using 2^384 operations,
	(2) unconcerned with quantum attacks that cost far less, and
	(3) unaware that attackers able to carry out 2^256 operations
	    would wreak havoc on the entire SHA-3 landscape, forcing SHA-3
	    to be replaced no matter which function is selected as SHA-3.
	The "SHA-3-512-normal" proposal is aimed at sensible users.
	For all real-world cryptographic applications, the "formal"
	versions here can be ignored, and the tweak amounts to a proposal
	of CubeHash16/32 as SHA-3."

If NIST had proposed changing the preimage resistance rules *before*
the final decision, things would have gone a lot differently.

> And from the non-plumbing side of things, Andi's patchset increases
> the size of /dev/random by a bit over 6%, or 974 bytes from a starting
> base of 15719 bytes.  It ought to be possible to implement a ChaCha20
> based CRNG (ignoring the crypto primitives) in less than 974 bytes of
> x86_64 assembly.  :-)

Yes, not hard.  Are you inspired, or would you like me to put together
a patch?

And should I do moderately evil space-saving things like store the
pointer to the crypto state and the leaky bucket in the same task slot,
distinguished by the pointer lsbit?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-13  3:50                   ` Raymond Jennings
@ 2015-10-13  7:50                     ` George Spelvin
  0 siblings, 0 replies; 26+ messages in thread
From: George Spelvin @ 2015-10-13  7:50 UTC (permalink / raw)
  To: ahferroin7, andi, jepler, linux-kernel, linux, linux, shentino,
	tytso

> This might be stupid, but could something asynchronous work?  Perhaps 
> have the entropy generators dump their entropy into a central pool via 
> a cycbuf, and have a background kthread manage the per-cpu or 
> per-process entropy pools?

No for two reasons:

(Minor): One of the functions of the mixback is to ensure that the next
	reader hashes a *different* pool state.  If the mixback is
	delayed, the next reader might hash the *same* pool state and
	return the same numbers.  (There are easy workarounds for this.)
(Major): What do you do when the circular buffer is full?  If it's not safe
	to skip the mixback, then we have to block and get into the same
	lock-contention problem.

But... this suggestion of having a separate thread do the mixback gives
me an idea.  In fact, I think it's a good idea.

Ted (or anyone else listening), what do you think of the following?
I think it would solve Andi's problem and be a smaller code change than
the abuse mitigation mode.  (Which is still a good idea, but is off the
critical path.)

- Associated with a pool is an atomic "mixback needed" flag.
- Also an optional mixback buffer.  (Optional because the mixback
  could just recompute it.)

- Dropping the lock requires the following operations:
  - Test the mixback needed flag.  If set,
    - Copy out and wipe the buffer,
    - smp_wmb()
    - Clear the flag
    - smp_wmb()
    - Do the mixback, and
    - Re-check again before dropping the lock.
    (This check before dropping the lock is technically an optional
    optimization.)
  - Drop the lock.
  - smp_mb()	(Since it's write-to-read, we can't use _rmb or _wmb.)
  - Test the mixback pending flag again.
  - If it's set, trylock().  If that succeeds, go do the mixback as above.
  - If it fails, return.

Each reader uses already-discussed nonce techniques to safely do concurrent
reads from the same pool.  Then, at the end:
- (Optional) trylock() and, if it succeeds,
  do mixback directly.
- Copy our mixback data to the buffer (race conditions be damned)
- smp_wmb()
- set the mixback needed flag
- smp_mb()	(Since it's write-to-read; or use smp_store_mb())
- trylock()
  - If that fails. return
  - If that succeeds (and the flag is still set) do the mixback

This is based on the fact that if there are multiple concurrent reads,
we only need one mixback (thus, only one buffer/flag), but the "last one
out the door" has to do it.

Also, we don't care if we mis-count and end up doing it twice.

Each reader sets the flag and *then* does a trylock.  If the trylock fails,
it's guaranteed that the lock-holder will see the flag and take care of
the mixback for us.

The writers drop the lock and *then* test the flag.

The result is that readers *never* do a blocking acquire of the pool
lock.  Which should solve all the contention problems.  Andi's stupid
application will still be stupid, but won't fall off a locking cliff.

(We could also use w[5] as the "mixback needed" flag and just
force it to 1 on the off chance it's zero with negligible loss
of entropy and zero security loss.)

The one thing I worry about is livelock keeping one thread in the
mixback code indefinitely, which can be mitigated by dropping the lock
and waiting before re-testing and re-acquiring if we loop too often.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-13  2:46                 ` Theodore Ts'o
  2015-10-13  3:50                   ` Raymond Jennings
  2015-10-13  6:24                   ` George Spelvin
@ 2015-10-13 16:20                   ` Andi Kleen
  2015-10-13 21:10                     ` George Spelvin
  2 siblings, 1 reply; 26+ messages in thread
From: Andi Kleen @ 2015-10-13 16:20 UTC (permalink / raw)
  To: Theodore Ts'o, George Spelvin, ahferroin7, andi, jepler,
	linux-kernel, linux

I tested the two proposed patches from earlier this thread on a 4S
system. This was just with my (worst case) micro.

Unfortunately both patches scale much worse than the duplicated pools,
and can be even worse than the baseline (not sure why).

The base line peaks at slightly above 200K ops/s with less than 20 CPUs.
And the it gets slower until settling around 100K ops/s with more CPUs.

Ted's patch peaks at 350K with four CPUs, but then quickly degrades to
50K ops/s at 20+ CPUs. At 144 CPUs it is slightly faster again at ~80K.

Spelvin's patch peaks at only 140K at 2 CPUs (so it's slower than base line),
stays around 120K upto 20, then degrades quickly to 50K and then slowly
improves again to ~80K.

The duplicated pool patch is ~200K upto 20 CPus, 400K upto 40, 600K at
slightly below 60 CPUs, and then very slowly degrades to 520K at 144. 

-Andi

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-13 16:20                   ` Andi Kleen
@ 2015-10-13 21:10                     ` George Spelvin
  2015-10-14  2:15                       ` Andi Kleen
  2015-10-21  8:27                       ` George Spelvin
  0 siblings, 2 replies; 26+ messages in thread
From: George Spelvin @ 2015-10-13 21:10 UTC (permalink / raw)
  To: ahferroin7, andi, jepler, linux-kernel, linux, linux, tytso

> Ted's patch peaks at 350K with four CPUs, but then quickly degrades to
> 50K ops/s at 20+ CPUs. At 144 CPUs it is slightly faster again at ~80K.

Good to know, thanks!  With its race conditions, it's basically a "best
case" for that particular design, which tells me that more significant
changes are required.

Off hand, do you know how large a read each operation is?  I want to
reduce mixback from once per 10 bytes to once per read, and the size
ratio will give me some idea of how large an improvement to expect.

> Spelvin's patch peaks at only 140K at 2 CPUs (so it's slower than base line),
> stays around 120K upto 20, then degrades quickly to 50K and then slowly
> improves again to ~80K.

Sorry to make you go to the trouble; I knew from discussions with Ted
that it wasn't going to work.  It was mostly just in the form of a patch
for the sake of a more concrete discussion.

I'll have a patch that I hope will do some good for testing in a couple
of hours.

> The duplicated pool patch is ~200K upto 20 CPus, 400K upto 40, 600K at
> slightly below 60 CPUs, and then very slowly degrades to 520K at 144. 

Shitty performance is practically a design goal of /dev/urandom.  You
are NOT supposed to hit it more than once per minute per thread.

But since we have a real-world problem with it, Ted's "abusse mitigation
mode" idea (where the kernel does what the app should do: seed a private
CPRNG and use that) will provide good security at extremely high access
rates.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-13 21:10                     ` George Spelvin
@ 2015-10-14  2:15                       ` Andi Kleen
  2015-10-21  8:27                       ` George Spelvin
  1 sibling, 0 replies; 26+ messages in thread
From: Andi Kleen @ 2015-10-14  2:15 UTC (permalink / raw)
  To: George Spelvin; +Cc: ahferroin7, andi, jepler, linux-kernel, linux, tytso

> Off hand, do you know how large a read each operation is?  I want to
> reduce mixback from once per 10 bytes to once per read, and the size
> ratio will give me some idea of how large an improvement to expect.

My test reads 64 bytes using the syscall.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-13 21:10                     ` George Spelvin
  2015-10-14  2:15                       ` Andi Kleen
@ 2015-10-21  8:27                       ` George Spelvin
  2015-10-21 11:47                         ` Andi Kleen
  1 sibling, 1 reply; 26+ messages in thread
From: George Spelvin @ 2015-10-21  8:27 UTC (permalink / raw)
  To: ahferroin7, andi, jepler, linux-kernel, linux, linux, tytso

After having worked out the basics of a non-blocking /dev/urandom read,
(patches poated; I'm hoping someone will comment on them!), I need to
work out the crypto.

Patch 3/4 of that series showed how to have concurrent pool readers not
separated by mixback operations, but it was a stub, not really well
thought through.  Since then, I've been trying to think of how to do
it properly.  That entails figuring out exactly what the design goals
of the mixback are.

The current mixback operation works as follows:
* Generate 160 bits of hash output
* Produce 80 bits of that as user output
* Mix all 160 bits back into the pool.  This achieves two things:
  * Salting.  It changes the pool state so the next hash will
    produce different output.
  * Anti-backtracking protection.  By adding the output back to the input,
    it makes it difficult for an attacker who captures the later pool
    state to figure out the 

Now, the salting goal can be achieved just as well with a simple
unique nonce.  In the example, it's an incrementing counter and CPU ID.

But the anti-backtracking is more complex.

First of all, we note that the current design is only computationally
secure, not information theoretically.  To achieve the latter, we'd
have to actually destroy entropy in the pool (by zeroing it or doing
something like Spritz's crush() operation), which doesn't seem worth it.

The basic idea of having as much mixback as output seems good.
If you're going to do it at all, you might as well do it right,
and you don't want a 256-bit key to have an 160-bit attack from
a captured pool state.

But there is a practical maximum.  I can't think of a reason why you'd
need more than the output pool size (128 bytes = 1024 bits), but maybe
256 bits is good enough.  Thoughts?

(256 bits also corresponds to one full cycle of the input mix
over the pool, FWIW.)

One thing that's questionable is the fact that the entire 20 bytes is mixed
back.  Because the input hash is not very rapid, it's possible for the
mixback to land on a low-entropy part of the pool and not be well hidden.

If that happens, the previous random output (or partial information about
it) would be recoverable, even if the full previous pool state is not.

I'm not sure how big a problem this is, but only mixing back 80
bits (independent of the output bits) might be better.

A change I'm thinking about, which would simplify implementation
considerably, would be to use all 160 bits of hash output for the user
output, and then do additional hashing, to do the mixback.

This greatly simplifies the contended-pool case by only requiring
readers to pass the *amount* of mixback to the lock holder, which
can be done with an atomic_add.  There's no need to pass actual data,
because the lock-holder can just generate it themselves.

Specifically, in the low-contention case, the reader would use any
leftover hash bytes for mixback if it could get the lock, but it wouldn't
be a big performance hit to throw them away if someone else was holding
the lock.

In the case of extremely high contention, the limit on maximum total
mixback would let the lock holder do less mixback than the total
that would be done by the concurrent readers.

But that requires understanding the reasons for only outputting
half of the hash result, to know if it's safe to make the change.
Three possibilities come to mind:

1. To produce enough mixback.  This is preserved by the proposed change.

2. To avoid entropy loss due to collisions in the non-invertible final
   addition in a "narrow-pipe" hash like SHA.

   This is a generic issue with hashes which do not maintain an internal
   state wider than the input.  This includes MD5, SHA1, SHA2, and
   BLAKE/BLAKE2.  (But not Keccak, SipHash, JH, Gr{\o}stl, Skein, etc.)

   Assume the first half of the pool has lots of entropy, so the 160-bit
   output of the first sha_transform() call is perfectly uniformly
   distributed.  (Each of the 2^160 possible output occurs with probably
   2^-160, so 160 bits of entropy by any easure.)

   If the second half of the pool has no entropy (a known, fixed value),
   then sha_transform approximates a 160-to-160 bit random function.
   Such a function has output collisions with probabilities described
   by the Poisson distribution with mean (lambda) 1.  (This number is
   the ratio of input and output state sizes.)

   In particular, 1/e = e^-1 = 36.788% of the outputs do not appear for
   any input.  (If there are k time more inputs than output, all equally
   likely, it's e^-k.  So just a few bits reduces that drastically.)

   If you do the Shannon entropy math, that means that the output
   has log2(e) = 0.827245389... fewer bits of entropy than the input.
   (If there are k states in the second half of the pool, the entropy loss
   is slightly less than 1/k of this value.  So 163 bits in produces 160 -
   0.827/8 = 159.9 bits out.)

   If you're going by min-entropy, it's worse.  Over 2^160 inputs,
   the Poisson distribution predicts 3.5 41-way collisions, and 0.0866
   42-way collisions.  So the loss is log2(41) = 5.358 bits.  (If there
   are 8 states in the second half, then we expect a 77-way collision, for
   a min-entropy of 160 - log2(77/8) = 160 - 3.2668 = 156.73 bits.)

   (For more on this, the work of Andrea R\"ock, e.g. "Collision Attacks
   based on the Entropy Loss caused by Random Functions", is good reading.)

   Although this affects all narrow-pipe hashes, you only need to chop
   a few bits off the state size to be nearly perfect in Shannon terms,
   and 8 bits gets you less than 1 bit of min-entropy lost.  2 bytes
   makes the largest collision that you expect to find at least 1 of among
   2^176 inputs is a 69270-way, resulting in a loss of log2(69270/65536)
   = 0.08 bit of min-entropy.

   So while I can see the point to truncating the hash, cutting it in half
   seems unnecessary.

3. General "I don't trust SHA-1" principles without a more specific
   reason.  If this is all there is, would substituting a different hash
   (I'm looking into BLAKE2 at Ted's suggestion) help?

   Note that the entire SHAppening issue is not particularly relevant
   to /dev/urandom attacks.  Attacking the /dev/urandom output mix is
   more like a pre-image attack, but considerably harder, since you're
   not trying to find not just *a* pre image, but *the* pre-image.

   Using the expected pre-image resistance of the hash is insanely
   conservaitve.  Perhaps not a bad thing, though.  But could we use
   128 bits rather than 80?

Finally, I'm trying to figure out how to use the get_radom_int_hash
array for salt material.  We already have it, so it would be nice
to use.  But the current way it's used (basically, uses the random_int_secret
as a key and runs in OFBmode) leaves the last random number visible.

A pool reader must change its salt if it can't get the mixback lock.
There may be a few bytes of leftover entropy available for the purpose
if the hash size doesn't divide the read size.

I could always do another hash iteration to reseed get_random_int_hash if
I can't get the mixback lock, but is there a cheaper way?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-21  8:27                       ` George Spelvin
@ 2015-10-21 11:47                         ` Andi Kleen
  2015-10-21 18:10                           ` George Spelvin
  0 siblings, 1 reply; 26+ messages in thread
From: Andi Kleen @ 2015-10-21 11:47 UTC (permalink / raw)
  To: George Spelvin; +Cc: ahferroin7, andi, jepler, linux-kernel, linux, tytso

On Wed, Oct 21, 2015 at 04:27:43AM -0400, George Spelvin wrote:
> After having worked out the basics of a non-blocking /dev/urandom read,
> (patches poated; I'm hoping someone will comment on them!), I need to
> work out the crypto.

Can we just use my multi pool patch for now? It works and is actually scalable
and does not require any new "cryptographic research" or other risks.

The main concern I heard was code size. I addressed this in the latest
version by making the new code depend on CONFIG_NUMA, which is generally
not set on small systems.

-Andi

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Updated scalable urandom patchkit
  2015-10-21 11:47                         ` Andi Kleen
@ 2015-10-21 18:10                           ` George Spelvin
  0 siblings, 0 replies; 26+ messages in thread
From: George Spelvin @ 2015-10-21 18:10 UTC (permalink / raw)
  To: andi, linux; +Cc: ahferroin7, jepler, linux-kernel, linux, tytso

> Can we just use my multi pool patch for now? It works and is actually scalable
> and does not require any new "cryptographic research" or other risks.

To clarify, it's actually quite easy to come up with something that works,
cryptographically.  All the discussion is:
1) Erring on the side of public discussion for a security-sensitive area, and
2) Trying to find "The Right Thing" among the many workable solutions.

Earlier, Ted seemed to be interested in more significant changes like
replacing sha_transform and "abuse mitigation mode", so I've been
thinking expansively.  The immediate plans aren't necessarily quite
as complex.

Mostly, it's an embarrassment of ways to do it, and trying to find
a reason to choose one over the other.  I wrote one very quickly, said
"I can do better", thought about coding that up, thought of an even
better solution, and decided to post the performance-affecting part
while I played donkey-between-two-piles-of-hay.

> or other risks.

Ted gets to decide, but my objection is that the already-sparse entropy
supply is considerably diluted.  That *is* a risk.

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2015-10-21 18:10 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-09-24 17:19 Updated scalable urandom patchkit Andi Kleen
2015-09-24 17:19 ` [PATCH 1/3] Make /dev/urandom scalable Andi Kleen
2015-09-30 14:40   ` Rasmus Villemoes
2015-09-24 17:19 ` [PATCH 2/3] random: Make input to output pool balancing per cpu Andi Kleen
2015-09-24 17:19 ` [PATCH 3/3] random: Add pool name to urandom_read trace point Andi Kleen
  -- strict thread matches above, loose matches on Subject: below --
2015-10-10 18:45 Updated scalable urandom patchkit George Spelvin
2015-10-11  2:31 ` Theodore Ts'o
2015-10-11  2:53   ` Theodore Ts'o
2015-10-11  4:35   ` George Spelvin
2015-10-11 22:25     ` Theodore Ts'o
2015-10-12  0:16       ` George Spelvin
2015-10-12  4:05         ` Theodore Ts'o
2015-10-12  7:49           ` George Spelvin
2015-10-12 13:54             ` Theodore Ts'o
2015-10-12 20:30               ` George Spelvin
2015-10-12 20:34                 ` George Spelvin
2015-10-13  2:46                 ` Theodore Ts'o
2015-10-13  3:50                   ` Raymond Jennings
2015-10-13  7:50                     ` George Spelvin
2015-10-13  6:24                   ` George Spelvin
2015-10-13 16:20                   ` Andi Kleen
2015-10-13 21:10                     ` George Spelvin
2015-10-14  2:15                       ` Andi Kleen
2015-10-21  8:27                       ` George Spelvin
2015-10-21 11:47                         ` Andi Kleen
2015-10-21 18:10                           ` George Spelvin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).