Re: Revising prandom_32 generator

All of lore.kernel.org
 help / color / mirror / Atom feed

From: George Spelvin <lkml@sdf.org>
To: daniel@iogearbox.net, hannes@stressinduktion.org, lkml@sdf.org
Cc: netdev@vger.kernel.org
Subject: Re: Revising prandom_32 generator
Date: Wed, 27 Mar 2019 21:43:02 GMT	[thread overview]
Message-ID: <201903272143.x2RLh1AP002035@sdf.org> (raw)
In-Reply-To: <0ce9983d-f257-4808-bd84-4385587e1b41@www.fastmail.com>

On Wed, 27 Mar 2019 at 14:32:55 -0400, Hannes Frederic Sowa wrote:
> On Tue, Mar 26, 2019, at 20:07, George Spelvin wrote:
>> On Tue, 26 Mar 2019 at 14:03:55 -0400, Hannes Frederic Sowa wrote:
>>> Are you trying to fix the modulo biases? I think that prandom_u32_max
>>> also has bias, would that be worth fixing as well?
>> 
>> I could if you like.  I submitted a patch (appended) to do that
>> for /dev/urandom (using a very clever new algorithm that does it
>> all with a single multiply in the common case!), but didn't think
>> the prandom users were critical enough to care.
> 
> I am not sure. A quick look shows that it might get used for
> selecting timeouts in the networking stack. Should not matter here
> at all. If used to select an index in pre-allocated arrays, then
> effects might be visible.

I looked around the code and couldn't find anything that looked
critical.

Remember that the bias is very small for small ranges.  You're
choosing between floor(2^32/range) and ceil(2^32/range), which is
a relative difference of 1/(2^32/range) = range/2^32.

If range = 100, that's 12.3 ppb.  It reaches 1 ppm at range=4295.

> Additionally, by adding all the branches it might make sense to
> downgrade the function from 'static inline' to regular function.

If the range is a compile-time constant, the code is

	u64 prod;

	do
		prod = (u64)range * prandom_u32();
	whie (unlikely((u32)prod < lim));
	reurn prod >> 32;

This is almost the same as the biased code, except that one
compare and one branch is added.  (And a mutiply-low on
processors with separate mutiply-high instructions.)

But if the range is variable, it's worth taking out of line.

> All in all, I don't think it is required and I would just add it
> if you think it is worth it as well. The additional comments you
> proposed are definitely worth to add.

I also judged it unnecessary, so until someone speaks up I won't do it.

> I wasn't sure if the non-consecutive extraction of output
> would blow up the Gaussian Elimination at some point (as in taking
> a few hours to reduce), which I would not consider trivial anymore.

No, it doesn't matter where the bits are as long as you know the
spacing.  What might happen with non-consecutive bits is that a
few are linearly dependent and so don't count toward the required
113.  This is easily solved by getting a few more bits.

Also, linear equations are just as good as bits.  For example,
given the output of prandom_u32_max(3), we can always derive an
equation:

Output = 0 -> bit 31 = 0
Output = 1 -> bit 31 ^ bit30 == 1
Output = 2 -> bit 31 = 1

> Anyway, using those functions in those situations would already
> be a bug, as Stephen correctly said, but some safety net would be
> useful (e.g. observing timeouts from a box right after boot up to
> determine the initial state of the fragmentation IDs would be bad
> - but early reseeding etc. makes that already difficult).

If it's attackable, you're not supposed to use prandom().  Dithered
timeouts, for example, are a way for *cooperating* devices to avoid
fighting over a shared bus (or other resource).  A malicious device
can just monopolize the bus and fight all comers.

>> All of the proposed replacements are considerably less linear and
>> harder to invert.  The xoshiro family are the most linear,
>> with an LFSR core and only a very simple non-linear state masking.
> 
> Then all of the proposed prngs are an improvement? Definitely go for it.

>> Here's the patch mentioned above for unbiased range reduction.
>> 
>> I just thought of a great improvement!  The slow computation
>> is "-range % range".  I could use __buitin_constant_p(range)
>> to decide between the following implementation and one that
>> has -range % range pre-computed and passed in.  I.e.
> 
> Heh, if you already made the work and maybe benchmarked it, it
> might make sense to kill the bias in prandom_u32_max anyway? :)
> The proposal below looks great!

I aready wrote and posted the code for get_random_u32().
The question is code size and time, not implementation.
(It's appended for reference.)

>> 	if ((u32)prod < range - 1) {
> 
> Maybe 'unlikely' could help here?

That's in the final version; see below.

In-Reply-To: <201903241244.x2OCiL8P011277@sdf.org>
References: <201903241244.x2OCiL8P011277@sdf.org>
From: George Spelvin <lkml@sdf.org>
Subject: [RFC PATCH v2] random: add get_random_max() function
To: linux-kernel@vger.kernel.org, Theodore Ts'o <tytso@mit.edu>
Cc: : Jason A. Donenfeld <Jason@zx2c4.com>, George Spelvin <lkml@sdf.org>

RFC because:
- Only lightly tested so far, and
- Expecting feedback on the function names

v1->v2: I realized that if the range is a compile-time constant, the
	code can be simplified considerably.  This version has two
	implementations depending on __buitin_constant_p(range).
	(Specifically, the expensive "lim = -range % range" is also a
	compile-time constant, and the "fast path" which tries to avoid
	computing it is unnecessary.)

I got annoyed that randomize_page() wasn't exactly uniform, and used the
slow modulo-based range reduction rather than the fast multiplicative one,
so I fixed it.

I figure, if you're going to the trouble of using cryptographic random
numbers, you don't want biased outputs, even if it's slight.

This includes optimized code for compile-time constant ranges, for the
benefit of future callers, even though that doesn't help randomize_page().

Signed-off-by: George Spelvin <lkml@sdf.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
---
 drivers/char/random.c  | 113 ++++++++++++++++++++++++++++++++++++++---
 include/linux/random.h | 109 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 214 insertions(+), 8 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 9f4a348c7eb5..ffe6af0e165b 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -2353,6 +2353,99 @@ static void invalidate_batched_entropy(void)
 	write_unlock_irqrestore(&batched_entropy_reset_lock, flags);
 }
 
+/**
+ * __get_random_max32 - Generate a uniform random number 0 <= x < range < 2^32
+ * @range:	Modulus for random number; must not be zero!
+ *
+ * This generates an exactly uniform distribution over [0, range), where
+ * range is *not* a compile-time constant.
+ *
+ * This uses Lemire's multiplicative algorithm, from
+ *	Fast Random Integer Generation in an Interval
+ *	https://arxiv.org/abs/1805.10941
+ *
+ * The standard multiplicative range technique is (rand32() * range) >> 32.
+ * However, there are either floor((1<<32) / range) or ceil((1<<32) / range)
+ * random values that will result in each output value.  More specifically,
+ * lim = (1 << 32) % range values will be generated one extra time.
+ *
+ * Lemire proves (Lemma 4.1) that those extra values can be identified
+ * by the lsbits of the product.  If you discard and retry whenever the
+ * lsbits are less than lim, you get the floor option always.
+ */
+u32 __get_random_max32(u32 range)
+{
+	u64 prod = mul_u32_u32(get_random_u32(), range);
+
+	/*
+	 * Fast-path check: lim < range, so lsbits < range-1
+	 * implies lsbits < lim.
+	 */
+	if (unlikely((u32)prod < range - 1)) {
+		/* Slow path; we need to divide to check exactly */
+		u32 lim = -range % range;	/* = (1<<32) % range */
+
+		while (unlikely((u32)prod < lim))
+			prod = mul_u32_u32(get_random_u32(), range);
+	}
+	return prod >> 32;
+}
+EXPORT_SYMBOL(__get_random_max32);
+
+#ifdef CONFIG_64BIT
+/**
+ * __get_random_max64 - Generate a uniform random number 0 <= x < range < 2^64
+ * @range:	Modulus for random number; must not be zero!
+ *
+ * Like get_random_max32, but for larger ranges.  There are two
+ * implementations, depending on CONFIG_ARCH_SUPPORTS_INT128.
+ */
+u64 __get_random_max64(u64 range)
+{
+#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
+	/* Same as get_random_max32(), with larger data types. */
+	unsigned __int128 prod;
+
+	if (range >> 32 == 0)
+		return __get_random_max32(range);
+
+	prod = (unsigned __int128)get_random_u64() * range;
+
+	if (likely((u64)prod < range - 1)) {
+		/* Slow path; we need to divide to check exactly */
+		u64 lim = -range % range;	/* = (1<<64) % range */
+
+		while ((u64)prod < lim)
+			prod = (unsigned __int128)get_random_u64() * range;
+	}
+	return prod >> 64;
+#else
+	/*
+	 * We don't have efficient 128-bit products (e.g. SPARCv9), so
+	 * break it into two halves: choose the high bits, then
+	 * choose the low bits depending on them.
+	 */
+	u32 lsbits, msbits:
+
+	if (range >> 32 == 0)
+		return __get_random_max32(range);
+
+	msbits = (range + 0xffffffff) >> 32;
+	/* If range >= 0xffffffff00000001, msbits will wrap to 0 */
+	if (likely(msbits))
+		msbits = __get_random_max32(msbits)
+	else
+		msbits = get_random_u32();
+	if (unlikely(msbits == (u32)(range >> 32)))
+		lsbits = __get_random_max32((u32)range)
+	else
+		lsbits = get_random_u32();
+	return (u64)msbits << 32 | lsbits;
+#endif
+}
+EXPORT_SYMBOL(__get_random_max64);
+#endif /* CONFIG_64BIT */
+
 /**
  * randomize_page - Generate a random, page aligned address
  * @start:	The smallest acceptable address the caller will take.
@@ -2370,20 +2463,24 @@ static void invalidate_batched_entropy(void)
 unsigned long
 randomize_page(unsigned long start, unsigned long range)
 {
-	if (!PAGE_ALIGNED(start)) {
-		range -= PAGE_ALIGN(start) - start;
-		start = PAGE_ALIGN(start);
-	}
+	/* How much to round up to make start page-aligned */
+	unsigned long align = -start & (PAGE_SIZE - 1);
 
-	if (start > ULONG_MAX - range)
-		range = ULONG_MAX - start;
+	if (unlikely(range < align))
+		return start;
+
+	start += align;
+	range -= align;
+
+	if (unlikely(range > -start))
+		range = -start;
 
 	range >>= PAGE_SHIFT;
 
-	if (range == 0)
+	if (unlikely(range == 0))
 		return start;
 
-	return start + (get_random_long() % range << PAGE_SHIFT);
+	return start + (get_random_max(range) << PAGE_SHIFT);
 }
 
 /* Interface for in-kernel drivers of true hardware RNGs.
diff --git a/include/linux/random.h b/include/linux/random.h
index 25c0e3f0152c..fc929369471f 100644
--- a/include/linux/random.h
+++ b/include/linux/random.h
@@ -60,6 +60,115 @@ static inline unsigned long get_random_long(void)
 #endif
 }
 
+/**
+ * get_random_max32 - return a 32-bit random number in interval [0, range)
+ * @range: Size of random interval [0,range); the first value not included.
+ *
+ * This uses Lemire's algorithm; see drivers/char/random.c for details.
+ *
+ * The expensive part of this is the "(1ull<<32) % range" computation.
+ * If that's a compile-time constant, things get simple enough to inline.
+ */
+/* When the range is a compile-time constant, so is lim, and it's simple. */
+static __always_inline u32 _get_random_max32(u32 range)
+{
+	u32 lim = -range % range;
+	u64 prod;
+
+	do
+		prod = mul_u32_u32(get_random_u32(), range);
+	while (unlikely((u32)prod < lim));
+	return prod >> 32;
+}
+/* Non-constant ranges are done out of line. */
+u32 __get_random_max32(u32 range);
+
+/* Decide between the two.  (Case #3 is power of two.) */
+static __always_inline u32 get_random_max32(u32 range)
+{
+	if (!__builtin_constant_p(range))
+		return __get_random_max32(range);
+	else if (range & (range - 1))
+		return _get_random_max32(range);
+	else
+		return get_random_u32() / (u32)(0x100000000 / range);
+}
+
+#if CONFIG_64BIT
+/*
+ * The 64-bit equivalent is trickier.  If we have 64x64->128-bit multiply,
+ * the same algorithm works.  If we don't (*cough* SPARCv9 *cough*),
+ * we can do it in two halves.
+ */
+static __always_inline u64 _get_random_max64(u64 range)
+{
+#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
+	u64 lim = -range % range;
+	unsigned __int128 prod;
+
+	do
+		prod = (__int128)get_random_u64() * range;
+	while (unlikely((u64)prod < lim));
+	return prod >> 64;
+
+#else /* No 128-bit product - this is messy */
+	u32 lsbits, msbits = (range + 0xffffffff) >> 32;
+	/* If range >= 0xffffffff00000001, msbits will wrap to 0 */
+	if (msbits)
+		msbits = get_random_max32(msbits)
+	else
+		msbits = get_random_u32();
+	if ((u32)range && unlikely(msbits == (u32)(range >> 32)))
+		lsbits = get_random_max32((u32)range);
+	else
+		lsbits = get_random_u32()
+	return (u64)msbits << 32 | lsbits;
+#endif
+}
+
+u64 __get_random_max64(u64 range);
+
+static __always_inline u64 get_random_max64(u64 range)
+{
+	/*
+	 * We may know at compile time that range is 32 bits,
+	 * even if we don't know its exact value.
+	 */
+	if (__builtin_constant_p(range <= 0xffffffff) && range <= 0xffffffff)
+		return get_random_max32(range);
+	else if (!__builtin_constant_p(range))
+		return __get_random_max64(range);
+	else if (range & (range - 1))
+		return _get_random_max64(range);
+	else
+		return get_random_u64() / (0x100000000 / (range >> 32));
+}
+#endif
+
+/**
+ * get_random_max - Generate a uniform random number 0 <= x < range
+ * @range:	Modulus for random number; must not be zero!
+ *
+ * This uses a cryptographic random number generator, for safety
+ * against malicious attackers trying to guess the output.
+ *
+ * This generates exactly uniform random numbers in the range, retrying
+ * occasionally if range is not a power of two.  If we're going to
+ * go to the expense of a cryptographic RNG, we may as well get the
+ * distribution right.
+ *
+ * For less critical applications (self-tests, error injection, dithered
+ * timers), use prandom_u32_max().
+ */
+static __always_inline unsigned long get_random_max(unsigned long range)
+{
+#if BITS_PER_LONG == 64
+	return get_random_max64(range);
+#else
+	return get_random_max32(range);
+#endif
+}
+
 /*
  * On 64-bit architectures, protect against non-terminated C string overflows
  * by zeroing out the first byte of the canary; this leaves 56 bits of entropy.
-- 
2.20.1

next prev parent reply	other threads:[~2019-03-27 21:43 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-26 11:10 Revising prandom_32 generator George Spelvin
2019-03-26 11:17 ` George Spelvin
2019-03-26 18:03   ` Hannes Frederic Sowa
2019-03-26 19:07     ` George Spelvin
2019-03-26 19:23       ` Stephen Hemminger
2019-03-27 18:32       ` Hannes Frederic Sowa
2019-03-27 21:43         ` George Spelvin [this message]
2019-03-26 14:58 ` Stephen Hemminger
2019-03-26 16:24   ` George Spelvin

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:9f4a348c7eb dfblob:ffe6af0e165 dfblob:25c0e3f0152
dfblob:fc929369471 )
 OR (
bs:"random: add get_random_max() function" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=201903272143.x2RLh1AP002035@sdf.org \
    --to=lkml@sdf.org \
    --cc=daniel@iogearbox.net \
    --cc=hannes@stressinduktion.org \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.