Linux cryptographic layer development

Linux cryptographic layer development
 help / color / mirror / Atom feed

* [PATCH v7 6/6] siphash: implement HalfSipHash1-3 for hash tables
From: Jason A. Donenfeld @ 2016-12-21 23:02 UTC (permalink / raw)
  To: Netdev, kernel-hardening, LKML, linux-crypto, David Laight,
	Ted Tso, Hannes Frederic Sowa, edumazet, Linus Torvalds,
	Eric Biggers, Tom Herbert, ak, davem, luto,
	Jean-Philippe Aumasson
  Cc: Jason A. Donenfeld
In-Reply-To: <20161221230216.25341-1-Jason@zx2c4.com>

HalfSipHash, or hsiphash, is a shortened version of SipHash, which
generates 32-bit outputs using a weaker 64-bit key. It has *much* lower
security margins, and shouldn't be used for anything too sensitive, but
it could be used as a hashtable key function replacement, if the output
is never exposed, and if the security requirement is not too high.

The goal is to make this something that performance-critical jhash users
would be willing to use.

On 64-bit machines, HalfSipHash1-3 is slower than SipHash1-3, so we alias
SipHash1-3 to HalfSipHash1-3 on those systems.

64-bit x86_64:
[    0.509409] test_siphash:     SipHash2-4 cycles: 4049181
[    0.510650] test_siphash:     SipHash1-3 cycles: 2512884
[    0.512205] test_siphash: HalfSipHash1-3 cycles: 3429920
[    0.512904] test_siphash:    JenkinsHash cycles:  978267
So, we map hsiphash() -> SipHash1-3

32-bit x86:
[    0.509868] test_siphash:     SipHash2-4 cycles: 14812892
[    0.513601] test_siphash:     SipHash1-3 cycles:  9510710
[    0.515263] test_siphash: HalfSipHash1-3 cycles:  3856157
[    0.515952] test_siphash:    JenkinsHash cycles:  1148567
So, we map hsiphash() -> HalfSipHash1-3

hsiphash() is roughly 3 times slower than jhash(), but comes with a
considerable security improvement.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
---
 Documentation/siphash.txt |  75 +++++++++++
 include/linux/siphash.h   |  56 +++++++-
 lib/siphash.c             | 318 +++++++++++++++++++++++++++++++++++++++++++++-
 lib/test_siphash.c        | 139 ++++++++++++++++----
 4 files changed, 561 insertions(+), 27 deletions(-)

diff --git a/Documentation/siphash.txt b/Documentation/siphash.txt
index 39ff7f0438e7..f93c1d7104c4 100644
--- a/Documentation/siphash.txt
+++ b/Documentation/siphash.txt
@@ -77,3 +77,78 @@ Linux implements the "2-4" variant of SipHash.
 
 Read the SipHash paper if you're interested in learning more:
 https://131002.net/siphash/siphash.pdf
+
+
+~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~
+
+HalfSipHash - SipHash's insecure younger cousin
+-----------------------------------------------
+Written by Jason A. Donenfeld <jason@zx2c4.com>
+
+On the off-chance that SipHash is not fast enough for your needs, you might be
+able to justify using HalfSipHash, a terrifying but potentially useful
+possibility. HalfSipHash cuts SipHash's rounds down from "2-4" to "1-3" and,
+even scarier, uses an easily brute-forcable 64-bit key (with a 32-bit output)
+instead of SipHash's 128-bit key. However, this may appeal to some
+high-performance `jhash` users.
+
+Danger!
+
+Do not ever use HalfSipHash except for as a hashtable key function, and only
+then when you can be absolutely certain that the outputs will never be
+transmitted out of the kernel. This is only remotely useful over `jhash` as a
+means of mitigating hashtable flooding denial of service attacks.
+
+1. Generating a key
+
+Keys should always be generated from a cryptographically secure source of
+random numbers, either using get_random_bytes or get_random_once:
+
+hsiphash_key_t key;
+get_random_bytes(key, sizeof(key));
+
+If you're not deriving your key from here, you're doing it wrong.
+
+2. Using the functions
+
+There are two variants of the function, one that takes a list of integers, and
+one that takes a buffer:
+
+u32 hsiphash(const void *data, size_t len, siphash_key_t key);
+
+And:
+
+u32 hsiphash_1u32(u32, hsiphash_key_t key);
+u32 hsiphash_2u32(u32, u32, hsiphash_key_t key);
+u32 hsiphash_3u32(u32, u32, u32, hsiphash_key_t key);
+u32 hsiphash_4u32(u32, u32, u32, u32, hsiphash_key_t key);
+
+If you pass the generic hsiphash function something of a constant length, it
+will constant fold at compile-time and automatically choose one of the
+optimized functions.
+
+3. Hashtable key function usage:
+
+struct some_hashtable {
+	DECLARE_HASHTABLE(hashtable, 8);
+	hsiphash_key_t key;
+};
+
+void init_hashtable(struct some_hashtable *table)
+{
+	get_random_bytes(table->key, sizeof(table->key));
+}
+
+static inline hlist_head *some_hashtable_bucket(struct some_hashtable *table, struct interesting_input *input)
+{
+	return &table->hashtable[hsiphash(input, sizeof(*input), table->key) & (HASH_SIZE(table->hashtable) - 1)];
+}
+
+You may then iterate like usual over the returned hash bucket.
+
+4. Performance
+
+HalfSipHash is roughly 3 times slower than JenkinsHash. For many replacements,
+this will not be a problem, as the hashtable lookup isn't the bottleneck. And
+in general, this is probably a good sacrifice to make for the security and DoS
+resistance of HalfSipHash.
diff --git a/include/linux/siphash.h b/include/linux/siphash.h
index 7aa666eb00d9..efab44c654f3 100644
--- a/include/linux/siphash.h
+++ b/include/linux/siphash.h
@@ -5,7 +5,9 @@
  * SipHash: a fast short-input PRF
  * https://131002.net/siphash/
  *
- * This implementation is specifically for SipHash2-4.
+ * This implementation is specifically for SipHash2-4 for a secure PRF
+ * and HalfSipHash1-3/SipHash1-3 for an insecure PRF only suitable for
+ * hashtables.
  */
 
 #ifndef _LINUX_SIPHASH_H
@@ -76,4 +78,56 @@ static inline u64 siphash(const void *data, size_t len, const siphash_key_t key)
 	return ___siphash_aligned(data, len, key);
 }
 
+#if BITS_PER_LONG == 64
+typedef siphash_key_t hsiphash_key_t;
+#define HSIPHASH_ALIGNMENT SIPHASH_ALIGNMENT
+#else
+typedef u32 hsiphash_key_t[2];
+#define HSIPHASH_ALIGNMENT __alignof__(u32)
+#endif
+
+u32 __hsiphash_aligned(const void *data, size_t len, const hsiphash_key_t key);
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+u32 __hsiphash_unaligned(const void *data, size_t len, const hsiphash_key_t key);
+#endif
+
+u32 hsiphash_1u32(const u32 a, const hsiphash_key_t key);
+u32 hsiphash_2u32(const u32 a, const u32 b, const hsiphash_key_t key);
+u32 hsiphash_3u32(const u32 a, const u32 b, const u32 c,
+		  const hsiphash_key_t key);
+u32 hsiphash_4u32(const u32 a, const u32 b, const u32 c, const u32 d,
+		  const hsiphash_key_t key);
+
+static inline u32 ___hsiphash_aligned(const __le32 *data, size_t len, const hsiphash_key_t key)
+{
+	if (__builtin_constant_p(len) && len == 4)
+		return hsiphash_1u32(le32_to_cpu(data[0]), key);
+	if (__builtin_constant_p(len) && len == 8)
+		return hsiphash_2u32(le32_to_cpu(data[0]), le32_to_cpu(data[1]),
+				     key);
+	if (__builtin_constant_p(len) && len == 12)
+		return hsiphash_3u32(le32_to_cpu(data[0]), le32_to_cpu(data[1]),
+				     le32_to_cpu(data[2]), key);
+	if (__builtin_constant_p(len) && len == 16)
+		return hsiphash_4u32(le32_to_cpu(data[0]), le32_to_cpu(data[1]),
+				     le32_to_cpu(data[2]), le32_to_cpu(data[3]),
+				     key);
+	return __hsiphash_aligned(data, len, key);
+}
+
+/**
+ * hsiphash - compute 32-bit hsiphash PRF value
+ * @data: buffer to hash
+ * @size: size of @data
+ * @key: the hsiphash key
+ */
+static inline u32 hsiphash(const void *data, size_t len, const hsiphash_key_t key)
+{
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+	if (!IS_ALIGNED((unsigned long)data, HSIPHASH_ALIGNMENT))
+		return __hsiphash_unaligned(data, len, key);
+#endif
+	return ___hsiphash_aligned(data, len, key);
+}
+
 #endif /* _LINUX_SIPHASH_H */
diff --git a/lib/siphash.c b/lib/siphash.c
index ff2151313667..e2481226d96c 100644
--- a/lib/siphash.c
+++ b/lib/siphash.c
@@ -5,7 +5,9 @@
  * SipHash: a fast short-input PRF
  * https://131002.net/siphash/
  *
- * This implementation is specifically for SipHash2-4.
+ * This implementation is specifically for SipHash2-4 for a secure PRF
+ * and HalfSipHash1-3/SipHash1-3 for an insecure PRF only suitable for
+ * hashtables.
  */
 
 #include <linux/siphash.h>
@@ -230,3 +232,317 @@ u64 siphash_3u32(const u32 first, const u32 second, const u32 third,
 	POSTAMBLE
 }
 EXPORT_SYMBOL(siphash_3u32);
+
+#if BITS_PER_LONG == 64
+/* Note that this HalfSipHash1-3 implementation on 64-bit
+ * isn't actually HalfSipHash1-3 but rather SipHash1-3. */
+
+#define HSIPROUND SIPROUND
+#define HPREAMBLE(len) PREAMBLE(len)
+#define HPOSTAMBLE \
+	v3 ^= b; \
+	HSIPROUND; \
+	v0 ^= b; \
+	v2 ^= 0xff; \
+	HSIPROUND; \
+	HSIPROUND; \
+	HSIPROUND; \
+	return (v0 ^ v1) ^ (v2 ^ v3);
+
+u32 __hsiphash_aligned(const void *data, size_t len, const hsiphash_key_t key)
+{
+	const u8 *end = data + len - (len % sizeof(u64));
+	const u8 left = len & (sizeof(u64) - 1);
+	u64 m;
+	HPREAMBLE(len)
+	for (; data != end; data += sizeof(u64)) {
+		m = le64_to_cpup(data);
+		v3 ^= m;
+		HSIPROUND;
+		v0 ^= m;
+	}
+#if defined(CONFIG_DCACHE_WORD_ACCESS) && BITS_PER_LONG == 64
+	if (left)
+		b |= le64_to_cpu((__force __le64)(load_unaligned_zeropad(data) &
+						  bytemask_from_count(left)));
+#else
+	switch (left) {
+	case 7: b |= ((u64)end[6]) << 48;
+	case 6: b |= ((u64)end[5]) << 40;
+	case 5: b |= ((u64)end[4]) << 32;
+	case 4: b |= le32_to_cpup(data); break;
+	case 3: b |= ((u64)end[2]) << 16;
+	case 2: b |= le16_to_cpup(data); break;
+	case 1: b |= end[0];
+	}
+#endif
+	HPOSTAMBLE
+}
+EXPORT_SYMBOL(__hsiphash_aligned);
+
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+u32 __hsiphash_unaligned(const void *data, size_t len, const hsiphash_key_t key)
+{
+	const u8 *end = data + len - (len % sizeof(u64));
+	const u8 left = len & (sizeof(u64) - 1);
+	u64 m;
+	HPREAMBLE(len)
+	for (; data != end; data += sizeof(u64)) {
+		m = get_unaligned_le64(data);
+		v3 ^= m;
+		HSIPROUND;
+		v0 ^= m;
+	}
+#if defined(CONFIG_DCACHE_WORD_ACCESS) && BITS_PER_LONG == 64
+	if (left)
+		b |= le64_to_cpu((__force __le64)(load_unaligned_zeropad(data) &
+						  bytemask_from_count(left)));
+#else
+	switch (left) {
+	case 7: b |= ((u64)end[6]) << 48;
+	case 6: b |= ((u64)end[5]) << 40;
+	case 5: b |= ((u64)end[4]) << 32;
+	case 4: b |= get_unaligned_le32(end); break;
+	case 3: b |= ((u64)end[2]) << 16;
+	case 2: b |= get_unaligned_le16(end); break;
+	case 1: b |= end[0];
+	}
+#endif
+	HPOSTAMBLE
+}
+EXPORT_SYMBOL(__hsiphash_unaligned);
+#endif
+
+/**
+ * hsiphash_1u32 - compute 64-bit hsiphash PRF value of a u32
+ * @first: first u32
+ * @key: the hsiphash key
+ */
+u32 hsiphash_1u32(const u32 first, const hsiphash_key_t key)
+{
+	HPREAMBLE(4)
+	b |= first;
+	HPOSTAMBLE
+}
+EXPORT_SYMBOL(hsiphash_1u32);
+
+/**
+ * hsiphash_2u32 - compute 32-bit hsiphash PRF value of 2 u32
+ * @first: first u32
+ * @second: second u32
+ * @key: the hsiphash key
+ */
+u32 hsiphash_2u32(const u32 first, const u32 second, const hsiphash_key_t key)
+{
+	u64 combined = (u64)second << 32 | first;
+	HPREAMBLE(8)
+	v3 ^= combined;
+	HSIPROUND;
+	v0 ^= combined;
+	HPOSTAMBLE
+}
+EXPORT_SYMBOL(hsiphash_2u32);
+
+/**
+ * hsiphash_3u32 - compute 32-bit hsiphash PRF value of 3 u32
+ * @first: first u32
+ * @second: second u32
+ * @third: third u32
+ * @key: the hsiphash key
+ */
+u32 hsiphash_3u32(const u32 first, const u32 second, const u32 third,
+		  const hsiphash_key_t key)
+{
+	u64 combined = (u64)second << 32 | first;
+	HPREAMBLE(12)
+	v3 ^= combined;
+	HSIPROUND;
+	v0 ^= combined;
+	b |= third;
+	HPOSTAMBLE
+}
+EXPORT_SYMBOL(hsiphash_3u32);
+
+/**
+ * hsiphash_4u32 - compute 32-bit hsiphash PRF value of 4 u32
+ * @first: first u32
+ * @second: second u32
+ * @third: third u32
+ * @forth: forth u32
+ * @key: the hsiphash key
+ */
+u32 hsiphash_4u32(const u32 first, const u32 second, const u32 third,
+		  const u32 forth, const hsiphash_key_t key)
+{
+	u64 combined = (u64)second << 32 | first;
+	HPREAMBLE(16)
+	v3 ^= combined;
+	HSIPROUND;
+	v0 ^= combined;
+	combined = (u64)forth << 32 | third;
+	v3 ^= combined;
+	HSIPROUND;
+	v0 ^= combined;
+	HPOSTAMBLE
+}
+EXPORT_SYMBOL(hsiphash_4u32);
+#else
+#define HSIPROUND \
+	do { \
+	v0 += v1; v1 = rol32(v1, 5); v1 ^= v0; v0 = rol32(v0, 16); \
+	v2 += v3; v3 = rol32(v3, 8); v3 ^= v2; \
+	v0 += v3; v3 = rol32(v3, 7); v3 ^= v0; \
+	v2 += v1; v1 = rol32(v1, 13); v1 ^= v2; v2 = rol32(v2, 16); \
+	} while(0)
+
+#define HPREAMBLE(len) \
+	u32 v0 = 0; \
+	u32 v1 = 0; \
+	u32 v2 = 0x6c796765U; \
+	u32 v3 = 0x74656462U; \
+	u32 b = ((u32)len) << 24; \
+	v3 ^= key[1]; \
+	v2 ^= key[0]; \
+	v1 ^= key[1]; \
+	v0 ^= key[0];
+
+#define HPOSTAMBLE \
+	v3 ^= b; \
+	HSIPROUND; \
+	v0 ^= b; \
+	v2 ^= 0xff; \
+	HSIPROUND; \
+	HSIPROUND; \
+	HSIPROUND; \
+	return v1 ^ v3;
+
+u32 __hsiphash_aligned(const void *data, size_t len, const hsiphash_key_t key)
+{
+	const u8 *end = data + len - (len % sizeof(u32));
+	const u8 left = len & (sizeof(u32) - 1);
+	u32 m;
+	HPREAMBLE(len)
+	for (; data != end; data += sizeof(u32)) {
+		m = le32_to_cpup(data);
+		v3 ^= m;
+		HSIPROUND;
+		v0 ^= m;
+	}
+	switch (left) {
+	case 3: b |= ((u32)end[2]) << 16;
+	case 2: b |= le16_to_cpup(data); break;
+	case 1: b |= end[0];
+	}
+	HPOSTAMBLE
+}
+EXPORT_SYMBOL(__hsiphash_aligned);
+
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+u32 __hsiphash_unaligned(const void *data, size_t len, const hsiphash_key_t key)
+{
+	const u8 *end = data + len - (len % sizeof(u32));
+	const u8 left = len & (sizeof(u32) - 1);
+	u32 m;
+	HPREAMBLE(len)
+	for (; data != end; data += sizeof(u32)) {
+		m = get_unaligned_le32(data);
+		v3 ^= m;
+		HSIPROUND;
+		v0 ^= m;
+	}
+	switch (left) {
+	case 3: b |= ((u32)end[2]) << 16;
+	case 2: b |= get_unaligned_le16(end); break;
+	case 1: b |= end[0];
+	}
+	HPOSTAMBLE
+}
+EXPORT_SYMBOL(__hsiphash_unaligned);
+#endif
+
+/**
+ * hsiphash_1u32 - compute 32-bit hsiphash PRF value of a u32
+ * @first: first u32
+ * @key: the hsiphash key
+ */
+u32 hsiphash_1u32(const u32 first, const hsiphash_key_t key)
+{
+	HPREAMBLE(4)
+	v3 ^= first;
+	HSIPROUND;
+	v0 ^= first;
+	HPOSTAMBLE
+}
+EXPORT_SYMBOL(hsiphash_1u32);
+
+/**
+ * hsiphash_2u32 - compute 32-bit hsiphash PRF value of 2 u32
+ * @first: first u32
+ * @second: second u32
+ * @key: the hsiphash key
+ */
+u32 hsiphash_2u32(const u32 first, const u32 second, const hsiphash_key_t key)
+{
+	HPREAMBLE(8)
+	v3 ^= first;
+	HSIPROUND;
+	v0 ^= first;
+	v3 ^= second;
+	HSIPROUND;
+	v0 ^= second;
+	HPOSTAMBLE
+}
+EXPORT_SYMBOL(hsiphash_2u32);
+
+/**
+ * hsiphash_3u32 - compute 32-bit hsiphash PRF value of 3 u32
+ * @first: first u32
+ * @second: second u32
+ * @third: third u32
+ * @key: the hsiphash key
+ */
+u32 hsiphash_3u32(const u32 first, const u32 second, const u32 third,
+		  const hsiphash_key_t key)
+{
+	HPREAMBLE(12)
+	v3 ^= first;
+	HSIPROUND;
+	v0 ^= first;
+	v3 ^= second;
+	HSIPROUND;
+	v0 ^= second;
+	v3 ^= third;
+	HSIPROUND;
+	v0 ^= third;
+	HPOSTAMBLE
+}
+EXPORT_SYMBOL(hsiphash_3u32);
+
+/**
+ * hsiphash_4u32 - compute 32-bit hsiphash PRF value of 4 u32
+ * @first: first u32
+ * @second: second u32
+ * @third: third u32
+ * @forth: forth u32
+ * @key: the hsiphash key
+ */
+u32 hsiphash_4u32(const u32 first, const u32 second, const u32 third,
+		  const u32 forth, const hsiphash_key_t key)
+{
+	HPREAMBLE(16)
+	v3 ^= first;
+	HSIPROUND;
+	v0 ^= first;
+	v3 ^= second;
+	HSIPROUND;
+	v0 ^= second;
+	v3 ^= third;
+	HSIPROUND;
+	v0 ^= third;
+	v3 ^= forth;
+	HSIPROUND;
+	v0 ^= forth;
+	HPOSTAMBLE
+}
+EXPORT_SYMBOL(hsiphash_4u32);
+#endif
diff --git a/lib/test_siphash.c b/lib/test_siphash.c
index e0ba2cf8dc67..ac291ec27fb6 100644
--- a/lib/test_siphash.c
+++ b/lib/test_siphash.c
@@ -7,7 +7,9 @@
  * SipHash: a fast short-input PRF
  * https://131002.net/siphash/
  *
- * This implementation is specifically for SipHash2-4.
+ * This implementation is specifically for SipHash2-4 for a secure PRF
+ * and HalfSipHash1-3/SipHash1-3 for an insecure PRF only suitable for
+ * hashtables.
  */
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
@@ -18,10 +20,16 @@
 #include <linux/errno.h>
 #include <linux/module.h>
 
-/* Test vectors taken from official reference source available at:
- *     https://131002.net/siphash/siphash24.c
+/* Test vectors taken from reference source available at:
+ *     https://github.com/veorq/SipHash
  */
-static const u64 test_vectors[64] = {
+
+
+
+static const siphash_key_t test_key_siphash =
+	{ 0x0706050403020100ULL , 0x0f0e0d0c0b0a0908ULL };
+
+static const u64 test_vectors_siphash[64] = {
 	0x726fdb47dd0e0e31ULL, 0x74f839c593dc67fdULL, 0x0d6c8009d9a94f5aULL,
 	0x85676696d7fb7e2dULL, 0xcf2794e0277187b7ULL, 0x18765564cd99a68dULL,
 	0xcbc9466e58fee3ceULL, 0xab0200f58b01d137ULL, 0x93f5f5799a932462ULL,
@@ -45,9 +53,64 @@ static const u64 test_vectors[64] = {
 	0x6ca4ecb15c5f91e1ULL, 0x9f626da15c9625f3ULL, 0xe51b38608ef25f57ULL,
 	0x958a324ceb064572ULL
 };
-static const siphash_key_t test_key =
+#if BITS_PER_LONG == 64
+static const hsiphash_key_t test_key_hsiphash =
 	{ 0x0706050403020100ULL , 0x0f0e0d0c0b0a0908ULL };
 
+static const u32 test_vectors_hsiphash[64] = {
+	0x050fc4dcU, 0x7d57ca93U, 0x4dc7d44dU,
+	0xe7ddf7fbU, 0x88d38328U, 0x49533b67U,
+	0xc59f22a7U, 0x9bb11140U, 0x8d299a8eU,
+	0x6c063de4U, 0x92ff097fU, 0xf94dc352U,
+	0x57b4d9a2U, 0x1229ffa7U, 0xc0f95d34U,
+	0x2a519956U, 0x7d908b66U, 0x63dbd80cU,
+	0xb473e63eU, 0x8d297d1cU, 0xa6cce040U,
+	0x2b45f844U, 0xa320872eU, 0xdae6c123U,
+	0x67349c8cU, 0x705b0979U, 0xca9913a5U,
+	0x4ade3b35U, 0xef6cd00dU, 0x4ab1e1f4U,
+	0x43c5e663U, 0x8c21d1bcU, 0x16a7b60dU,
+	0x7a8ff9bfU, 0x1f2a753eU, 0xbf186b91U,
+	0xada26206U, 0xa3c33057U, 0xae3a36a1U,
+	0x7b108392U, 0x99e41531U, 0x3f1ad944U,
+	0xc8138825U, 0xc28949a6U, 0xfaf8876bU,
+	0x9f042196U, 0x68b1d623U, 0x8b5114fdU,
+	0xdf074c46U, 0x12cc86b3U, 0x0a52098fU,
+	0x9d292f9aU, 0xa2f41f12U, 0x43a71ed0U,
+	0x73f0bce6U, 0x70a7e980U, 0x243c6d75U,
+	0xfdb71513U, 0xa67d8a08U, 0xb7e8f148U,
+	0xf7a644eeU, 0x0f1837f2U, 0x4b6694e0U,
+	0xb7bbb3a8U
+};
+#else
+static const hsiphash_key_t test_key_hsiphash =
+	{ 0x03020100U, 0x07060504U };
+
+static const u32 test_vectors_hsiphash[64] = {
+	0x5814c896U, 0xe7e864caU, 0xbc4b0e30U,
+	0x01539939U, 0x7e059ea6U, 0x88e3d89bU,
+	0xa0080b65U, 0x9d38d9d6U, 0x577999b1U,
+	0xc839caedU, 0xe4fa32cfU, 0x959246eeU,
+	0x6b28096cU, 0x66dd9cd6U, 0x16658a7cU,
+	0xd0257b04U, 0x8b31d501U, 0x2b1cd04bU,
+	0x06712339U, 0x522aca67U, 0x911bb605U,
+	0x90a65f0eU, 0xf826ef7bU, 0x62512debU,
+	0x57150ad7U, 0x5d473507U, 0x1ec47442U,
+	0xab64afd3U, 0x0a4100d0U, 0x6d2ce652U,
+	0x2331b6a3U, 0x08d8791aU, 0xbc6dda8dU,
+	0xe0f6c934U, 0xb0652033U, 0x9b9851ccU,
+	0x7c46fb7fU, 0x732ba8cbU, 0xf142997aU,
+	0xfcc9aa1bU, 0x05327eb2U, 0xe110131cU,
+	0xf9e5e7c0U, 0xa7d708a6U, 0x11795ab1U,
+	0x65671619U, 0x9f5fff91U, 0xd89c5267U,
+	0x007783ebU, 0x95766243U, 0xab639262U,
+	0x9c7e1390U, 0xc368dda6U, 0x38ddc455U,
+	0xfa13d379U, 0x979ea4e8U, 0x53ecd77eU,
+	0x2ee80657U, 0x33dbb66aU, 0xae3f0577U,
+	0x88b4c4ccU, 0x3e7f480bU, 0x74c1ebf8U,
+	0x87178304U
+};
+#endif
+
 static int __init siphash_test_init(void)
 {
 	u8 in[64] __aligned(SIPHASH_ALIGNMENT);
@@ -58,49 +121,75 @@ static int __init siphash_test_init(void)
 	for (i = 0; i < 64; ++i) {
 		in[i] = i;
 		in_unaligned[i + 1] = i;
-		if (siphash(in, i, test_key) != test_vectors[i]) {
-			pr_info("self-test aligned %u: FAIL\n", i + 1);
+		if (siphash(in, i, test_key_siphash) != test_vectors_siphash[i]) {
+			pr_info("siphash self-test aligned %u: FAIL\n", i + 1);
+			ret = -EINVAL;
+		}
+		if (siphash(in_unaligned + 1, i, test_key_siphash) != test_vectors_siphash[i]) {
+			pr_info("siphash self-test unaligned %u: FAIL\n", i + 1);
 			ret = -EINVAL;
 		}
-		if (siphash(in_unaligned + 1, i, test_key) != test_vectors[i]) {
-			pr_info("self-test unaligned %u: FAIL\n", i + 1);
+		if (hsiphash(in, i, test_key_hsiphash) != test_vectors_hsiphash[i]) {
+			pr_info("hsiphash self-test aligned %u: FAIL\n", i + 1);
+			ret = -EINVAL;
+		}
+		if (hsiphash(in_unaligned + 1, i, test_key_hsiphash) != test_vectors_hsiphash[i]) {
+			pr_info("hsiphash self-test unaligned %u: FAIL\n", i + 1);
 			ret = -EINVAL;
 		}
 	}
-	if (siphash_1u64(0x0706050403020100ULL, test_key) != test_vectors[8]) {
-		pr_info("self-test 1u64: FAIL\n");
+	if (siphash_1u64(0x0706050403020100ULL, test_key_siphash) != test_vectors_siphash[8]) {
+		pr_info("siphash self-test 1u64: FAIL\n");
 		ret = -EINVAL;
 	}
-	if (siphash_2u64(0x0706050403020100ULL, 0x0f0e0d0c0b0a0908ULL, test_key) != test_vectors[16]) {
-		pr_info("self-test 2u64: FAIL\n");
+	if (siphash_2u64(0x0706050403020100ULL, 0x0f0e0d0c0b0a0908ULL, test_key_siphash) != test_vectors_siphash[16]) {
+		pr_info("siphash self-test 2u64: FAIL\n");
 		ret = -EINVAL;
 	}
 	if (siphash_3u64(0x0706050403020100ULL, 0x0f0e0d0c0b0a0908ULL,
-			 0x1716151413121110ULL, test_key) != test_vectors[24]) {
-		pr_info("self-test 3u64: FAIL\n");
+			 0x1716151413121110ULL, test_key_siphash) != test_vectors_siphash[24]) {
+		pr_info("siphash self-test 3u64: FAIL\n");
 		ret = -EINVAL;
 	}
 	if (siphash_4u64(0x0706050403020100ULL, 0x0f0e0d0c0b0a0908ULL,
-			 0x1716151413121110ULL, 0x1f1e1d1c1b1a1918ULL, test_key) != test_vectors[32]) {
-		pr_info("self-test 4u64: FAIL\n");
+			 0x1716151413121110ULL, 0x1f1e1d1c1b1a1918ULL, test_key_siphash) != test_vectors_siphash[32]) {
+		pr_info("siphash self-test 4u64: FAIL\n");
 		ret = -EINVAL;
 	}
-	if (siphash_1u32(0x03020100U, test_key) != test_vectors[4]) {
-		pr_info("self-test 1u32: FAIL\n");
+	if (siphash_1u32(0x03020100U, test_key_siphash) != test_vectors_siphash[4]) {
+		pr_info("siphash self-test 1u32: FAIL\n");
 		ret = -EINVAL;
 	}
-	if (siphash_2u32(0x03020100U, 0x07060504U, test_key) != test_vectors[8]) {
-		pr_info("self-test 2u32: FAIL\n");
+	if (siphash_2u32(0x03020100U, 0x07060504U, test_key_siphash) != test_vectors_siphash[8]) {
+		pr_info("siphash self-test 2u32: FAIL\n");
 		ret = -EINVAL;
 	}
 	if (siphash_3u32(0x03020100U, 0x07060504U,
-			 0x0b0a0908U, test_key) != test_vectors[12]) {
-		pr_info("self-test 3u32: FAIL\n");
+			 0x0b0a0908U, test_key_siphash) != test_vectors_siphash[12]) {
+		pr_info("siphash self-test 3u32: FAIL\n");
 		ret = -EINVAL;
 	}
 	if (siphash_4u32(0x03020100U, 0x07060504U,
-			 0x0b0a0908U, 0x0f0e0d0cU, test_key) != test_vectors[16]) {
-		pr_info("self-test 4u32: FAIL\n");
+			 0x0b0a0908U, 0x0f0e0d0cU, test_key_siphash) != test_vectors_siphash[16]) {
+		pr_info("siphash self-test 4u32: FAIL\n");
+		ret = -EINVAL;
+	}
+	if (hsiphash_1u32(0x03020100U, test_key_hsiphash) != test_vectors_hsiphash[4]) {
+		pr_info("hsiphash self-test 1u32: FAIL\n");
+		ret = -EINVAL;
+	}
+	if (hsiphash_2u32(0x03020100U, 0x07060504U, test_key_hsiphash) != test_vectors_hsiphash[8]) {
+		pr_info("hsiphash self-test 2u32: FAIL\n");
+		ret = -EINVAL;
+	}
+	if (hsiphash_3u32(0x03020100U, 0x07060504U,
+			  0x0b0a0908U, test_key_hsiphash) != test_vectors_hsiphash[12]) {
+		pr_info("hsiphash self-test 3u32: FAIL\n");
+		ret = -EINVAL;
+	}
+	if (hsiphash_4u32(0x03020100U, 0x07060504U,
+			  0x0b0a0908U, 0x0f0e0d0cU, test_key_hsiphash) != test_vectors_hsiphash[16]) {
+		pr_info("hsiphash self-test 4u32: FAIL\n");
 		ret = -EINVAL;
 	}
 	if (!ret)
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH v7 3/6] random: use SipHash in place of MD5
From: Jason A. Donenfeld @ 2016-12-21 23:13 UTC (permalink / raw)
  To: Netdev, kernel-hardening, LKML, Linux Crypto Mailing List,
	David Laight, Ted Tso, Hannes Frederic Sowa, Eric Dumazet,
	Linus Torvalds, Eric Biggers, Tom Herbert, Andi Kleen,
	David Miller, Andy Lutomirski, Jean-Philippe Aumasson
  Cc: Jason A. Donenfeld
In-Reply-To: <20161221230216.25341-4-Jason@zx2c4.com>

Hi Ted,

On Thu, Dec 22, 2016 at 12:02 AM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> This duplicates the current algorithm for get_random_int/long

I should have mentioned this directly in the commit message, which I
forgot to update: this v7 adds the time-based key rotation, which,
while not strictly necessary for ensuring the security of the RNG,
might help alleviate some concerns, as we talked about. Performance is
quite good on both 32-bit and 64-bit -- better than MD5 in both cases.

If you like this, terrific. If not, I'm happy to take this in whatever
direction you prefer, and implement whatever construction you think
best. There's been a lot of noise on this list about it; we can
continue to discuss more, or you can just tell me whatever you want to
do, and I'll implement it and that'll be the end of it. As you said,
we can always get something decent now and improve it later.

Alternatively, if you've decided in the end you prefer your batched
entropy approach using chacha, I'm happy to implement a polished
version of that here in this patch series (so that we can keep the `rm
lib/md5.c` commit.)

Just let me know how you'd like to proceed.

Thanks,
Jason

^ permalink raw reply

* Re: [PATCH v7 3/6] random: use SipHash in place of MD5
From: Andy Lutomirski @ 2016-12-21 23:42 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Netdev, kernel-hardening@lists.openwall.com, LKML,
	Linux Crypto Mailing List, David Laight, Ted Tso,
	Hannes Frederic Sowa, Eric Dumazet, Linus Torvalds, Eric Biggers,
	Tom Herbert, Andi Kleen, David S. Miller, Jean-Philippe Aumasson
In-Reply-To: <20161221230216.25341-4-Jason@zx2c4.com>

On Wed, Dec 21, 2016 at 3:02 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>  unsigned int get_random_int(void)
>  {
> -       __u32 *hash;
> -       unsigned int ret;
> -
> -       if (arch_get_random_int(&ret))
> -               return ret;
> -
> -       hash = get_cpu_var(get_random_int_hash);
> -
> -       hash[0] += current->pid + jiffies + random_get_entropy();
> -       md5_transform(hash, random_int_secret);
> -       ret = hash[0];
> -       put_cpu_var(get_random_int_hash);
> -
> -       return ret;
> +       unsigned int arch_result;
> +       u64 result;
> +       struct random_int_secret *secret;
> +
> +       if (arch_get_random_int(&arch_result))
> +               return arch_result;
> +
> +       secret = get_random_int_secret();
> +       result = siphash_3u64(secret->chaining, jiffies,
> +                             (u64)random_get_entropy() + current->pid,
> +                             secret->secret);
> +       secret->chaining += result;
> +       put_cpu_var(secret);
> +       return result;
>  }
>  EXPORT_SYMBOL(get_random_int);

Hmm.  I haven't tried to prove anything for real.  But here goes (in
the random oracle model):

Suppose I'm an attacker and I don't know the secret or the chaining
value.  Then, regardless of what the entropy is, I can't predict the
numbers.

Now suppose I do know the secret and the chaining value due to some
leak.  If I want to deduce prior outputs, I think I'm stuck: I'd need
to find a value "result" such that prev_chaining + result = chaining
and result = H(prev_chaining, ..., secret);.  I don't think this can
be done efficiently in the random oracle model regardless of what the
"..." is.

But, if I know the secret and chaining value, I can predict the next
output assuming I can guess the entropy.  What's worse is that, even
if I can't guess the entropy, if I *observe* the next output then I
can calculate the next chaining value.

So this is probably good enough, and making it better is hard.  Changing it to:

u64 entropy = (u64)random_get_entropy() + current->pid;
result = siphash(..., entropy, ...);
secret->chaining += result + entropy;

would reduce this problem by forcing an attacker to brute-force the
entropy on each iteration, which is probably an improvement.

To fully fix it, something like "catastrophic reseeding" would be
needed, but that's hard to get right.

(An aside: on x86 at least, using two percpu variables is faster
because directly percpu access is essentially free, whereas getting
the address of a percpu variable is not free.)

^ permalink raw reply

* Re: HalfSipHash Acceptable Usage
From: George Spelvin @ 2016-12-22  0:18 UTC (permalink / raw)
  To: linux, tytso
  Cc: ak, davem, David.Laight, djb, ebiggers3, eric.dumazet, hannes,
	Jason, jeanphilippe.aumasson, kernel-hardening, linux-crypto,
	linux-kernel, luto, netdev, tom, torvalds, vegard.nossum
In-Reply-To: <20161221222702.h2vboms776zpgpi4@thunk.org>

Theodore Ts'o wrote:
> On Wed, Dec 21, 2016 at 01:37:51PM -0500, George Spelvin wrote:
>> SipHash annihilates the competition on 64-bit superscalar hardware.
>> SipHash dominates the field on 64-bit in-order hardware.
>> SipHash wins easily on 32-bit hardware *with enough registers*.
>> On register-starved 32-bit machines, it really struggles.

> And "with enough registers" includes ARM and MIPS, right?

Yes.  As a matter of fact, 32-bit ARM does particularly well
on 64-bit SipHash due to its shift+op instructions.

There is a noticeable performance drop, but nothing catastrophic.

The main thing I've been worried about is all the flow tracking
and NAT done by small home routers, and that's addressed by using
HalfSipHash for the hash tables.  They don't *initiate* a lot of
TCP sessions.

> So the only
> real problem is 32-bit x86, and you're right, at that point, only
> people who might care are people who are using a space-radiation
> hardened 386 --- and they're not likely to be doing high throughput
> TCP connections.  :-)

The only requirement on performance is "don't make DaveM angry." :-)

I was just trying to answer the question of why we *worried* about the
performance, not specifically argue that we *should* use HalfSipHash.

^ permalink raw reply

* Re: [PATCH v7 6/6] siphash: implement HalfSipHash1-3 for hash tables
From: Andi Kleen @ 2016-12-22  0:46 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Netdev, kernel-hardening, LKML, linux-crypto, David Laight,
	Ted Tso, Hannes Frederic Sowa, edumazet, Linus Torvalds,
	Eric Biggers, Tom Herbert, davem, luto, Jean-Philippe Aumasson
In-Reply-To: <20161221230216.25341-7-Jason@zx2c4.com>

> 64-bit x86_64:
> [    0.509409] test_siphash:     SipHash2-4 cycles: 4049181
> [    0.510650] test_siphash:     SipHash1-3 cycles: 2512884
> [    0.512205] test_siphash: HalfSipHash1-3 cycles: 3429920
> [    0.512904] test_siphash:    JenkinsHash cycles:  978267

I'm not sure what these numbers mean. Surely a single siphash2-4
does not take 4+ million cycles? 

If you run them in a loop please divide by the iterations.

But generally running small code in a loop is often an unrealistic
benchmark strategy because it hides cache misses, primes
predictors, changes frequencies and changes memory costs,
but also can overload pipelines and oversubscribe
resources.

[see also page 46+ in http://halobates.de/applicative-mental-models.pdf]

So the numbers you get there are at least somewhat
dubious. It would be good to have at least some test which
is not just a tiny micro benchmark to compare before making
conclusions.

-Andi

^ permalink raw reply

* Re: HalfSipHash Acceptable Usage
From: George Spelvin @ 2016-12-22  1:13 UTC (permalink / raw)
  To: linux, tytso
  Cc: ak, davem, David.Laight, djb, ebiggers3, eric.dumazet, hannes,
	Jason, jeanphilippe.aumasson, kernel-hardening, linux-crypto,
	linux-kernel, luto, netdev, tom, torvalds, vegard.nossum
In-Reply-To: <20161221222702.h2vboms776zpgpi4@thunk.org>

As a separate message, to disentangle the threads, I'd like to
talk about get_random_long().

After some thinking, I still like the "state-preserving" construct
that's equivalent to the current MD5 code.  Yes, we could just do
siphash(current_cpu || per_cpu_counter, global_key), but it's nice to
preserve a bit more.

It requires library support from the SipHash code to return the full
SipHash state, but I hope that's a fair thing to ask for.

Here's my current straw man design for comment.  It's very similar to
the current MD5-based design, but feeds all the seed material in the
"correct" way, as opposed to Xring directly into the MD5 state.

* Each CPU has a (Half)SipHash state vector,
  "unsigned long get_random_int_hash[4]".  Unlike the current
  MD5 code, we take care to initialize it to an asymmetric state.

* There's a global 256-bit random_int_secret (which we could
  reseed periodically).

To generate a random number:
* If get_random_int_hash is all-zero, seed it with fresh a half-sized
  SipHash key and the appropriate XOR constants.
* Generate three words of random_get_entropy(), jiffies, and current->pid.
  (This is arbitary seed material, copied from the current code.)
* Crank through that with (Half)SipHash-1-0.
* Crank through the random_int_secret with (Half)SipHash-1-0.
* Return v1 ^ v3.

Here are the reasons:
* The first step is just paranoia, but SipHash's security promise depends
  on starting with an asymmetric state, we want unique per-CPU states,
  and it's a one-time cost.
* When the input words are themselves secret, there's no security
  advantage, and almost no speed advantage, to doing two rounds for one
  input word versus two words with one round each.  Thus, SipHash-1.
* The above is not exactly true, due to the before+after XOR pattern
  that SipHash uses, but I think it's true anyway.
* Likewise, there's no benefit to unkeyed finalization rounds over keyed
  ones.  That's why I just enlarged the global secret.
* The per-call seed material is hashed first on general principles,
  because that's the novel part that might have fresh entropy.
* To the extent the initial state is secret, the rounds processing the
  global secret are 4 finalization rounds for the initial state and
  the per-call entropy.
* The final word(s) of the global secret might be vulnerable to analysis,
  due to incomplete mixing, but since the global secret is always hashed
  in the same order, and larger that the desired security level, the
  initial words should be secure.
* By carrying forward the full internal state, we ensure that repeated
  calls return different results, and to the extent that the per-call
  seed material has entropy, it's preserved.
* The final return is all that's needed, since the last steps in the 
  SipRound are "v1 ^= v2" and "v3 ^= v0".  It's no security loss,
  and a very minor speedup.
* Also, this avoids directly "exposing" the final XOR with the last
  word of the global secret (which is made to v0).

If I'm allowed to use full SipHash, some shortcuts can be taken,
but I believe the above would be secure with HalfSipHash.

If additional performance is required, I'd consider shrinking the
global secret to 192 bits on 32-bit machines but I want more than
128 bits of ey material, and enough rounds to be equivalent to 4
finalization rounds.

^ permalink raw reply

* Re: [PATCH v7 1/6] siphash: add cryptographically secure PRF
From: Stephen Hemminger @ 2016-12-22  1:40 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Netdev, kernel-hardening, LKML, linux-crypto, David Laight,
	Ted Tso, Hannes Frederic Sowa, edumazet, Linus Torvalds,
	Eric Biggers, Tom Herbert, ak, davem, luto,
	Jean-Philippe Aumasson, Eric Dumazet
In-Reply-To: <20161221230216.25341-2-Jason@zx2c4.com>

On Thu, 22 Dec 2016 00:02:11 +0100
"Jason A. Donenfeld" <Jason@zx2c4.com> wrote:

> SipHash is a 64-bit keyed hash function that is actually a
> cryptographically secure PRF, like HMAC. Except SipHash is super fast,
> and is meant to be used as a hashtable keyed lookup function, or as a
> general PRF for short input use cases, such as sequence numbers or RNG
> chaining.
> 
> For the first usage:
> 
> There are a variety of attacks known as "hashtable poisoning" in which an
> attacker forms some data such that the hash of that data will be the
> same, and then preceeds to fill up all entries of a hashbucket. This is
> a realistic and well-known denial-of-service vector. Currently
> hashtables use jhash, which is fast but not secure, and some kind of
> rotating key scheme (or none at all, which isn't good). SipHash is meant
> as a replacement for jhash in these cases.
> 
> There are a modicum of places in the kernel that are vulnerable to
> hashtable poisoning attacks, either via userspace vectors or network
> vectors, and there's not a reliable mechanism inside the kernel at the
> moment to fix it. The first step toward fixing these issues is actually
> getting a secure primitive into the kernel for developers to use. Then
> we can, bit by bit, port things over to it as deemed appropriate.
> 
> While SipHash is extremely fast for a cryptographically secure function,
> it is likely a bit slower than the insecure jhash, and so replacements
> will be evaluated on a case-by-case basis based on whether or not the
> difference in speed is negligible and whether or not the current jhash usage
> poses a real security risk.
> 
> For the second usage:
> 
> A few places in the kernel are using MD5 or SHA1 for creating secure
> sequence numbers, syn cookies, port numbers, or fast random numbers.
> SipHash is a faster and more fitting, and more secure replacement for MD5
> in those situations. Replacing MD5 and SHA1 with SipHash for these uses is
> obvious and straight-forward, and so is submitted along with this patch
> series. There shouldn't be much of a debate over its efficacy.
> 
> Dozens of languages are already using this internally for their hash
> tables and PRFs. Some of the BSDs already use this in their kernels.
> SipHash is a widely known high-speed solution to a widely known set of
> problems, and it's time we catch-up.
> 
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Eric Biggers <ebiggers3@gmail.com>
> Cc: David Laight <David.Laight@aculab.com>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>

The networking tree (net-next) which is where you are submitting to is technically
closed right now.

^ permalink raw reply

* Re: [PATCH v7 1/6] siphash: add cryptographically secure PRF
From: Jason A. Donenfeld @ 2016-12-22  1:42 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Netdev, kernel-hardening, LKML, Linux Crypto Mailing List,
	David Laight, Ted Tso, Hannes Frederic Sowa, Eric Dumazet,
	Linus Torvalds, Eric Biggers, Tom Herbert, Andi Kleen,
	David Miller, Andy Lutomirski, Jean-Philippe Aumasson,
	Eric Dumazet

On Thu, Dec 22, 2016 at 2:40 AM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> The networking tree (net-next) which is where you are submitting to is technically
> closed right now.

That's okay. At some point in the future it will be open. By then v83
of this patch set will be shiny and done, just waiting for the merge
window to open. There's a lot to discuss with this, so getting the
feedback early is beneficial.

Jason

^ permalink raw reply

* Re: HalfSipHash Acceptable Usage
From: Andy Lutomirski @ 2016-12-22  1:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: George Spelvin, Jason A. Donenfeld, Andi Kleen, David Miller,
	David Laight, Daniel J . Bernstein, Eric Biggers, Eric Dumazet,
	Hannes Frederic Sowa, Jean-Philippe Aumasson,
	kernel-hardening@lists.openwall.com, Linux Crypto Mailing List,
	Linux Kernel Mailing List, Network Development, Tom Herbert,
	Theodore Ts'o, Vegard Nossum
In-Reply-To: <CA+55aFy8fNOxw3bnwkX1S46jKnW6i26mueaiuOsScyN3kFJp+A@mail.gmail.com>

On Wed, Dec 21, 2016 at 9:25 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Dec 21, 2016 at 7:55 AM, George Spelvin
> <linux@sciencehorizons.net> wrote:
>>
>> How much does kernel_fpu_begin()/kernel_fpu_end() cost?
>
> It's now better than it used to be, but it's absolutely disastrous
> still. We're talking easily many hundreds of cycles. Under some loads,
> thousands.
>
> And I warn you already: it will _benchmark_ a hell of a lot better
> than it will work in reality. In benchmarks, you'll hit all the
> optimizations ("oh, I've already saved away all the FP registers, no
> need to do it again").
>
> In contrast, in reality, especially with things like "do it once or
> twice per incoming packet", you'll easily hit the absolute worst
> cases, where not only does it take a few hundred cycles to save the FP
> state, you'll then return to user space in between packets, which
> triggers the slow-path return code and reloads the FP state, which is
> another few hundred cycles plus.

Hah, you're thinking that the x86 code works the way that Rik and I
want it to work, and you just made my day. :)  What actually happens
is that the state is saved in kernel_fpu_begin() and restored in
kernel_fpu_end(), and it'll take a few hundred cycles best case.  If
you do it a bunch of times in a loop, you *might* trigger a CPU
optimization that notices that the state being saved is the same state
that was just restored, but you're still going to pay the full restore
code each round trip no matter what.

The code is much clearer in 4.10 kernels now that I deleted the unused
"lazy" branches.

>
> Similarly, in benchmarks you'll hit the "modern CPU's power on the AVX
> unit and keep it powered up for a while afterwards", while in real
> life you would quite easily hit the "oh, AVX is powered down because
> we were idle, now it powers up at half speed which is another latency
> hit _and_ the AVX unit won't run full out anyway".

I *think* that was mostly fixed in Broadwell or thereabouts (in terms
of latency -- throughput and power consumption still suffers).

^ permalink raw reply

* Re: [PATCH v7 3/6] random: use SipHash in place of MD5
From: Hannes Frederic Sowa @ 2016-12-22  2:07 UTC (permalink / raw)
  To: Andy Lutomirski, Jason A. Donenfeld
  Cc: Netdev, kernel-hardening@lists.openwall.com, LKML,
	Linux Crypto Mailing List, David Laight, Ted Tso, Eric Dumazet,
	Linus Torvalds, Eric Biggers, Tom Herbert, Andi Kleen,
	David S. Miller, Jean-Philippe Aumasson
In-Reply-To: <CALCETrVttVoZMvCYZcrAqM1c=YQP_nCfdfO1MsrSHjvjTFxH+A@mail.gmail.com>

On 22.12.2016 00:42, Andy Lutomirski wrote:
> On Wed, Dec 21, 2016 at 3:02 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>>  unsigned int get_random_int(void)
>>  {
>> -       __u32 *hash;
>> -       unsigned int ret;
>> -
>> -       if (arch_get_random_int(&ret))
>> -               return ret;
>> -
>> -       hash = get_cpu_var(get_random_int_hash);
>> -
>> -       hash[0] += current->pid + jiffies + random_get_entropy();
>> -       md5_transform(hash, random_int_secret);
>> -       ret = hash[0];
>> -       put_cpu_var(get_random_int_hash);
>> -
>> -       return ret;
>> +       unsigned int arch_result;
>> +       u64 result;
>> +       struct random_int_secret *secret;
>> +
>> +       if (arch_get_random_int(&arch_result))
>> +               return arch_result;
>> +
>> +       secret = get_random_int_secret();
>> +       result = siphash_3u64(secret->chaining, jiffies,
>> +                             (u64)random_get_entropy() + current->pid,
>> +                             secret->secret);
>> +       secret->chaining += result;
>> +       put_cpu_var(secret);
>> +       return result;
>>  }
>>  EXPORT_SYMBOL(get_random_int);
> 
> Hmm.  I haven't tried to prove anything for real.  But here goes (in
> the random oracle model):
> 
> Suppose I'm an attacker and I don't know the secret or the chaining
> value.  Then, regardless of what the entropy is, I can't predict the
> numbers.
> 
> Now suppose I do know the secret and the chaining value due to some
> leak.  If I want to deduce prior outputs, I think I'm stuck: I'd need
> to find a value "result" such that prev_chaining + result = chaining
> and result = H(prev_chaining, ..., secret);.  I don't think this can
> be done efficiently in the random oracle model regardless of what the
> "..." is.
> 
> But, if I know the secret and chaining value, I can predict the next
> output assuming I can guess the entropy.  What's worse is that, even
> if I can't guess the entropy, if I *observe* the next output then I
> can calculate the next chaining value.
> 
> So this is probably good enough, and making it better is hard.  Changing it to:
> 
> u64 entropy = (u64)random_get_entropy() + current->pid;
> result = siphash(..., entropy, ...);
> secret->chaining += result + entropy;
> 
> would reduce this problem by forcing an attacker to brute-force the
> entropy on each iteration, which is probably an improvement.
> 
> To fully fix it, something like "catastrophic reseeding" would be
> needed, but that's hard to get right.

I wonder if Ted's proposal was analyzed further in terms of performance
if get_random_int should provide cprng alike properties?

For reference: https://lkml.org/lkml/2016/12/14/351

The proposal made sense to me and would completely solve the above
mentioned problem on the cost of repeatedly reseeding from the crng.

Bye,
Hannes

^ permalink raw reply

* George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
From: Andy Lutomirski @ 2016-12-22  2:07 UTC (permalink / raw)
  To: George Spelvin
  Cc: Ted Ts'o, Andi Kleen, David S. Miller, David Laight,
	D. J. Bernstein, Eric Biggers, Eric Dumazet, Hannes Frederic Sowa,
	Jason A. Donenfeld, Jean-Philippe Aumasson,
	kernel-hardening@lists.openwall.com, Linux Crypto Mailing List,
	linux-kernel@vger.kernel.org, Network Development, Tom Herbert,
	Linus Torvalds, Vegard Nossum

On Wed, Dec 21, 2016 at 5:13 PM, George Spelvin
<linux@sciencehorizons.net> wrote:
> As a separate message, to disentangle the threads, I'd like to
> talk about get_random_long().
>
> After some thinking, I still like the "state-preserving" construct
> that's equivalent to the current MD5 code.  Yes, we could just do
> siphash(current_cpu || per_cpu_counter, global_key), but it's nice to
> preserve a bit more.
>
> It requires library support from the SipHash code to return the full
> SipHash state, but I hope that's a fair thing to ask for.

I don't even think it needs that.  This is just adding a
non-destructive final operation, right?

>
> Here's my current straw man design for comment.  It's very similar to
> the current MD5-based design, but feeds all the seed material in the
> "correct" way, as opposed to Xring directly into the MD5 state.
>
> * Each CPU has a (Half)SipHash state vector,
>   "unsigned long get_random_int_hash[4]".  Unlike the current
>   MD5 code, we take care to initialize it to an asymmetric state.
>
> * There's a global 256-bit random_int_secret (which we could
>   reseed periodically).
>
> To generate a random number:
> * If get_random_int_hash is all-zero, seed it with fresh a half-sized
>   SipHash key and the appropriate XOR constants.
> * Generate three words of random_get_entropy(), jiffies, and current->pid.
>   (This is arbitary seed material, copied from the current code.)
> * Crank through that with (Half)SipHash-1-0.
> * Crank through the random_int_secret with (Half)SipHash-1-0.
> * Return v1 ^ v3.

Just to clarify, if we replace SipHash with a black box, I think this
effectively means, where "entropy" is random_get_entropy() || jiffies
|| current->pid:

The first call returns H(random seed || entropy_0 || secret).  The
second call returns H(random seed || entropy_0 || secret || entropy_1
|| secret).  Etc.

If not, then I have a fairly strong preference to keep whatever
construction we come up with consistent with something that could
actually happen with invocations of unmodified SipHash -- then all the
security analysis on SipHash goes through.

Anyway, I have mixed thoughts about the construction.  It manages to
have a wide state at essentially no cost, which buys us quite a bit of
work factor to break it.  Even with full knowledge of the state, an
output doesn't reveal the entropy except to the extent that it can be
brute-force (this is just whatever the appropriate extended version of
first preimage resistance gives us).  The one thing I don't like is
that I don't see how to prove that you can't run it backwards if you
manage to acquire a memory dump.  In fact, I that that there exist, at
least in theory, hash functions that are secure in the random oracle
model but that *can* be run backwards given the full state.  From
memory, SHA-3 has exactly that property, and it would be a bit sad for
a CSPRNG to be reversible.

We could also periodically mix in a big (128-bit?) chunk of fresh
urandom output to keep the bad guys guessing.

(P.S.  This kind of resembles the duplex sponge construction.  If
hardware SHA-3 ever shows up, a duplex sponge RNG might nice indeed.)

^ permalink raw reply

* Re: [PATCH v7 3/6] random: use SipHash in place of MD5
From: Andy Lutomirski @ 2016-12-22  2:09 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Jason A. Donenfeld, Netdev, kernel-hardening@lists.openwall.com,
	LKML, Linux Crypto Mailing List, David Laight, Ted Tso,
	Eric Dumazet, Linus Torvalds, Eric Biggers, Tom Herbert,
	Andi Kleen, David S. Miller, Jean-Philippe Aumasson
In-Reply-To: <17bd0c70-d2c1-165b-f5b2-252dfca404e8@stressinduktion.org>

On Wed, Dec 21, 2016 at 6:07 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> On 22.12.2016 00:42, Andy Lutomirski wrote:
>> On Wed, Dec 21, 2016 at 3:02 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>>>  unsigned int get_random_int(void)
>>>  {
>>> -       __u32 *hash;
>>> -       unsigned int ret;
>>> -
>>> -       if (arch_get_random_int(&ret))
>>> -               return ret;
>>> -
>>> -       hash = get_cpu_var(get_random_int_hash);
>>> -
>>> -       hash[0] += current->pid + jiffies + random_get_entropy();
>>> -       md5_transform(hash, random_int_secret);
>>> -       ret = hash[0];
>>> -       put_cpu_var(get_random_int_hash);
>>> -
>>> -       return ret;
>>> +       unsigned int arch_result;
>>> +       u64 result;
>>> +       struct random_int_secret *secret;
>>> +
>>> +       if (arch_get_random_int(&arch_result))
>>> +               return arch_result;
>>> +
>>> +       secret = get_random_int_secret();
>>> +       result = siphash_3u64(secret->chaining, jiffies,
>>> +                             (u64)random_get_entropy() + current->pid,
>>> +                             secret->secret);
>>> +       secret->chaining += result;
>>> +       put_cpu_var(secret);
>>> +       return result;
>>>  }
>>>  EXPORT_SYMBOL(get_random_int);
>>
>> Hmm.  I haven't tried to prove anything for real.  But here goes (in
>> the random oracle model):
>>
>> Suppose I'm an attacker and I don't know the secret or the chaining
>> value.  Then, regardless of what the entropy is, I can't predict the
>> numbers.
>>
>> Now suppose I do know the secret and the chaining value due to some
>> leak.  If I want to deduce prior outputs, I think I'm stuck: I'd need
>> to find a value "result" such that prev_chaining + result = chaining
>> and result = H(prev_chaining, ..., secret);.  I don't think this can
>> be done efficiently in the random oracle model regardless of what the
>> "..." is.
>>
>> But, if I know the secret and chaining value, I can predict the next
>> output assuming I can guess the entropy.  What's worse is that, even
>> if I can't guess the entropy, if I *observe* the next output then I
>> can calculate the next chaining value.
>>
>> So this is probably good enough, and making it better is hard.  Changing it to:
>>
>> u64 entropy = (u64)random_get_entropy() + current->pid;
>> result = siphash(..., entropy, ...);
>> secret->chaining += result + entropy;
>>
>> would reduce this problem by forcing an attacker to brute-force the
>> entropy on each iteration, which is probably an improvement.
>>
>> To fully fix it, something like "catastrophic reseeding" would be
>> needed, but that's hard to get right.
>
> I wonder if Ted's proposal was analyzed further in terms of performance
> if get_random_int should provide cprng alike properties?
>
> For reference: https://lkml.org/lkml/2016/12/14/351
>
> The proposal made sense to me and would completely solve the above
> mentioned problem on the cost of repeatedly reseeding from the crng.
>

Unless I've misunderstood it, Ted's proposal causes get_random_int()
to return bytes straight from urandom (effectively), which should make
it very strong.  And if urandom is competitively fast now, I don't see
the problem.  ChaCha20 is designed for speed, after all.

^ permalink raw reply

* Re: [PATCH v7 3/6] random: use SipHash in place of MD5
From: Jason A. Donenfeld @ 2016-12-22  2:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Netdev, kernel-hardening@lists.openwall.com, LKML,
	Linux Crypto Mailing List, David Laight, Ted Tso,
	Hannes Frederic Sowa, Eric Dumazet, Linus Torvalds, Eric Biggers,
	Tom Herbert, Andi Kleen, David S. Miller, Jean-Philippe Aumasson
In-Reply-To: <CALCETrVttVoZMvCYZcrAqM1c=YQP_nCfdfO1MsrSHjvjTFxH+A@mail.gmail.com>

Hi Andy,

On Thu, Dec 22, 2016 at 12:42 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> So this is probably good enough, and making it better is hard.  Changing it to:
>
> u64 entropy = (u64)random_get_entropy() + current->pid;
> result = siphash(..., entropy, ...);
> secret->chaining += result + entropy;
>
> would reduce this problem by forcing an attacker to brute-force the
> entropy on each iteration, which is probably an improvement.

Ahh, so that's the reasoning behind a similar suggestion of yours in a
previous email. Makes sense to me. I'll include this in the next merge
if we don't come up with a different idea before then. Your reasoning
seems good for it.

Part of what makes this process a bit goofy is that it's not all
together clear what the design goals are. Right now we're going for
"not worse than before", which we've nearly achieved. How good of an
RNG do we want? I'm willing to examine and analyze the security and
performance of all constructions we can come up with. One thing I
don't want to do, however, is start tweaking the primitives themselves
in ways not endorsed by the designers. So, I believe that precludes
things like carrying over SipHash's internal state (like what was done
with MD5), because there hasn't been a formal security analysis of
this like there has with other uses of SipHash. I also don't want to
change any internals of how SipHash actually works. I mention that
because of some of the suggestions on other threads, which make me
rather uneasy.

So with that said, while writing this reply to you, I was
simultaneously reading some other crypto code and was reminded that
there's a variant of SipHash which outputs an additional 64-bits; it's
part of the siphash reference code, which they call the "128-bit
mode". It has the benefit that we can return 64-bits to the caller and
save 64-bits for the chaining key. That way there's no correlation
between the returned secret and the chaining key, which I think would
completely alleviate all of your concerns, and simplify the analysis a
bit.

Here's what it looks like:
https://git.zx2c4.com/linux-dev/commit/?h=siphash&id=46fbe5b408e66b2d16b4447860f8083480e1c08d

The downside is that it takes 4 extra Sip rounds. This puts the
performance still better than MD5, though, and likely still better
than the other batched entropy solution. We could optimize this, I
suppose, by giving it only two parameters -- chaining,
jiffies+entropy+pid -- instead of the current three -- chaining,
jiffies, entropy+pid -- which would then shave off 2 Sip rounds. But I
liked the idea of having a bit more spread in the entropy input field.

Anyway, with this in mind, we now have three possibilities:

1. result = siphash(chaining, entropy, key); chaining += result + entropy
2. result = siphash_extra_output(chaining, entropy, key, &chaining);
3. Ted's batched entropy idea using chacha20

The more I think about this, the more I suspect that we should just
use chacha20. It will still be faster than MD5. I don't like the
non-determinism of it (some processes will start slower than others,
if the batched entropy has run out and ASLR demands more), but I guess
I can live with that. But, most importantly, it greatly simplifies
both the security analysis and what we can promise to callers about
the function. Right now in the comment documentation, we're coy with
callers about the security of the RNG. If we moved to a known
construction like chacha20/get_random_bytes_batched, then we could
just be straight up with a promise that the numbers it returns are
high quality.

Thoughts on 2 and 3, and on 1 vs 2 vs 3?

Jason

^ permalink raw reply

* Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
From: Jason A. Donenfeld @ 2016-12-22  2:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: George Spelvin, Ted Ts'o, Andi Kleen, David S. Miller,
	David Laight, D. J. Bernstein, Eric Biggers, Eric Dumazet,
	Hannes Frederic Sowa, Jean-Philippe Aumasson,
	kernel-hardening@lists.openwall.com, Linux Crypto Mailing List,
	linux-kernel@vger.kernel.org, Network Development, Tom Herbert,
	Linus Torvalds, Vegard Nossum

> On Wed, Dec 21, 2016 at 5:13 PM, George Spelvin
>> After some thinking, I still like the "state-preserving" construct
>> that's equivalent to the current MD5 code.  Yes, we could just do
>> siphash(current_cpu || per_cpu_counter, global_key), but it's nice to
>> preserve a bit more.
>>
>> It requires library support from the SipHash code to return the full
>> SipHash state, but I hope that's a fair thing to ask for.

This is not a good idea. If I understand correctly, the idea here is
to just keep around SipHash's internal state variables, and chain them
over to the next call, sort of like how md5_transform with the current
code works on the same scratch space. There has been no security
analysis in the literature on this use of the primitive, and I have no
confidence that this is a secure use of the function. Unless somebody
can point me toward a paper I missed or a comment from a real
cryptographer about the specifics of SipHash, I think I'm right to
admonish against this dangerous road.

Let's talk about constructions. And let's only decide on a
construction that we're actually equipped to analyze. Let's definitely
not talk about making our own primitives, or retrofitting nice
primitive's internals into our own Frankenstein.

Alternatively, if I'm wrong, please send me an eprint/arXiv link to a
paper that discusses this use of SipHash.

^ permalink raw reply

* Re: [PATCH v7 3/6] random: use SipHash in place of MD5
From: Jason A. Donenfeld @ 2016-12-22  2:49 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Andy Lutomirski, Netdev, kernel-hardening@lists.openwall.com,
	LKML, Linux Crypto Mailing List, David Laight, Ted Tso,
	Eric Dumazet, Linus Torvalds, Eric Biggers, Tom Herbert,
	Andi Kleen, David S. Miller, Jean-Philippe Aumasson
In-Reply-To: <17bd0c70-d2c1-165b-f5b2-252dfca404e8@stressinduktion.org>

Hi Andy & Hannes,

On Thu, Dec 22, 2016 at 3:07 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> I wonder if Ted's proposal was analyzed further in terms of performance
> if get_random_int should provide cprng alike properties?
>
> For reference: https://lkml.org/lkml/2016/12/14/351
>
> The proposal made sense to me and would completely solve the above
> mentioned problem on the cost of repeatedly reseeding from the crng.

On Thu, Dec 22, 2016 at 3:09 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> Unless I've misunderstood it, Ted's proposal causes get_random_int()
> to return bytes straight from urandom (effectively), which should make
> it very strong.  And if urandom is competitively fast now, I don't see
> the problem.  ChaCha20 is designed for speed, after all.

Funny -- while you guys were sending this back & forth, I was writing
my reply to Andy which essentially arrives at the same conclusion.
Given that we're all arriving to the same thing, and that Ted shot in
this direction long before we all did, I'm leaning toward abandoning
SipHash for the de-MD5-ification of get_random_int/long, and working
on polishing Ted's idea into something shiny for this patchset.

I did have two objections to this. The first was that my SipHash
construction is faster. But in any case, they're both faster than the
current MD5, so it's just extra rice. The second, and the more
important one, was that batching entropy up like this means that 32
calls will be really fast, and then the 33rd will be slow, since it
has to do a whole ChaCha round, because get_random_bytes must be
called to refill the batch. Since get_random_long is called for every
process startup, I didn't really like there being inconsistent
performance on process startup. And I'm pretty sure that one ChaCha
whole block is slower than computing MD5, even though it lasts 32
times as long, though I need to measure this. But maybe that's dumb in
the end? Are these concerns that should point us toward the
determinism (and speed) of SipHash? Are these concerns that don't
matter and so we should roll with the simplicity of reusing ChaCha?

Jason

^ permalink raw reply

* Re: [PATCH v7 3/6] random: use SipHash in place of MD5
From: Jason A. Donenfeld @ 2016-12-22  3:12 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Andy Lutomirski, Netdev, kernel-hardening@lists.openwall.com,
	LKML, Linux Crypto Mailing List, David Laight, Ted Tso,
	Eric Dumazet, Linus Torvalds, Eric Biggers, Tom Herbert,
	Andi Kleen, David S. Miller, Jean-Philippe Aumasson
In-Reply-To: <CAHmME9phg=GuhEUaMxxv_=RexffPDqrOEhmaKffy_ZSt7bfC7g@mail.gmail.com>

On Thu, Dec 22, 2016 at 3:49 AM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> I did have two objections to this. The first was that my SipHash
> construction is faster. But in any case, they're both faster than the
> current MD5, so it's just extra rice. The second, and the more
> important one, was that batching entropy up like this means that 32
> calls will be really fast, and then the 33rd will be slow, since it
> has to do a whole ChaCha round, because get_random_bytes must be
> called to refill the batch. Since get_random_long is called for every
> process startup, I didn't really like there being inconsistent
> performance on process startup. And I'm pretty sure that one ChaCha
> whole block is slower than computing MD5, even though it lasts 32
> times as long, though I need to measure this. But maybe that's dumb in
> the end? Are these concerns that should point us toward the
> determinism (and speed) of SipHash? Are these concerns that don't
> matter and so we should roll with the simplicity of reusing ChaCha?

I ran some measurements in order to quantify what I'm talking about.
Repeatedly running md5_transform is about 2.3 times faster than
repeatedly running extract_crng. What does this mean?

One call to extract_crng gives us 32 times as many longs as one call
to md5_transform. This means that spread over 32 process creations,
chacha will be 13.9 times faster. However, every 32nd process will
take 2.3 times as long to generate its ASLR value as it would with the
old md5_transform code.

Personally, I don't think that 2.3 is a big deal. And I really like
how much this simplifies the analysis.
But if it's a big deal to you, then we can continue to discuss my
SipHash construction, which gives faster and more consistent
performance, at the cost of a more complicated and probably less
impressive security analysis.

Jason

^ permalink raw reply

* Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage
From: George Spelvin @ 2016-12-22  3:55 UTC (permalink / raw)
  To: ak, davem, David.Laight, djb, ebiggers3, eric.dumazet, hannes,
	Jason, jeanphilippe.aumasson, kernel-hardening, linux-crypto,
	linux-kernel, linux, luto, netdev, tom, torvalds, tytso,
	vegard.nossum
In-Reply-To: <CAHmME9pww5Q0Wy9MtkO7PAx2Tstfp=6Og3qZLZ=Rh8NaFo0Gog@mail.gmail.com>

> Plus the benchmark was bogus anyway, and when I built a more specific
> harness -- actually comparing the TCP sequence number functions --
> SipHash was faster than MD5, even on register starved x86. So I think
> we're fine and this chapter of the discussion can come to a close, in
> order to move on to more interesting things.

Do we have to go through this?  No, the benchmark was *not* bogus.

Here's myresults from *your* benchmark.  I can't reboot some of my test
machines, so I took net/core/secure_seq.c, lib/siphash.c, lib/md5.c and
include/linux/siphash.h straight out of your test tree.

Then I replaced the kernel #includes with the necessary typedefs
and #defines to make it compile in user-space.  (Voluminous but
straightforward.)  E.g.

#define __aligned(x) __attribute__((__aligned__(x)))
#define ____cacheline_aligned __aligned(64)
#define CONFIG_INET 1
#define IS_ENABLED(x) 1
#define ktime_get_real_ns() 0
#define sysctl_tcp_timestamps 0

... etc.

Then I modified your benchmark code into the appended code.  The
differences are:
* I didn't iterate 100K times, I timed the functions *once*.
* I saved the times in a buffer and printed them all at the end
  so printf() wouldn't pollute the caches.
* Before every even-numbered iteration, I flushed the I-cache
  of everything from _init to _fini (i.e. all the non-library code).
  This cold-cache case is what is going to happen in the kernel.

In the results below, note that I did *not* re-flush between phases
of the test.  The effects of cacheing is clearly apparent in the tcpv4
results, where the tcpv6 code loaded the cache.

You can also see that the SipHash code benefits more from cacheing when
entered with a cold cache, as it iterates over the input words, while
the MD5 code is one big unrolled blob.

Order of computation is down the columns first, across second.

The P4 results were:
tcpv6 md5 cold:		4084	3488	3584	3584	3568
tcpv4 md5 cold:		1052	 996	 996	1060	 996
tcpv6 siphash cold:	4080	3296	3312	3296	3312
tcpv4 siphash cold:	2968	2748	2972	2716	2716
tcpv6 md5 hot:		 900	 712	 712	712	 712
tcpv4 md5 hot:		 632	 672	 672	672	 672
tcpv6 siphash hot:	2484	2292	2340	2340	2340
tcpv4 siphash hot:	1660	1560	1564	2340	1564

SipHash actually wins slightly in the cold-cache case, because
it iterates more.  In the hot-cache case, it loses horribly.

Core 2 duo:
tcpv6 md5 cold:		3396	2868	2964	3012	2832
tcpv4 md5 cold:		1368	1044	1320	1332	1308
tcpv6 siphash cold:	2940	2952	2916	2448	2604
tcpv4 siphash cold:	3192	2988	3576	3504	3624
tcpv6 md5 hot:		1116	1032	 996	1008	1008
tcpv4 md5 hot:		 936	 936	 936	 936	 936
tcpv6 siphash hot:	1200	1236	1236	1188	1188
tcpv4 siphash hot:	 936	 804	 804	 804	 804

Pretty much a tie, honestly.

Ivy Bridge:
tcpv6 md5 cold:		6086	6136	6962	6358	6060
tcpv4 md5 cold:		 816	 732	1046	1054	1012
tcpv6 siphash cold:	3756	1886	2152	2390	2566
tcpv4 siphash cold:	3264	2108	3026	3120	3526
tcpv6 md5 hot:		1062	 808	 824	 824	 832
tcpv4 md5 hot:		 730	 730	 740	 748	 748
tcpv6 siphash hot:	 960	 952	 936	1112	 926
tcpv4 siphash hot:	 638	 544	 562	 552	 560

Modern processors *hate* cold caches.  But notice how md5 is *faster*
than SipHash on hot-cache IPv6.

Ivy Bridge, -m64:
tcpv6 md5 cold:		4680	3672	3956	3616	3525
tcpv4 md5 cold:		1066	1416	1179	1179	1134
tcpv6 siphash cold:	 940	1258	1995	1609	2255
tcpv4 siphash cold:	1440	1269	1292	1870	1621
tcpv6 md5 hot:		1372	1111	1122	1088	1088
tcpv4 md5 hot:		 997	 997	 997	 997	 998
tcpv6 siphash hot:	 340	 340	 340	 352	 340
tcpv4 siphash hot:	 227	 238	 238	 238	 238

Of course, when you compile -m64, SipHash is unbeatable.


Here's the modified benchmark() code.  The entire package is
a bit voluminous for the mailing list, but anyone is welcome to it.

static void clflush(void)
{
	extern char const _init, _fini;
	char const *p = &_init;

	while (p < &_fini) {
		asm("clflush %0" : : "m" (*p));
		p += 64;
	}
}

typedef uint32_t cycles_t;
static cycles_t get_cycles(void)
{
	uint32_t eax, edx;
	asm volatile("rdtsc" : "=a" (eax), "=d" (edx));
	return eax;
}

static int benchmark(void)
{
	cycles_t start, finish;
	int i;
	u32 seq_number = 0;
	__be32 saddr6[4] = { 1, 4, 182, 393 }, daddr6[4] = { 9192, 18288, 2222222, 0xffffff10 };
	__be32 saddr4 = 28888, daddr4 = 182112;
	__be16 sport = 22, dport = 41992;
	u32 tsoff;
	cycles_t result[4];

	printf("seq num benchmark\n");

	for (i = 0; i < 10; i++) {
		if ((i & 1) == 0)
			clflush();

		start = get_cycles();
		seq_number += secure_tcpv6_sequence_number_md5(saddr6, daddr6, sport, dport, &tsoff);
		finish = get_cycles();
		result[0] = finish - start;

		start = get_cycles();
		seq_number += secure_tcp_sequence_number_md5(saddr4, daddr4, sport, dport, &tsoff);
		finish = get_cycles();
		result[1] = finish - start;

		start = get_cycles();
		seq_number += secure_tcpv6_sequence_number(saddr6, daddr6, sport, dport, &tsoff);
		finish = get_cycles();
		result[2] = finish - start;

		start = get_cycles();
		seq_number += secure_tcp_sequence_number(saddr4, daddr4, sport, dport, &tsoff);
		finish = get_cycles();
		result[3] = finish - start;

		printf("* Iteration %d results:\n", i);
		printf("secure_tcpv6_sequence_number_md5# cycles: %u\n", result[0]);
		printf("secure_tcp_sequence_number_md5# cycles: %u\n", result[1]);
		printf("secure_tcpv6_sequence_number_siphash# cycles: %u\n", result[2]);
		printf("secure_tcp_sequence_number_siphash# cycles: %u\n", result[3]);
		printf("benchmark result: %u\n", seq_number);
	}

	printf("benchmark result: %u\n", seq_number);
	return 0;
}
//device_initcall(benchmark);

int
main(void)
{
	memset(net_secret, 0xff, sizeof net_secret);
	memset(net_secret_md5, 0xff, sizeof net_secret_md5);
	return benchmark();
}

^ permalink raw reply

* Re: [kernel-hardening] Re: HalfSipHash Acceptable Usage
From: Jason A. Donenfeld @ 2016-12-22  4:40 UTC (permalink / raw)
  To: George Spelvin
  Cc: Andi Kleen, David Miller, David Laight, Daniel J . Bernstein,
	Eric Biggers, Eric Dumazet, Hannes Frederic Sowa,
	Jean-Philippe Aumasson, kernel-hardening,
	Linux Crypto Mailing List, LKML, Andy Lutomirski, Netdev,
	Tom Herbert, Linus Torvalds, Theodore Ts'o, Vegard Nossum
In-Reply-To: <20161222035549.10827.qmail@ns.sciencehorizons.net>

Hi George,

On Thu, Dec 22, 2016 at 4:55 AM, George Spelvin
<linux@sciencehorizons.net> wrote:
> Do we have to go through this?  No, the benchmark was *not* bogus.
> Then I replaced the kernel #includes with the necessary typedefs
> and #defines to make it compile in user-space.
> * I didn't iterate 100K times, I timed the functions *once*.
> * I saved the times in a buffer and printed them all at the end
>   so printf() wouldn't pollute the caches.
> * Before every even-numbered iteration, I flushed the I-cache
>   of everything from _init to _fini (i.e. all the non-library code).
>   This cold-cache case is what is going to happen in the kernel.

Wow! Great. Thanks for the pointers on the right way to do this. Very
helpful, and enlightening results indeed. Think you could send me the
whole .c of what you finally came up with? I'd like to run this on
some other architectures; I've got a few odd boxes laying around here.

> The P4 results were:
> SipHash actually wins slightly in the cold-cache case, because
> it iterates more.  In the hot-cache case, it loses
> Core 2 duo:
> Pretty much a tie, honestly.
> Ivy Bridge:
> Modern processors *hate* cold caches.  But notice how md5 is *faster*
> than SipHash on hot-cache IPv6.
> Ivy Bridge, -m64:
> Of course, when you compile -m64, SipHash is unbeatable.

Okay, so I think these results are consistent with some of the
assessments from before -- that SipHash is really just fine as a
replacement for MD5. Not great on older 32-bit x86, but not too
horrible, and the performance improvements on every other architecture
and the security improvements everywhere are a net good.

> Here's the modified benchmark() code.  The entire package is
> a bit voluminous for the mailing list, but anyone is welcome to it.

Please do send! I'm sure I'll learn from reading it. Thanks again for
doing the hardwork of putting something proper together.

Thanks,
Jason

^ permalink raw reply

* Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
From: George Spelvin @ 2016-12-22  5:01 UTC (permalink / raw)
  To: linux, luto
  Cc: ak, davem, David.Laight, djb, ebiggers3, eric.dumazet, hannes,
	Jason, jeanphilippe.aumasson, kernel-hardening, linux-crypto,
	linux-kernel, netdev, tom, torvalds, tytso, vegard.nossum
In-Reply-To: <CALCETrVn1tWBQx-RCSqCQ2ZcB6hPdioaV52q8vY+Mz1fRKsUXA@mail.gmail.com>

Andy Lutomirski wrote:
> I don't even think it needs that.  This is just adding a
> non-destructive final operation, right?

It is, but the problem is that SipHash is intended for *small* inputs,
so the standard implementations aren't broken into init/update/final
functions.

There's just one big function that keeps the state variables in
registers and never stores them anywhere.

If we *had* init/update/final functions, then it would be trivial.

> Just to clarify, if we replace SipHash with a black box, I think this
> effectively means, where "entropy" is random_get_entropy() || jiffies
> || current->pid:

> The first call returns H(random seed || entropy_0 || secret).  The
> second call returns H(random seed || entropy_0 || secret || entropy_1
> || secret).  Etc.

Basically, yes.  I was skipping the padding byte and keying the
finalization rounds on the grounds of "can't hurt and might help",
but we could do it a more standard way.

> If not, then I have a fairly strong preference to keep whatever
> construction we come up with consistent with something that could
> actually happen with invocations of unmodified SipHash -- then all the
> security analysis on SipHash goes through.

Okay.  I don't think it makes a difference, but it's not a *big* waste
of time.  If we have finalization rounds, we can reduce the secret
to 128 bits.

If we include the padding byte, we can do one of two things:
1) Make the secret 184 bits, to fill up the final partial word as
   much as possible, or
2) Make the entropy 1 byte smaller and conceptually misalign the
   secret.  What we'd actually do is remove the last byte of
   the secret and include it in the entropy words, but that's
   just a rotation of the secret between storage and hashing.

Also, I assume you'd like SipHash-2-4, since you want to rely
on a security analysis.

(Regarding the padding byte, getting it right might be annoying
to do exactly.  All of the security analysis depends *only* on
its low 3 bits indicating how much of the final block is used.
As it says in the SipHash paper, they included 8 bits just because
it was easy.  But if you want it exact, it's just one more byte of
state.)

> The one thing I don't like is
> that I don't see how to prove that you can't run it backwards if you
> manage to acquire a memory dump.  In fact, I that that there exist, at
> least in theory, hash functions that are secure in the random oracle
> model but that *can* be run backwards given the full state.  From
> memory, SHA-3 has exactly that property, and it would be a bit sad for
> a CSPRNG to be reversible.

Er...  get_random_int() is specifically *not* designed to be resistant
to state capture, and I didn't try.  Remember, what it's used for
is ASLR, what we're worried about is somene learning the layouts
of still-running processes, and and if you get a memory dump, you have
the memory layout!

If you want anti-backtracking, though, it's easy to add.  What we
hash is:

entropy_0 || secret || output_0 || entropy_1 || secret || output_1 || ...

You mix the output word right back in to the (unfinalized) state after
generating it.  This is still equivalent to unmodified back-box SipHash,
you're just using a (conceptually independent) SipHash invocation to
produce some of its input.

Each output is produced by copying the state, padding & finalizing after the
secret.

In fact, to make our lives easier, let's define the secret to end with
a counter byte that happens to be equal to the padding byte.  The input
stream will be:

Previous output: 8 (or 4 for HalfSipHash) bytes
Entropy: 15 bytes (8 bytes timer, 4 bytes jiffies, 3 bytes pid)
Secret: 16 bytes
Counter: 1 byte
...repeat...

> We could also periodically mix in a big (128-bit?) chunk of fresh
> urandom output to keep the bad guys guessing.

Simpler and faster to just update the global master secret.
The state is per-CPU, so mixing in has to be repeated per CPU.

With these changes, I'm satisifed that it's secure, cheap, has a
sufficiently wide state size, *and* all standard SipHash analysis applies.

The only remaining issues are:
1) How many rounds, and
2) May we use HalfSipHash?

I'd *like* to persuade you that skipping the padding byte wouldn't
invalidate any security proofs, because it's true and would simplify
the code.  But if you want 100% stock, I'm willing to cater to that.

Ted, what do you think?

^ permalink raw reply

* Re: Re: [PATCH v7 3/6] random: use SipHash in place of MD5
From: Theodore Ts'o @ 2016-12-22  5:41 UTC (permalink / raw)
  To: kernel-hardening
  Cc: Hannes Frederic Sowa, Andy Lutomirski, Netdev, LKML,
	Linux Crypto Mailing List, David Laight, Eric Dumazet,
	Linus Torvalds, Eric Biggers, Tom Herbert, Andi Kleen,
	David S. Miller, Jean-Philippe Aumasson
In-Reply-To: <CAHmME9phg=GuhEUaMxxv_=RexffPDqrOEhmaKffy_ZSt7bfC7g@mail.gmail.com>

On Thu, Dec 22, 2016 at 03:49:39AM +0100, Jason A. Donenfeld wrote:
> 
> Funny -- while you guys were sending this back & forth, I was writing
> my reply to Andy which essentially arrives at the same conclusion.
> Given that we're all arriving to the same thing, and that Ted shot in
> this direction long before we all did, I'm leaning toward abandoning
> SipHash for the de-MD5-ification of get_random_int/long, and working
> on polishing Ted's idea into something shiny for this patchset.

here are my numbers comparing siphash (using the first three patches
of the v7 siphash patches) with my batched chacha20 implementation.
The results are taken by running get_random_* 10000 times, and then
dividing the numbers by 10000 to get the average number of cycles for
the call.  I compiled 32-bit and 64-bit kernels, and ran the results
using kvm:

                   siphash                        batched chacha20
         get_random_int  get_random_long   get_random_int   get_random_long   

32-bit    270                  278             114            146
64-bit     75                   75             106            186

> I did have two objections to this. The first was that my SipHash
> construction is faster.

Well, it's faster on everything except 32-bit x86.  :-P

> The second, and the more
> important one, was that batching entropy up like this means that 32
> calls will be really fast, and then the 33rd will be slow, since it
> has to do a whole ChaCha round, because get_random_bytes must be
> called to refill the batch.

... and this will take 2121 cycles on 64-bit x86, and 2315 cycles on a
32-bit x86.  Which on a 2.3 GHz processor, is just under a
microsecond.  As far as being inconsistent on process startup, I very
much doubt a microsecond is really going to be visible to the user.  :-)

The bottom line is that I think we're really "pixel peeping" at this
point --- which is what obsessed digital photographers will do when
debating the quality of a Canon vs Nikon DSLR by blowing up a photo by
a thousand times, and then trying to claim that this is visible to the
human eye.  Or people who obsessing over the frequency response curves
of TH-X00 headphones with Mahogony vs Purpleheart wood, when it's
likely that in a blind head-to-head comparison, most people wouldn't
be able to tell the difference....

I think the main argument for using the batched getrandom approach is
that it, I would argue, simpler than introducing siphash into the
picture.  On 64-bit platforms it is faster and more consistent, so
it's basically that versus complexity of having to adding siphash to
the things that people have to analyze when considering random number
security on Linux.   But it's a close call either way, I think.

	    	     	      	      - Ted

P.S.  My benchmarking code....

diff --git a/drivers/char/random.c b/drivers/char/random.c
index a51f0ff43f00..41860864b775 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1682,6 +1682,55 @@ static int rand_initialize(void)
 }
 early_initcall(rand_initialize);
 
+static unsigned int get_random_int_new(void);
+static unsigned long get_random_long_new(void);
+
+#define NUM_CYCLES 10000
+#define AVG(finish, start) ((unsigned int)(finish - start + NUM_CYCLES/2) / NUM_CYCLES)
+
+static int rand_benchmark(void)
+{
+	cycles_t start,finish;
+	int i, out;
+
+	pr_crit("random benchmark!!\n");
+	start = get_cycles();
+	for (i = 0; i < NUM_CYCLES; i++) {
+		get_random_int();}
+	finish = get_cycles();
+	pr_err("get_random_int # cycles: %u\n", AVG(finish, start));
+
+	start = get_cycles();
+	for (i = 0; i < NUM_CYCLES; i++) {
+		get_random_int_new();
+	}
+	finish = get_cycles();
+	pr_err("get_random_int_new (batched chacha20) # cycles: %u\n", AVG(finish, start));
+
+	start = get_cycles();
+	for (i = 0; i < NUM_CYCLES; i++) {
+		get_random_long();
+	}
+	finish = get_cycles();
+	pr_err("get_random_long # cycles: %u\n", AVG(finish, start));
+
+	start = get_cycles();
+	for (i = 0; i < NUM_CYCLES; i++) {
+		get_random_long_new();
+	}
+	finish = get_cycles();
+	pr_err("get_random_long_new (batched chacha20) # cycles: %u\n", AVG(finish, start));
+
+	start = get_cycles();
+	for (i = 0; i < NUM_CYCLES; i++) {
+		get_random_bytes(&out, sizeof(out));
+	}
+	finish = get_cycles();
+	pr_err("get_random_bytes # cycles: %u\n", AVG(finish, start));
+	return 0;
+}
+device_initcall(rand_benchmark);
+
 #ifdef CONFIG_BLOCK
 void rand_initialize_disk(struct gendisk *disk)
 {
@@ -2064,8 +2113,10 @@ unsigned int get_random_int(void)
 	unsigned int ret;
 	u64 *chaining;
 
+#if 0	// force slow path
 	if (arch_get_random_int(&ret))
 		return ret;
+#endif
 
 	chaining = &get_cpu_var(get_random_int_chaining);
 	ret = *chaining = siphash_3u64(*chaining, jiffies, random_get_entropy() +
@@ -2083,8 +2134,10 @@ unsigned long get_random_long(void)
 	unsigned long ret;
 	u64 *chaining;
 
+#if 0 // force slow path
 	if (arch_get_random_long(&ret))
 		return ret;
+#endif 
 
 	chaining = &get_cpu_var(get_random_int_chaining);
 	ret = *chaining = siphash_3u64(*chaining, jiffies, random_get_entropy() +
@@ -2094,6 +2147,47 @@ unsigned long get_random_long(void)
 }
 EXPORT_SYMBOL(get_random_long);
 
+struct random_buf {
+	__u8 buf[CHACHA20_BLOCK_SIZE];
+	int ptr;
+};
+
+static DEFINE_PER_CPU(struct random_buf, batched_entropy);
+
+static void get_batched_entropy(void *buf, int n)
+{
+	struct random_buf *p;
+
+	p = &get_cpu_var(batched_entropy);
+
+	if ((p->ptr == 0) ||
+	    (p->ptr + n >= CHACHA20_BLOCK_SIZE)) {
+		extract_crng(p->buf);
+		p->ptr = 0;
+	}
+	BUG_ON(n > CHACHA20_BLOCK_SIZE);
+	memcpy(buf, p->buf, n);
+	p->ptr += n;
+	put_cpu_var(batched_entropy);
+}
+
+static unsigned int get_random_int_new(void)
+{
+	unsigned int ret;
+
+	get_batched_entropy(&ret, sizeof(ret));
+	return ret;
+}
+
+static unsigned long get_random_long_new(void)
+{
+	unsigned long ret;
+
+	get_batched_entropy(&ret, sizeof(ret));
+	return ret;
+}
+
+
 /**
  * randomize_page - Generate a random, page aligned address
  * @start:	The smallest acceptable address the caller will take.

^ permalink raw reply related

* Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
From: Andy Lutomirski @ 2016-12-22  5:42 UTC (permalink / raw)
  To: George Spelvin
  Cc: Andrew Lutomirski, Andi Kleen, David S. Miller, David Laight,
	D. J. Bernstein, Eric Biggers, Eric Dumazet, Hannes Frederic Sowa,
	Jason A. Donenfeld, Jean-Philippe Aumasson,
	kernel-hardening@lists.openwall.com, Linux Crypto Mailing List,
	linux-kernel@vger.kernel.org, Network Development, Tom Herbert,
	Linus Torvalds, Ted Ts'o
In-Reply-To: <20161222050138.12011.qmail@ns.sciencehorizons.net>

On Wed, Dec 21, 2016 at 9:01 PM, George Spelvin
<linux@sciencehorizons.net> wrote:
> Andy Lutomirski wrote:
>> I don't even think it needs that.  This is just adding a
>> non-destructive final operation, right?
>
> It is, but the problem is that SipHash is intended for *small* inputs,
> so the standard implementations aren't broken into init/update/final
> functions.
>
> There's just one big function that keeps the state variables in
> registers and never stores them anywhere.
>
> If we *had* init/update/final functions, then it would be trivial.
>
>> Just to clarify, if we replace SipHash with a black box, I think this
>> effectively means, where "entropy" is random_get_entropy() || jiffies
>> || current->pid:
>
>> The first call returns H(random seed || entropy_0 || secret).  The
>> second call returns H(random seed || entropy_0 || secret || entropy_1
>> || secret).  Etc.
>
> Basically, yes.  I was skipping the padding byte and keying the
> finalization rounds on the grounds of "can't hurt and might help",
> but we could do it a more standard way.
>
>> If not, then I have a fairly strong preference to keep whatever
>> construction we come up with consistent with something that could
>> actually happen with invocations of unmodified SipHash -- then all the
>> security analysis on SipHash goes through.
>
> Okay.  I don't think it makes a difference, but it's not a *big* waste
> of time.  If we have finalization rounds, we can reduce the secret
> to 128 bits.
>
> If we include the padding byte, we can do one of two things:
> 1) Make the secret 184 bits, to fill up the final partial word as
>    much as possible, or
> 2) Make the entropy 1 byte smaller and conceptually misalign the
>    secret.  What we'd actually do is remove the last byte of
>    the secret and include it in the entropy words, but that's
>    just a rotation of the secret between storage and hashing.
>
> Also, I assume you'd like SipHash-2-4, since you want to rely
> on a security analysis.

I haven't looked, but I assume that the analysis at least thought
about reduced rounds, so maybe other variants are okay.

>> The one thing I don't like is
>> that I don't see how to prove that you can't run it backwards if you
>> manage to acquire a memory dump.  In fact, I that that there exist, at
>> least in theory, hash functions that are secure in the random oracle
>> model but that *can* be run backwards given the full state.  From
>> memory, SHA-3 has exactly that property, and it would be a bit sad for
>> a CSPRNG to be reversible.
>
> Er...  get_random_int() is specifically *not* designed to be resistant
> to state capture, and I didn't try.  Remember, what it's used for
> is ASLR, what we're worried about is somene learning the layouts
> of still-running processes, and and if you get a memory dump, you have
> the memory layout!

True, but it's called get_random_int(), and it seems like making it
stronger, especially if the performance cost is low to zero, is a good
thing.

>
> If you want anti-backtracking, though, it's easy to add.  What we
> hash is:
>
> entropy_0 || secret || output_0 || entropy_1 || secret || output_1 || ...
>
> You mix the output word right back in to the (unfinalized) state after
> generating it.  This is still equivalent to unmodified back-box SipHash,
> you're just using a (conceptually independent) SipHash invocation to
> produce some of its input.

Ah, cute.  This could probably be sped up by doing something like:

entropy_0 || secret || output_0 ^ entropy_1 || secret || ...

It's a little weak because the output is only 64 bits, so you could
plausibly backtrack it on a GPU or FPGA cluster or on an ASIC if the
old entropy is guessable.  I suspect there are sneaky ways around it
like using output_n-1 ^ output_n-2 or similar.  I'll sleep on it.

>
> The only remaining issues are:
> 1) How many rounds, and
> 2) May we use HalfSipHash?

I haven't looked closely enough to have a real opinion here.  I don't
know what the security margin is believed to be.

>
> I'd *like* to persuade you that skipping the padding byte wouldn't
> invalidate any security proofs, because it's true and would simplify
> the code.  But if you want 100% stock, I'm willing to cater to that.

I lean toward stock in the absence of a particularly good reason.  At
the very least I'd want to read that paper carefully.

>
> Ted, what do you think?



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: [kernel-hardening] Re: [PATCH v7 3/6] random: use SipHash in place of MD5
From: Jason A. Donenfeld @ 2016-12-22  6:03 UTC (permalink / raw)
  To: kernel-hardening, Theodore Ts'o, Hannes Frederic Sowa,
	Andy Lutomirski, Netdev, LKML, Linux Crypto Mailing List,
	David Laight, Eric Dumazet, Linus Torvalds, Eric Biggers,
	Tom Herbert, Andi Kleen, David S. Miller, Jean-Philippe Aumasson
In-Reply-To: <20161222054125.lzxhd6ctovm3wk4p@thunk.org>

Hi Ted,

On Thu, Dec 22, 2016 at 6:41 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> The bottom line is that I think we're really "pixel peeping" at this
> point --- which is what obsessed digital photographers will do when
> debating the quality of a Canon vs Nikon DSLR by blowing up a photo by
> a thousand times, and then trying to claim that this is visible to the
> human eye.  Or people who obsessing over the frequency response curves
> of TH-X00 headphones with Mahogony vs Purpleheart wood, when it's
> likely that in a blind head-to-head comparison, most people wouldn't
> be able to tell the difference....

This is hilarious, thanks for the laugh. I believe you're right about this...

>
> I think the main argument for using the batched getrandom approach is
> that it, I would argue, simpler than introducing siphash into the
> picture.  On 64-bit platforms it is faster and more consistent, so
> it's basically that versus complexity of having to adding siphash to
> the things that people have to analyze when considering random number
> security on Linux.   But it's a close call either way, I think.

I find this compelling. We'll have one csprng for both
get_random_int/long and for /dev/urandom, and we'll be able to update
that silly warning on the comment of get_random_int/long to read
"gives output of either rdrand quality or of /dev/urandom quality",
which makes it more useful for other things. It introduces less error
prone code, and it lets the RNG analysis be spent on just one RNG, not
two.

So, with your blessing, I'm going to move ahead with implementing a
pretty version of this for v8.

Regards,
Jason

^ permalink raw reply

* Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)
From: George Spelvin @ 2016-12-22  8:02 UTC (permalink / raw)
  To: linux, luto
  Cc: ak, davem, David.Laight, djb, ebiggers3, eric.dumazet, hannes,
	Jason, jeanphilippe.aumasson, kernel-hardening, linux-crypto,
	linux-kernel, luto, netdev, tom, torvalds, tytso, vegard.nossum
In-Reply-To: <CALCETrU84-EeU91AkoAZLWrhK=FSrBgV139oSSM-Vw5Mc0mdAg@mail.gmail.com>

> True, but it's called get_random_int(), and it seems like making it
> stronger, especially if the performance cost is low to zero, is a good
> thing.

If it's cheap enough, I don't mind.  But it's documented as being
marginal-quality, used where speed is more important.

In particular, it's *not* used for key material; only values that matter
only as long as they are in use.  Whule they're in use, they can't be
concealed from an attacker with kernel access, and when they're dne
being used, they're worthless.

>> If you want anti-backtracking, though, it's easy to add.  What we
>> hash is:
>>
>> entropy_0 || secret || output_0 || entropy_1 || secret || output_1 || ...
>>
>> You mix the output word right back in to the (unfinalized) state after
>> generating it.  This is still equivalent to unmodified back-box SipHash,
>> you're just using a (conceptually independent) SipHash invocation to
>> produce some of its input.

> Ah, cute.  This could probably be sped up by doing something like:
>
> entropy_0 || secret || output_0 ^ entropy_1 || secret || ...

I'm disinclined to do that because that requires deferring the mixing
until the *next* time you generate something.  Storing the value you
don't want revealed by a state capture defeats the purpose.

I'd rather mix it in immediately, so you have anti-backtracking from
the moment of creation.

Also, I don't think it's particularly "cute" or clever; mixing output back
in is the standard way all antibacktracking is accomplished.  It's how
the Davies-Meyer hash construction makes a reversible function one-way.

(There is a second way to do it by throwing away state, but that's
expensive in seed entropy.)

> It's a little weak because the output is only 64 bits, so you could
> plausibly backtrack it on a GPU or FPGA cluster or on an ASIC if the
> old entropy is guessable.  I suspect there are sneaky ways around it
> like using output_n-1 ^ output_n-2 or similar.  I'll sleep on it.

Ah, yes, I see.  Given the final state, you guess the output word, go
back one round, then forward the finalization rounds.   Is the output
equal to the guessed output?  You'll find the true value, plus
Poisson(1 - 2^-64) additional.  (Since you have 2^64-1 chances at
something with probability 1 in 2^64.)

And this can be iterated as often as you like to get earlier output words,
as long as you can guess the entropy.  *That's* the part that hurts;
you'd like something that peters out.

You could use the double-length-output SipHash variant (which requires
a second set of finalization rounds) and mix more output back, but
that's expensive.

The challenge is coming up with more unpredictable data to mix in than one
invocation of SipHash returns.  And without storing previous output
anywhere, because that is exactly wrong.

A running sum or xor or whatever of the outputs doesn't help, because
once you've guessed the last output, you can backtrack that for no
additional effort.

State capture is incredibly difficult, our application doesn't require
resistance anyway... unless you can think of something cheap, I think
we can just live with this.

>> I'd *like* to persuade you that skipping the padding byte wouldn't
>> invalidate any security proofs, because it's true and would simplify
>> the code.  But if you want 100% stock, I'm willing to cater to that.

> I lean toward stock in the absence of a particularly good reason.  At
> the very least I'd want to read that paper carefully.

Er... adding the length is standard Merkle-Damgaard strengthening.
Why you do this is described in the original papers by Merkle and Damgaard.

The lay summary is at
https://en.wikipedia.org/wiki/Merkle-Damgard_construction

The original sources are:
http://www.merkle.com/papers/Thesis1979.pdf
http://saluc.engr.uconn.edu/refs/algorithms/hashalg/damgard89adesign.pdf

Merkle describes the construction; Damgaard proves it secure.  Basically,
appending the length is required to handle variable-length input if the
input is not itself self-delimiting.

The proof of security is theorem 3.1 in the latter.  (The first, more
detailed explanation involves the use of an extra bit, which the second
then explains how todo without.)

In particular, see the top of page 420, which notes that the security
proof only requires encoding *how much padding is added* in the final
block, not the overall length of the message, and the second remark on
p. 421 which notes that no such suffix is required if it's not necessary
to distinguish messages with different numbers of trailing null bytes.

The rules are alluded to in the "Choice of padding rule" part of the
"Rationale" section of the SipHash paper (p. 7), but the description is
very brief because it assumes the reader has the background.

That's why they say "We could have chosen a slightly simpler padding rule,
such as appending a <tt>80</tt> byte followed by zeroes."

The thing is, if the amount of the last block that is used is fixed
(within the domain of a particular key), you don't need to encode this
information at all.

^ permalink raw reply

* Re: dm-crypt optimization
From: Binoy Jayan @ 2016-12-22  8:25 UTC (permalink / raw)
  To: Milan Broz
  Cc: Oded, Ofir, Herbert Xu, Arnd Bergmann, Mark Brown,
	Alasdair Kergon, David S. Miller, private-kwg, dm-devel,
	linux-crypto, Rajendra, Linux kernel mailing list, linux-raid,
	Shaohua Li, Mike Snitzer
In-Reply-To: <bf5e7237-2f5c-3fc6-c7d6-38c3b13ac2c3@gmail.com>

Hi Milan,

On 21 December 2016 at 18:17, Milan Broz <gmazyland@gmail.com> wrote:

> So the core problem is that your crypto accelerator can operate efficiently only
> with bigger batch sizes.

Thank you for the reply. Yes, that would be rather an improvement when having
bigger block sizes.

> How big blocks your crypto hw need to be able to operate more efficiently?
> What about 4k blocks (no batches), could it be usable trade-off?

The benchmark results for Qualcomm Snapdragon SoC's (mentioned below) show
significant improvement with 4K blocks but in batches of all such contiguous
segments in the block layer's request queue in the form of a chained
scatterlist.
However, it uses the algorithm 'aes-xts' instead of the conventional
'essiv-cbc-aes'
used in dm-crypt. Also, it uses the device mapper dm-req-crypt instead
of dm-cypt.

http://nelenkov.blogspot.in/2015/05/hardware-accelerated-disk-encryption-in.html
Section : 'Performance'

Its reports and IO rate of 46.3MB/s compared to an IO rate of 25.1MB/s while
using a software-based FDE (based on dm-crypt).  But I am not sure how genuine
this data is or how it was tested.

Since qualcomm SoC's use hardware backed keystore for managing keys and since
there is no easy way to make dm-crypt work with qualcomm's engines, I do not
have solid benchmark data to show an improved performance when using 4k blocks.

> With some (backward incompatible) changes in LUKS format I would like to see support
> for encryption blocks equivalent to sectors size, so it basically means for 4k drive 4k
> encryption block.
> (This should decrease overhead, now is everything processed on 512 blocks only.)
>
> Support of bigger block sizes would be unsafe without additional mechanism that provides
> atomic writes of multiple sectors. Maybe it applies to 4k as well on some devices though...)

Did you mean write to the crypto output buffers or the actual disk write?
I didn't quite understand how the block size for encryption affects atomic
writes as it is the block layer which handles them. As far as dm-crypt is,
concerned it just encrypts/decrypts a 'struct bio' instance and submits the IO
operation to the block layer.

> The above is not going against your proposal, I am just curious if this is enough
> to provide better performance on your hw accelerator or not.

May be I should be able to procure an open crypto board and get back to you with
some results. Or may be show even a marginal improvement while using software
algorithm by avoiding the crypto overhead for every 512 bytes.

-Binoy

^ permalink raw reply

* Re: [RFC PATCH v2] crypto: Add IV generation algorithms
From: Herbert Xu @ 2016-12-22  8:55 UTC (permalink / raw)
  To: Milan Broz
  Cc: Binoy Jayan, Oded, Ofir, David S. Miller, linux-crypto,
	Mark Brown, Arnd Bergmann, linux-kernel, Alasdair Kergon,
	Mike Snitzer, dm-devel, Shaohua Li, linux-raid, Rajendra
In-Reply-To: <d6d92865-98fa-4d02-035f-9080bc265c35@gmail.com>

On Tue, Dec 13, 2016 at 11:01:08AM +0100, Milan Broz wrote:
>
> By the move everything to cryptoAPI we are basically introducing some strange mix
> of IV and modes there, I wonder how this is going to be maintained.
> Anyway, Herbert should say if it is ok...

Well there is precedent in how do the IPsec IV generation.  In
that case the IV generators too are completely specific to that
application, i.e., IPsec.  However, the way structured it allowed
us to have one single entry path from the IPsec stack into the
crypto layer regardless of whether you are using AEAD or traditional
encryption/hashing algorithms.

For IPsec we make the IV generators behave like normal AEAD
algorithms, except that they take the sequence number as the IV.

The goal here are obviously different.  However, by employing
the same method as we do in IPsec, it appears to me that you
can effectively process multiple blocks at once instead of having
to supply one block at a time due to the IV generation issue.

> I really do not think the disk encryption key management should be moved
> outside of dm-crypt. We cannot then change key structure later easily.

It doesn't have to live outside of dm-crypt.  You can register
these IV generators from there if you really want.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox