* [PATCH 1/4] lib: Add xxhash module
@ 2017-06-22 22:01 Nick Terrell
  2017-06-22 22:01 ` [PATCH 3/4] btrfs: Add zstd support Nick Terrell
  2017-06-22 22:01 ` [PATCH 4/4] squashfs: Add zstd support Nick Terrell
  0 siblings, 2 replies; 18+ messages in thread
From: Nick Terrell @ 2017-06-22 22:01 UTC (permalink / raw)
  To: Nick Terrell
  Cc: kernel-team, Chris Mason, Yann Collet, squashfs-devel,
	linux-btrfs, linux-kernel

Add an xxhash kernel module with xxh32 and xxh64 hashes. xxhash is an
extremely fast non-cryptographic hash algorithm for checksumming.
The zstd compression and decompression modules added in the next patch
require xxhash. I extracted it from zstd since it is useful on its
own. I copied the code from the upstream xxHash source repository and
translated it into kernel style. I ran benchmarks and tests in the
kernel and tests in userland.
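
For reference, here is a minimal sketch (not part of the patch) of how a
kernel caller might use the one-shot and streaming interfaces added below;
`hash_example()` and its arguments are hypothetical:

    #include <linux/bug.h>
    #include <linux/xxhash.h>

    static uint64_t hash_example(const void *buf, size_t len)
    {
            const uint8_t *p = buf;
            struct xxh64_state state;
            /* One-shot interface: hash the whole buffer at once. */
            uint64_t hash = xxh64(buf, len, 0);

            /* Streaming interface: feed the same data in two chunks. */
            xxh64_reset(&state, 0);
            xxh64_update(&state, p, len / 2);
            xxh64_update(&state, p + len / 2, len - len / 2);

            /* Both interfaces must produce the same value. */
            WARN_ON(xxh64_digest(&state) != hash);
            return hash;
    }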

I benchmarked xxhash as a special character device. I ran it in four
modes: no-op, xxh32, xxh64, and crc32. The no-op mode simply copies the
data to kernel space and ignores it. The xxh32, xxh64, and crc32 modes
compute hashes on the copied data. I also ran it with four different
buffer sizes.
The benchmark file is located in the upstream zstd source repository under
`contrib/linux-kernel/xxhash_test.c` [1].
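
The test module itself is not included in this patch. As a rough sketch
(names and buffer handling are assumed here, not taken from
xxhash_test.c), its write path copies each chunk from userland and hashes
it in the selected mode:

    #include <linux/fs.h>
    #include <linux/kernel.h>
    #include <linux/uaccess.h>
    #include <linux/xxhash.h>

    static char test_buf[4096];     /* hypothetical copy buffer */
    static u64 test_hash;

    static ssize_t xxhash_test_write(struct file *file,
                    const char __user *ubuf, size_t count, loff_t *pos)
    {
            size_t len = min_t(size_t, count, sizeof(test_buf));

            if (copy_from_user(test_buf, ubuf, len))
                    return -EFAULT;
            test_hash = xxh64(test_buf, len, 0); /* no-op mode skips this */
            return len;
    }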

I ran the benchmarks on an Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
16 GB of RAM, and an SSD. I benchmarked using the file `filesystem.squashfs`
from `ubuntu-16.10-desktop-amd64.iso`, which is 1,536,217,088 bytes.
Run the following commands for the benchmark:

    modprobe xxhash_test
    mknod xxhash_test c 245 0
    time cp filesystem.squashfs xxhash_test

The time is the runtime of the userland `cp`, as reported by `time`.
The GB/s is computed with

    1,536,217,088 B / time(buffer size, hash)

which includes the time to copy from userland.
The Adjusted GB/s is computed with

    1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)).
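
For example, the 8192 B xxh64 row below works out to
1,536,217,088 B / (0.464 s - 0.351 s) ≈ 13.6 GB/s, matching the 13.60 in
the Adjusted GB/s column.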


| Buffer Size (B) | Hash  | Time (s) | GB/s | Adjusted GB/s |
|-----------------|-------|----------|------|---------------|
|            1024 | none  |    0.408 | 3.77 |             - |
|            1024 | xxh32 |    0.649 | 2.37 |          6.37 |
|            1024 | xxh64 |    0.542 | 2.83 |         11.46 |
|            1024 | crc32 |    1.290 | 1.19 |          1.74 |
|            4096 | none  |    0.380 | 4.04 |             - |
|            4096 | xxh32 |    0.645 | 2.38 |          5.79 |
|            4096 | xxh64 |    0.500 | 3.07 |         12.80 |
|            4096 | crc32 |    1.168 | 1.32 |          1.95 |
|            8192 | none  |    0.351 | 4.38 |             - |
|            8192 | xxh32 |    0.614 | 2.50 |          5.84 |
|            8192 | xxh64 |    0.464 | 3.31 |         13.60 |
|            8192 | crc32 |    1.163 | 1.32 |          1.89 |
|           16384 | none  |    0.346 | 4.43 |             - |
|           16384 | xxh32 |    0.590 | 2.60 |          6.30 |
|           16384 | xxh64 |    0.466 | 3.30 |         12.80 |
|           16384 | crc32 |    1.183 | 1.30 |          1.84 |

Tested in userland using the test suite in the zstd repo under
`contrib/linux-kernel/test/XXHashUserlandTest.cpp` [2] by mocking the
kernel functions. A line in each branch of every function in `xxhash.c`
was commented out to ensure that the test suite fails. Additionally,
xxhash was tested while testing zstd and with SMHasher [3].

[1] https://phabricator.intern.facebook.com/P57526246
[2] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/test/XXHashUserlandTest.cpp
[3] https://github.com/aappleby/smhasher

zstd source repository: https://github.com/facebook/zstd
XXHash source repository: https://github.com/cyan4973/xxhash

Signed-off-by: Nick Terrell <terrelln@fb.com>
---
 include/linux/xxhash.h | 236 +++++++++++++++++++++++
 lib/Kconfig            |   3 +
 lib/Makefile           |   1 +
 lib/xxhash.c           | 500 +++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 740 insertions(+)
 create mode 100644 include/linux/xxhash.h
 create mode 100644 lib/xxhash.c

diff --git a/include/linux/xxhash.h b/include/linux/xxhash.h
new file mode 100644
index 0000000..9e1f42c
--- /dev/null
+++ b/include/linux/xxhash.h
@@ -0,0 +1,236 @@
+/*
+ * xxHash - Extremely Fast Hash algorithm
+ * Copyright (C) 2012-2016, Yann Collet.
+ *
+ * BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php)
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ *   * Redistributions of source code must retain the above copyright
+ *     notice, this list of conditions and the following disclaimer.
+ *   * Redistributions in binary form must reproduce the above
+ *     copyright notice, this list of conditions and the following disclaimer
+ *     in the documentation and/or other materials provided with the
+ *     distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * This program is free software; you can redistribute it and/or modify it under
+ * the terms of the GNU General Public License version 2 as published by the
+ * Free Software Foundation. This program is dual-licensed; you may select
+ * either version 2 of the GNU General Public License ("GPL") or BSD license
+ * ("BSD").
+ *
+ * You can contact the author at:
+ * - xxHash homepage: http://cyan4973.github.io/xxHash/
+ * - xxHash source repository: https://github.com/Cyan4973/xxHash
+ */
+
+/*
+ * Notice extracted from xxHash homepage:
+ *
+ * xxHash is an extremely fast Hash algorithm, running at RAM speed limits.
+ * It also successfully passes all tests from the SMHasher suite.
+ *
+ * Comparison (single thread, Windows Seven 32 bits, using SMHasher on a Core 2
+ * Duo @3GHz)
+ *
+ * Name            Speed       Q.Score   Author
+ * xxHash          5.4 GB/s     10
+ * CrapWow         3.2 GB/s      2       Andrew
+ * MurmurHash 3a   2.7 GB/s     10       Austin Appleby
+ * SpookyHash      2.0 GB/s     10       Bob Jenkins
+ * SBox            1.4 GB/s      9       Bret Mulvey
+ * Lookup3         1.2 GB/s      9       Bob Jenkins
+ * SuperFastHash   1.2 GB/s      1       Paul Hsieh
+ * CityHash64      1.05 GB/s    10       Pike & Alakuijala
+ * FNV             0.55 GB/s     5       Fowler, Noll, Vo
+ * CRC32           0.43 GB/s     9
+ * MD5-32          0.33 GB/s    10       Ronald L. Rivest
+ * SHA1-32         0.28 GB/s    10
+ *
+ * Q.Score is a measure of quality of the hash function.
+ * It depends on successfully passing SMHasher test set.
+ * 10 is a perfect score.
+ *
+ * A 64-bit version, named xxh64, offers much better speed,
+ * but only for 64-bit applications.
+ * Name     Speed on 64 bits    Speed on 32 bits
+ * xxh64       13.8 GB/s            1.9 GB/s
+ * xxh32        6.8 GB/s            6.0 GB/s
+ */
+
+#ifndef XXHASH_H
+#define XXHASH_H
+
+#include <linux/types.h>
+
+/*-****************************
+ * Simple Hash Functions
+ *****************************/
+
+/**
+ * xxh32() - calculate the 32-bit hash of the input with a given seed.
+ *
+ * @input:  The data to hash.
+ * @length: The length of the data to hash.
+ * @seed:   The seed can be used to alter the result predictably.
+ *
+ * Speed on Core 2 Duo @ 3 GHz (single thread, SMHasher benchmark) : 5.4 GB/s
+ *
+ * Return:  The 32-bit hash of the data.
+ */
+uint32_t xxh32(const void *input, size_t length, uint32_t seed);
+
+/**
+ * xxh64() - calculate the 64-bit hash of the input with a given seed.
+ *
+ * @input:  The data to hash.
+ * @length: The length of the data to hash.
+ * @seed:   The seed can be used to alter the result predictably.
+ *
+ * This function runs 2x faster on 64-bit systems, but slower on 32-bit systems.
+ *
+ * Return:  The 64-bit hash of the data.
+ */
+uint64_t xxh64(const void *input, size_t length, uint64_t seed);
+
+/*-****************************
+ * Streaming Hash Functions
+ *****************************/
+
+/*
+ * These definitions are only meant to allow allocation of XXH state
+ * statically, on stack, or in a struct for example.
+ * Do not use members directly.
+ */
+
+/**
+ * struct xxh32_state - private xxh32 state, do not use members directly
+ */
+struct xxh32_state {
+	uint32_t total_len_32;
+	uint32_t large_len;
+	uint32_t v1;
+	uint32_t v2;
+	uint32_t v3;
+	uint32_t v4;
+	uint32_t mem32[4];
+	uint32_t memsize;
+};
+
+/**
+ * struct xxh64_state - private xxh64 state, do not use members directly
+ */
+struct xxh64_state {
+	uint64_t total_len;
+	uint64_t v1;
+	uint64_t v2;
+	uint64_t v3;
+	uint64_t v4;
+	uint64_t mem64[4];
+	uint32_t memsize;
+};
+
+/**
+ * xxh32_reset() - reset the xxh32 state to start a new hashing operation
+ *
+ * @state: The xxh32 state to reset.
+ * @seed:  Initialize the hash state with this seed.
+ *
+ * Call this function on any xxh32_state to prepare for a new hashing operation.
+ */
+void xxh32_reset(struct xxh32_state *state, uint32_t seed);
+
+/**
+ * xxh32_update() - hash the data given and update the xxh32 state
+ *
+ * @state:  The xxh32 state to update.
+ * @input:  The data to hash.
+ * @length: The length of the data to hash.
+ *
+ * After calling xxh32_reset() call xxh32_update() as many times as necessary.
+ *
+ * Return:  Zero on success, otherwise an error code.
+ */
+int xxh32_update(struct xxh32_state *state, const void *input, size_t length);
+
+/**
+ * xxh32_digest() - produce the current xxh32 hash
+ *
+ * @state: Produce the current xxh32 hash of this state.
+ *
+ * A hash value can be produced at any time. It is still possible to continue
+ * inserting input into the hash state after a call to xxh32_digest(), and
+ * generate new hashes later on, by calling xxh32_digest() again.
+ *
+ * Return: The xxh32 hash stored in the state.
+ */
+uint32_t xxh32_digest(const struct xxh32_state *state);
+
+/**
+ * xxh64_reset() - reset the xxh64 state to start a new hashing operation
+ *
+ * @state: The xxh64 state to reset.
+ * @seed:  Initialize the hash state with this seed.
+ */
+void xxh64_reset(struct xxh64_state *state, uint64_t seed);
+
+/**
+ * xxh64_update() - hash the data given and update the xxh64 state
+ * @state:  The xxh64 state to update.
+ * @input:  The data to hash.
+ * @length: The length of the data to hash.
+ *
+ * After calling xxh64_reset() call xxh64_update() as many times as necessary.
+ *
+ * Return:  Zero on success, otherwise an error code.
+ */
+int xxh64_update(struct xxh64_state *state, const void *input, size_t length);
+
+/**
+ * xxh64_digest() - produce the current xxh64 hash
+ *
+ * @state: Produce the current xxh64 hash of this state.
+ *
+ * A hash value can be produced at any time. It is still possible to continue
+ * inserting input into the hash state after a call to xxh64_digest(), and
+ * generate new hashes later on, by calling xxh64_digest() again.
+ *
+ * Return: The xxh64 hash stored in the state.
+ */
+uint64_t xxh64_digest(const struct xxh64_state *state);
+
+/*-**************************
+ * Utils
+ ***************************/
+
+/**
+ * xxh32_copy_state() - copy the source state into the destination state
+ *
+ * @src: The source xxh32 state.
+ * @dst: The destination xxh32 state.
+ */
+void xxh32_copy_state(struct xxh32_state *dst, const struct xxh32_state *src);
+
+/**
+ * xxh64_copy_state() - copy the source state into the destination state
+ *
+ * @src: The source xxh64 state.
+ * @dst: The destination xxh64 state.
+ */
+void xxh64_copy_state(struct xxh64_state *dst, const struct xxh64_state *src);
+
+#endif /* XXHASH_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index 0c8b78a..b6009d7 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -184,6 +184,9 @@ config CRC8
 	  when they need to do cyclic redundancy check according CRC8
 	  algorithm. Module will be called crc8.
 
+config XXHASH
+	tristate
+
 config AUDIT_GENERIC
 	bool
 	depends on AUDIT && !AUDIT_ARCH
diff --git a/lib/Makefile b/lib/Makefile
index 0166fbc..1338226 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -102,6 +102,7 @@ obj-$(CONFIG_CRC32_SELFTEST)	+= crc32test.o
 obj-$(CONFIG_CRC7)	+= crc7.o
 obj-$(CONFIG_LIBCRC32C)	+= libcrc32c.o
 obj-$(CONFIG_CRC8)	+= crc8.o
+obj-$(CONFIG_XXHASH)	+= xxhash.o
 obj-$(CONFIG_GENERIC_ALLOCATOR) += genalloc.o
 
 obj-$(CONFIG_842_COMPRESS) += 842/
diff --git a/lib/xxhash.c b/lib/xxhash.c
new file mode 100644
index 0000000..dc94904
--- /dev/null
+++ b/lib/xxhash.c
@@ -0,0 +1,500 @@
+/*
+ * xxHash - Extremely Fast Hash algorithm
+ * Copyright (C) 2012-2016, Yann Collet.
+ *
+ * BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php)
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ *   * Redistributions of source code must retain the above copyright
+ *     notice, this list of conditions and the following disclaimer.
+ *   * Redistributions in binary form must reproduce the above
+ *     copyright notice, this list of conditions and the following disclaimer
+ *     in the documentation and/or other materials provided with the
+ *     distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * This program is free software; you can redistribute it and/or modify it under
+ * the terms of the GNU General Public License version 2 as published by the
+ * Free Software Foundation. This program is dual-licensed; you may select
+ * either version 2 of the GNU General Public License ("GPL") or BSD license
+ * ("BSD").
+ *
+ * You can contact the author at:
+ * - xxHash homepage: http://cyan4973.github.io/xxHash/
+ * - xxHash source repository: https://github.com/Cyan4973/xxHash
+ */
+
+#include <asm/unaligned.h>
+#include <linux/errno.h>
+#include <linux/compiler.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/string.h>
+#include <linux/xxhash.h>
+
+/*-*************************************
+ * Macros
+ **************************************/
+#define xxh_rotl32(x, r) ((x << r) | (x >> (32 - r)))
+#define xxh_rotl64(x, r) ((x << r) | (x >> (64 - r)))
+
+#ifdef __LITTLE_ENDIAN
+# define XXH_CPU_LITTLE_ENDIAN 1
+#else
+# define XXH_CPU_LITTLE_ENDIAN 0
+#endif
+
+/*-*************************************
+ * Constants
+ **************************************/
+static const uint32_t PRIME32_1 = 2654435761U;
+static const uint32_t PRIME32_2 = 2246822519U;
+static const uint32_t PRIME32_3 = 3266489917U;
+static const uint32_t PRIME32_4 =  668265263U;
+static const uint32_t PRIME32_5 =  374761393U;
+
+static const uint64_t PRIME64_1 = 11400714785074694791ULL;
+static const uint64_t PRIME64_2 = 14029467366897019727ULL;
+static const uint64_t PRIME64_3 =  1609587929392839161ULL;
+static const uint64_t PRIME64_4 =  9650029242287828579ULL;
+static const uint64_t PRIME64_5 =  2870177450012600261ULL;
+
+/*-**************************
+ *  Utils
+ ***************************/
+void xxh32_copy_state(struct xxh32_state *dst, const struct xxh32_state *src)
+{
+	memcpy(dst, src, sizeof(*dst));
+}
+EXPORT_SYMBOL(xxh32_copy_state);
+
+void xxh64_copy_state(struct xxh64_state *dst, const struct xxh64_state *src)
+{
+	memcpy(dst, src, sizeof(*dst));
+}
+EXPORT_SYMBOL(xxh64_copy_state);
+
+/*-***************************
+ * Simple Hash Functions
+ ****************************/
+static uint32_t xxh32_round(uint32_t seed, const uint32_t input)
+{
+	seed += input * PRIME32_2;
+	seed = xxh_rotl32(seed, 13);
+	seed *= PRIME32_1;
+	return seed;
+}
+
+uint32_t xxh32(const void *input, const size_t len, const uint32_t seed)
+{
+	const uint8_t *p = (const uint8_t *)input;
+	const uint8_t *b_end = p + len;
+	uint32_t h32;
+
+	if (len >= 16) {
+		const uint8_t *const limit = b_end - 16;
+		uint32_t v1 = seed + PRIME32_1 + PRIME32_2;
+		uint32_t v2 = seed + PRIME32_2;
+		uint32_t v3 = seed + 0;
+		uint32_t v4 = seed - PRIME32_1;
+
+		do {
+			v1 = xxh32_round(v1, get_unaligned_le32(p));
+			p += 4;
+			v2 = xxh32_round(v2, get_unaligned_le32(p));
+			p += 4;
+			v3 = xxh32_round(v3, get_unaligned_le32(p));
+			p += 4;
+			v4 = xxh32_round(v4, get_unaligned_le32(p));
+			p += 4;
+		} while (p <= limit);
+
+		h32 = xxh_rotl32(v1, 1) + xxh_rotl32(v2, 7) +
+			xxh_rotl32(v3, 12) + xxh_rotl32(v4, 18);
+	} else {
+		h32 = seed + PRIME32_5;
+	}
+
+	h32 += (uint32_t)len;
+
+	while (p + 4 <= b_end) {
+		h32 += get_unaligned_le32(p) * PRIME32_3;
+		h32 = xxh_rotl32(h32, 17) * PRIME32_4;
+		p += 4;
+	}
+
+	while (p < b_end) {
+		h32 += (*p) * PRIME32_5;
+		h32 = xxh_rotl32(h32, 11) * PRIME32_1;
+		p++;
+	}
+
+	h32 ^= h32 >> 15;
+	h32 *= PRIME32_2;
+	h32 ^= h32 >> 13;
+	h32 *= PRIME32_3;
+	h32 ^= h32 >> 16;
+
+	return h32;
+}
+EXPORT_SYMBOL(xxh32);
+
+static uint64_t xxh64_round(uint64_t acc, const uint64_t input)
+{
+	acc += input * PRIME64_2;
+	acc = xxh_rotl64(acc, 31);
+	acc *= PRIME64_1;
+	return acc;
+}
+
+static uint64_t xxh64_merge_round(uint64_t acc, uint64_t val)
+{
+	val = xxh64_round(0, val);
+	acc ^= val;
+	acc = acc * PRIME64_1 + PRIME64_4;
+	return acc;
+}
+
+uint64_t xxh64(const void *input, const size_t len, const uint64_t seed)
+{
+	const uint8_t *p = (const uint8_t *)input;
+	const uint8_t *const b_end = p + len;
+	uint64_t h64;
+
+	if (len >= 32) {
+		const uint8_t *const limit = b_end - 32;
+		uint64_t v1 = seed + PRIME64_1 + PRIME64_2;
+		uint64_t v2 = seed + PRIME64_2;
+		uint64_t v3 = seed + 0;
+		uint64_t v4 = seed - PRIME64_1;
+
+		do {
+			v1 = xxh64_round(v1, get_unaligned_le64(p));
+			p += 8;
+			v2 = xxh64_round(v2, get_unaligned_le64(p));
+			p += 8;
+			v3 = xxh64_round(v3, get_unaligned_le64(p));
+			p += 8;
+			v4 = xxh64_round(v4, get_unaligned_le64(p));
+			p += 8;
+		} while (p <= limit);
+
+		h64 = xxh_rotl64(v1, 1) + xxh_rotl64(v2, 7) +
+			xxh_rotl64(v3, 12) + xxh_rotl64(v4, 18);
+		h64 = xxh64_merge_round(h64, v1);
+		h64 = xxh64_merge_round(h64, v2);
+		h64 = xxh64_merge_round(h64, v3);
+		h64 = xxh64_merge_round(h64, v4);
+
+	} else {
+		h64  = seed + PRIME64_5;
+	}
+
+	h64 += (uint64_t)len;
+
+	while (p + 8 <= b_end) {
+		const uint64_t k1 = xxh64_round(0, get_unaligned_le64(p));
+
+		h64 ^= k1;
+		h64 = xxh_rotl64(h64, 27) * PRIME64_1 + PRIME64_4;
+		p += 8;
+	}
+
+	if (p + 4 <= b_end) {
+		h64 ^= (uint64_t)(get_unaligned_le32(p)) * PRIME64_1;
+		h64 = xxh_rotl64(h64, 23) * PRIME64_2 + PRIME64_3;
+		p += 4;
+	}
+
+	while (p < b_end) {
+		h64 ^= (*p) * PRIME64_5;
+		h64 = xxh_rotl64(h64, 11) * PRIME64_1;
+		p++;
+	}
+
+	h64 ^= h64 >> 33;
+	h64 *= PRIME64_2;
+	h64 ^= h64 >> 29;
+	h64 *= PRIME64_3;
+	h64 ^= h64 >> 32;
+
+	return h64;
+}
+EXPORT_SYMBOL(xxh64);
+
+/*-**************************************************
+ * Advanced Hash Functions
+ ***************************************************/
+void xxh32_reset(struct xxh32_state *statePtr, const uint32_t seed)
+{
+	/* use a local state for memcpy() to avoid strict-aliasing warnings */
+	struct xxh32_state state;
+
+	memset(&state, 0, sizeof(state));
+	state.v1 = seed + PRIME32_1 + PRIME32_2;
+	state.v2 = seed + PRIME32_2;
+	state.v3 = seed + 0;
+	state.v4 = seed - PRIME32_1;
+	memcpy(statePtr, &state, sizeof(state));
+}
+EXPORT_SYMBOL(xxh32_reset);
+
+void xxh64_reset(struct xxh64_state *statePtr, const uint64_t seed)
+{
+	/* use a local state for memcpy() to avoid strict-aliasing warnings */
+	struct xxh64_state state;
+
+	memset(&state, 0, sizeof(state));
+	state.v1 = seed + PRIME64_1 + PRIME64_2;
+	state.v2 = seed + PRIME64_2;
+	state.v3 = seed + 0;
+	state.v4 = seed - PRIME64_1;
+	memcpy(statePtr, &state, sizeof(state));
+}
+EXPORT_SYMBOL(xxh64_reset);
+
+int xxh32_update(struct xxh32_state *state, const void *input, const size_t len)
+{
+	const uint8_t *p = (const uint8_t *)input;
+	const uint8_t *const b_end = p + len;
+
+	if (input == NULL)
+		return -EINVAL;
+
+	state->total_len_32 += (uint32_t)len;
+	state->large_len |= (len >= 16) | (state->total_len_32 >= 16);
+
+	if (state->memsize + len < 16) { /* fill in tmp buffer */
+		memcpy((uint8_t *)(state->mem32) + state->memsize, input, len);
+		state->memsize += (uint32_t)len;
+		return 0;
+	}
+
+	if (state->memsize) { /* some data left from previous update */
+		const uint32_t *p32 = state->mem32;
+
+		memcpy((uint8_t *)(state->mem32) + state->memsize, input,
+			16 - state->memsize);
+
+		state->v1 = xxh32_round(state->v1, get_unaligned_le32(p32));
+		p32++;
+		state->v2 = xxh32_round(state->v2, get_unaligned_le32(p32));
+		p32++;
+		state->v3 = xxh32_round(state->v3, get_unaligned_le32(p32));
+		p32++;
+		state->v4 = xxh32_round(state->v4, get_unaligned_le32(p32));
+		p32++;
+
+		p += 16 - state->memsize;
+		state->memsize = 0;
+	}
+
+	if (p <= b_end - 16) {
+		const uint8_t *const limit = b_end - 16;
+		uint32_t v1 = state->v1;
+		uint32_t v2 = state->v2;
+		uint32_t v3 = state->v3;
+		uint32_t v4 = state->v4;
+
+		do {
+			v1 = xxh32_round(v1, get_unaligned_le32(p));
+			p += 4;
+			v2 = xxh32_round(v2, get_unaligned_le32(p));
+			p += 4;
+			v3 = xxh32_round(v3, get_unaligned_le32(p));
+			p += 4;
+			v4 = xxh32_round(v4, get_unaligned_le32(p));
+			p += 4;
+		} while (p <= limit);
+
+		state->v1 = v1;
+		state->v2 = v2;
+		state->v3 = v3;
+		state->v4 = v4;
+	}
+
+	if (p < b_end) {
+		memcpy(state->mem32, p, (size_t)(b_end - p));
+		state->memsize = (uint32_t)(b_end - p);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(xxh32_update);
+
+uint32_t xxh32_digest(const struct xxh32_state *state)
+{
+	const uint8_t *p = (const uint8_t *)state->mem32;
+	const uint8_t *const b_end = (const uint8_t *)(state->mem32) +
+		state->memsize;
+	uint32_t h32;
+
+	if (state->large_len) {
+		h32 = xxh_rotl32(state->v1, 1) + xxh_rotl32(state->v2, 7) +
+			xxh_rotl32(state->v3, 12) + xxh_rotl32(state->v4, 18);
+	} else {
+		h32 = state->v3 /* == seed */ + PRIME32_5;
+	}
+
+	h32 += state->total_len_32;
+
+	while (p + 4 <= b_end) {
+		h32 += get_unaligned_le32(p) * PRIME32_3;
+		h32 = xxh_rotl32(h32, 17) * PRIME32_4;
+		p += 4;
+	}
+
+	while (p < b_end) {
+		h32 += (*p) * PRIME32_5;
+		h32 = xxh_rotl32(h32, 11) * PRIME32_1;
+		p++;
+	}
+
+	h32 ^= h32 >> 15;
+	h32 *= PRIME32_2;
+	h32 ^= h32 >> 13;
+	h32 *= PRIME32_3;
+	h32 ^= h32 >> 16;
+
+	return h32;
+}
+EXPORT_SYMBOL(xxh32_digest);
+
+int xxh64_update(struct xxh64_state *state, const void *input, const size_t len)
+{
+	const uint8_t *p = (const uint8_t *)input;
+	const uint8_t *const b_end = p + len;
+
+	if (input == NULL)
+		return -EINVAL;
+
+	state->total_len += len;
+
+	if (state->memsize + len < 32) { /* fill in tmp buffer */
+		memcpy(((uint8_t *)state->mem64) + state->memsize, input, len);
+		state->memsize += (uint32_t)len;
+		return 0;
+	}
+
+	if (state->memsize) { /* tmp buffer is full */
+		const uint64_t *p64 = state->mem64;
+
+		memcpy(((uint8_t *)p64) + state->memsize, input,
+			32 - state->memsize);
+
+		state->v1 = xxh64_round(state->v1, get_unaligned_le64(p64));
+		p64++;
+		state->v2 = xxh64_round(state->v2, get_unaligned_le64(p64));
+		p64++;
+		state->v3 = xxh64_round(state->v3, get_unaligned_le64(p64));
+		p64++;
+		state->v4 = xxh64_round(state->v4, get_unaligned_le64(p64));
+
+		p += 32 - state->memsize;
+		state->memsize = 0;
+	}
+
+	if (p + 32 <= b_end) {
+		const uint8_t *const limit = b_end - 32;
+		uint64_t v1 = state->v1;
+		uint64_t v2 = state->v2;
+		uint64_t v3 = state->v3;
+		uint64_t v4 = state->v4;
+
+		do {
+			v1 = xxh64_round(v1, get_unaligned_le64(p));
+			p += 8;
+			v2 = xxh64_round(v2, get_unaligned_le64(p));
+			p += 8;
+			v3 = xxh64_round(v3, get_unaligned_le64(p));
+			p += 8;
+			v4 = xxh64_round(v4, get_unaligned_le64(p));
+			p += 8;
+		} while (p <= limit);
+
+		state->v1 = v1;
+		state->v2 = v2;
+		state->v3 = v3;
+		state->v4 = v4;
+	}
+
+	if (p < b_end) {
+		memcpy(state->mem64, p, (size_t)(b_end - p));
+		state->memsize = (uint32_t)(b_end - p);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(xxh64_update);
+
+uint64_t xxh64_digest(const struct xxh64_state *state)
+{
+	const uint8_t *p = (const uint8_t *)state->mem64;
+	const uint8_t *const b_end = (const uint8_t *)state->mem64 +
+		state->memsize;
+	uint64_t h64;
+
+	if (state->total_len >= 32) {
+		const uint64_t v1 = state->v1;
+		const uint64_t v2 = state->v2;
+		const uint64_t v3 = state->v3;
+		const uint64_t v4 = state->v4;
+
+		h64 = xxh_rotl64(v1, 1) + xxh_rotl64(v2, 7) +
+			xxh_rotl64(v3, 12) + xxh_rotl64(v4, 18);
+		h64 = xxh64_merge_round(h64, v1);
+		h64 = xxh64_merge_round(h64, v2);
+		h64 = xxh64_merge_round(h64, v3);
+		h64 = xxh64_merge_round(h64, v4);
+	} else {
+		h64  = state->v3 + PRIME64_5;
+	}
+
+	h64 += (uint64_t)state->total_len;
+
+	while (p + 8 <= b_end) {
+		const uint64_t k1 = xxh64_round(0, get_unaligned_le64(p));
+
+		h64 ^= k1;
+		h64 = xxh_rotl64(h64, 27) * PRIME64_1 + PRIME64_4;
+		p += 8;
+	}
+
+	if (p + 4 <= b_end) {
+		h64 ^= (uint64_t)(get_unaligned_le32(p)) * PRIME64_1;
+		h64 = xxh_rotl64(h64, 23) * PRIME64_2 + PRIME64_3;
+		p += 4;
+	}
+
+	while (p < b_end) {
+		h64 ^= (*p) * PRIME64_5;
+		h64 = xxh_rotl64(h64, 11) * PRIME64_1;
+		p++;
+	}
+
+	h64 ^= h64 >> 33;
+	h64 *= PRIME64_2;
+	h64 ^= h64 >> 29;
+	h64 *= PRIME64_3;
+	h64 ^= h64 >> 32;
+
+	return h64;
+}
+EXPORT_SYMBOL(xxh64_digest);
+
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_DESCRIPTION("xxHash");
-- 
2.9.3



* [PATCH 3/4] btrfs: Add zstd support
  2017-06-22 22:01 [PATCH 1/4] lib: Add xxhash module Nick Terrell
@ 2017-06-22 22:01 ` Nick Terrell
  2017-06-25 15:02   ` kbuild test robot
  2017-06-25 19:03   ` kbuild test robot
  2017-06-22 22:01 ` [PATCH 4/4] squashfs: Add zstd support Nick Terrell
  1 sibling, 2 replies; 18+ messages in thread
From: Nick Terrell @ 2017-06-22 22:01 UTC (permalink / raw)
  To: Nick Terrell
  Cc: kernel-team, Chris Mason, Yann Collet, squashfs-devel,
	linux-btrfs, linux-kernel

Add zstd compression and decompression support to btrfs. zstd at its
fastest level compresses almost as well as zlib, while offering much
faster compression and decompression, approaching lzo speeds.

I benchmarked btrfs with zstd compression against no compression, lzo
compression, and zlib compression. I benchmarked two scenarios: copying
a set of files to btrfs and then reading them back, and copying a tarball
to btrfs, extracting it, and then reading the extracted files.
After every operation, I call `sync` and include the sync time.
Between every pair of operations I unmount and remount the filesystem
to avoid caching. The benchmark files can be found in the upstream
zstd source repository under
`contrib/linux-kernel/{btrfs-benchmark.sh,btrfs-extract-benchmark.sh}`
[1] [2].
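
With this patch applied, zstd is selected like the existing btrfs
compression types (the device and mount point below are placeholders):

    mount -t btrfs -o compress=zstd /dev/sdb /mnt/btrfs
    mount -t btrfs -o compress-force=zstd /dev/sdb /mnt/btrfs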

I ran the benchmarks on an Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
16 GB of RAM, and an SSD.

The first compression benchmark copies 10 copies of the unzipped
Silesia corpus [3] into a btrfs filesystem mounted with
`-o compress-force=Method`. The decompression benchmark times how long
it takes to `tar` all 10 copies into `/dev/null`. The compression ratio is
measured by comparing the output of `df` and `du`. See the benchmark file
[1] for details. I benchmarked multiple zstd compression levels, although
the patch uses zstd level 1.

| Method  | Ratio | Compression MB/s | Decompression MB/s  |
|---------|-------|------------------|---------------------|
| None    |  0.99 |              504 |                 686 |
| lzo     |  1.66 |              398 |                 442 |
| zlib    |  2.58 |               65 |                 241 |
| zstd 1  |  2.57 |              260 |                 383 |
| zstd 3  |  2.71 |              174 |                 408 |
| zstd 6  |  2.87 |               70 |                 398 |
| zstd 9  |  2.92 |               43 |                 406 |
| zstd 12 |  2.93 |               21 |                 408 |
| zstd 15 |  3.01 |               11 |                 354 |

The next benchmark first copies `linux-4.11.6.tar` [4] to btrfs. Then it
measures the compression ratio, extracts the tar, and deletes the tar.
Then it measures the compression ratio again, and `tar`s the extracted
files into `/dev/null`. See the benchmark file [2] for details.

| Method | Tar Ratio | Extract Ratio | Copy (s) | Extract (s) | Read (s) |
|--------|-----------|---------------|----------|-------------|----------|
| None   |      0.97 |          0.78 |    0.981 |       5.501 |    8.807 |
| lzo    |      2.06 |          1.38 |    1.631 |       8.458 |    8.585 |
| zlib   |      3.40 |          1.86 |    7.750 |      21.544 |   11.744 |
| zstd 1 |      3.57 |          1.85 |    2.579 |      11.479 |    9.389 |

[1] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-benchmark.sh
[2] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/btrfs-extract-benchmark.sh
[3] http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
[4] https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.11.6.tar.xz

zstd source repository: https://github.com/facebook/zstd

Signed-off-by: Nick Terrell <terrelln@fb.com>
---
 fs/btrfs/Kconfig           |   2 +
 fs/btrfs/Makefile          |   2 +-
 fs/btrfs/compression.c     |   1 +
 fs/btrfs/compression.h     |   6 +-
 fs/btrfs/ctree.h           |   1 +
 fs/btrfs/disk-io.c         |   2 +
 fs/btrfs/ioctl.c           |   6 +-
 fs/btrfs/props.c           |   6 +
 fs/btrfs/super.c           |  12 +-
 fs/btrfs/sysfs.c           |   2 +
 fs/btrfs/zstd.c            | 433 +++++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/btrfs.h |   8 +-
 12 files changed, 469 insertions(+), 12 deletions(-)
 create mode 100644 fs/btrfs/zstd.c

diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
index 80e9c18..a26c63b 100644
--- a/fs/btrfs/Kconfig
+++ b/fs/btrfs/Kconfig
@@ -6,6 +6,8 @@ config BTRFS_FS
 	select ZLIB_DEFLATE
 	select LZO_COMPRESS
 	select LZO_DECOMPRESS
+	select ZSTD_COMPRESS
+	select ZSTD_DECOMPRESS
 	select RAID6_PQ
 	select XOR_BLOCKS
 	select SRCU
diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 128ce17..962a95a 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -6,7 +6,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   transaction.o inode.o file.o tree-defrag.o \
 	   extent_map.o sysfs.o struct-funcs.o xattr.o ordered-data.o \
 	   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
-	   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
+	   export.o tree-log.o free-space-cache.o zlib.o lzo.o zstd.o \
 	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
 	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
 	   uuid-tree.o props.o hash.o free-space-tree.o
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 10e6b28..3beb0d0 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -761,6 +761,7 @@ static struct {
 static const struct btrfs_compress_op * const btrfs_compress_op[] = {
 	&btrfs_zlib_compress,
 	&btrfs_lzo_compress,
+	&btrfs_zstd_compress,
 };

 void __init btrfs_init_compress(void)
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index 39ec43a..d99fc21 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -60,8 +60,9 @@ enum btrfs_compression_type {
 	BTRFS_COMPRESS_NONE  = 0,
 	BTRFS_COMPRESS_ZLIB  = 1,
 	BTRFS_COMPRESS_LZO   = 2,
-	BTRFS_COMPRESS_TYPES = 2,
-	BTRFS_COMPRESS_LAST  = 3,
+	BTRFS_COMPRESS_ZSTD  = 3,
+	BTRFS_COMPRESS_TYPES = 3,
+	BTRFS_COMPRESS_LAST  = 4,
 };

 struct btrfs_compress_op {
@@ -92,5 +93,6 @@ struct btrfs_compress_op {

 extern const struct btrfs_compress_op btrfs_zlib_compress;
 extern const struct btrfs_compress_op btrfs_lzo_compress;
+extern const struct btrfs_compress_op btrfs_zstd_compress;

 #endif
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4f8f75d..61dd3dd 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -271,6 +271,7 @@ struct btrfs_super_block {
 	 BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS |		\
 	 BTRFS_FEATURE_INCOMPAT_BIG_METADATA |		\
 	 BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO |		\
+	 BTRFS_FEATURE_INCOMPAT_COMPRESS_ZSTD |		\
 	 BTRFS_FEATURE_INCOMPAT_RAID56 |		\
 	 BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF |		\
 	 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |	\
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5f678dc..49c0e91 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2831,6 +2831,8 @@ int open_ctree(struct super_block *sb,
 	features |= BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF;
 	if (fs_info->compress_type == BTRFS_COMPRESS_LZO)
 		features |= BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO;
+	else if (fs_info->compress_type == BTRFS_COMPRESS_ZSTD)
+		features |= BTRFS_FEATURE_INCOMPAT_COMPRESS_ZSTD;

 	if (features & BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA)
 		btrfs_info(fs_info, "has skinny extents");
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index e176375..f732cfd 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -327,8 +327,10 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)

 		if (fs_info->compress_type == BTRFS_COMPRESS_LZO)
 			comp = "lzo";
-		else
+		else if (fs_info->compress_type == BTRFS_COMPRESS_ZLIB)
 			comp = "zlib";
+		else
+			comp = "zstd";
 		ret = btrfs_set_prop(inode, "btrfs.compression",
 				     comp, strlen(comp), 0);
 		if (ret)
@@ -1463,6 +1465,8 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,

 	if (range->compress_type == BTRFS_COMPRESS_LZO) {
 		btrfs_set_fs_incompat(fs_info, COMPRESS_LZO);
+	} else if (range->compress_type == BTRFS_COMPRESS_ZSTD) {
+		btrfs_set_fs_incompat(fs_info, COMPRESS_ZSTD);
 	}

 	ret = defrag_count;
diff --git a/fs/btrfs/props.c b/fs/btrfs/props.c
index d6cb155..162105f 100644
--- a/fs/btrfs/props.c
+++ b/fs/btrfs/props.c
@@ -383,6 +383,8 @@ static int prop_compression_validate(const char *value, size_t len)
 		return 0;
 	else if (!strncmp("zlib", value, len))
 		return 0;
+	else if (!strncmp("zstd", value, len))
+		return 0;

 	return -EINVAL;
 }
@@ -405,6 +407,8 @@ static int prop_compression_apply(struct inode *inode,
 		type = BTRFS_COMPRESS_LZO;
 	else if (!strncmp("zlib", value, len))
 		type = BTRFS_COMPRESS_ZLIB;
+	else if (!strncmp("zstd", value, len))
+		type = BTRFS_COMPRESS_ZSTD;
 	else
 		return -EINVAL;

@@ -422,6 +426,8 @@ static const char *prop_compression_extract(struct inode *inode)
 		return "zlib";
 	case BTRFS_COMPRESS_LZO:
 		return "lzo";
+	case BTRFS_COMPRESS_ZSTD:
+		return "zstd";
 	}

 	return NULL;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 4f1cdd5..4f792d5 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -513,6 +513,14 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 				btrfs_clear_opt(info->mount_opt, NODATASUM);
 				btrfs_set_fs_incompat(info, COMPRESS_LZO);
 				no_compress = 0;
+			} else if (strcmp(args[0].from, "zstd") == 0) {
+				compress_type = "zstd";
+				info->compress_type = BTRFS_COMPRESS_ZSTD;
+				btrfs_set_opt(info->mount_opt, COMPRESS);
+				btrfs_clear_opt(info->mount_opt, NODATACOW);
+				btrfs_clear_opt(info->mount_opt, NODATASUM);
+				btrfs_set_fs_incompat(info, COMPRESS_ZSTD);
+				no_compress = 0;
 			} else if (strncmp(args[0].from, "no", 2) == 0) {
 				compress_type = "no";
 				btrfs_clear_opt(info->mount_opt, COMPRESS);
@@ -1240,8 +1248,10 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
 	if (btrfs_test_opt(info, COMPRESS)) {
 		if (info->compress_type == BTRFS_COMPRESS_ZLIB)
 			compress_type = "zlib";
-		else
+		else if (info->compress_type == BTRFS_COMPRESS_LZO)
 			compress_type = "lzo";
+		else
+			compress_type = "zstd";
 		if (btrfs_test_opt(info, FORCE_COMPRESS))
 			seq_printf(seq, ",compress-force=%s", compress_type);
 		else
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 1f157fb..b0dec90 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -200,6 +200,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(mixed_backref, MIXED_BACKREF);
 BTRFS_FEAT_ATTR_INCOMPAT(default_subvol, DEFAULT_SUBVOL);
 BTRFS_FEAT_ATTR_INCOMPAT(mixed_groups, MIXED_GROUPS);
 BTRFS_FEAT_ATTR_INCOMPAT(compress_lzo, COMPRESS_LZO);
+BTRFS_FEAT_ATTR_INCOMPAT(compress_zstd, COMPRESS_ZSTD);
 BTRFS_FEAT_ATTR_INCOMPAT(big_metadata, BIG_METADATA);
 BTRFS_FEAT_ATTR_INCOMPAT(extended_iref, EXTENDED_IREF);
 BTRFS_FEAT_ATTR_INCOMPAT(raid56, RAID56);
@@ -212,6 +213,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(default_subvol),
 	BTRFS_FEAT_ATTR_PTR(mixed_groups),
 	BTRFS_FEAT_ATTR_PTR(compress_lzo),
+	BTRFS_FEAT_ATTR_PTR(compress_zstd),
 	BTRFS_FEAT_ATTR_PTR(big_metadata),
 	BTRFS_FEAT_ATTR_PTR(extended_iref),
 	BTRFS_FEAT_ATTR_PTR(raid56),
diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c
new file mode 100644
index 0000000..838741b
--- /dev/null
+++ b/fs/btrfs/zstd.c
@@ -0,0 +1,433 @@
+/*
+ * Copyright (c) 2016-present, Facebook, Inc.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/init.h>
+#include <linux/err.h>
+#include <linux/sched.h>
+#include <linux/pagemap.h>
+#include <linux/bio.h>
+#include <linux/zstd.h>
+#include "compression.h"
+
+#define ZSTD_BTRFS_MAX_WINDOWLOG 17
+#define ZSTD_BTRFS_MAX_INPUT (1 << ZSTD_BTRFS_MAX_WINDOWLOG)
+
+static ZSTD_parameters zstd_get_btrfs_parameters(size_t src_len)
+{
+	ZSTD_parameters params = ZSTD_getParams(1, src_len, 0);
+
+	if (params.cParams.windowLog > ZSTD_BTRFS_MAX_WINDOWLOG)
+		params.cParams.windowLog = ZSTD_BTRFS_MAX_WINDOWLOG;
+	WARN_ON(src_len > ZSTD_BTRFS_MAX_INPUT);
+	return params;
+}
+
+struct workspace {
+	void *mem;
+	size_t size;
+	char *buf;
+	struct list_head list;
+};
+
+static void zstd_free_workspace(struct list_head *ws)
+{
+	struct workspace *workspace = list_entry(ws, struct workspace, list);
+
+	vfree(workspace->mem);
+	kfree(workspace->buf);
+	kfree(workspace);
+}
+
+static struct list_head *zstd_alloc_workspace(void)
+{
+	ZSTD_parameters params =
+			zstd_get_btrfs_parameters(ZSTD_BTRFS_MAX_INPUT);
+	struct workspace *workspace;
+
+	workspace = kzalloc(sizeof(*workspace), GFP_NOFS);
+	if (!workspace)
+		return ERR_PTR(-ENOMEM);
+
+	workspace->size = max_t(size_t,
+			ZSTD_CStreamWorkspaceBound(params.cParams),
+			ZSTD_DStreamWorkspaceBound(ZSTD_BTRFS_MAX_INPUT));
+	workspace->mem = vmalloc(workspace->size);
+	workspace->buf = kmalloc(PAGE_SIZE, GFP_NOFS);
+	if (!workspace->mem || !workspace->buf)
+		goto fail;
+
+	INIT_LIST_HEAD(&workspace->list);
+
+	return &workspace->list;
+fail:
+	zstd_free_workspace(&workspace->list);
+	return ERR_PTR(-ENOMEM);
+}
+
+static int zstd_compress_pages(struct list_head *ws,
+		struct address_space *mapping,
+		u64 start,
+		struct page **pages,
+		unsigned long *out_pages,
+		unsigned long *total_in,
+		unsigned long *total_out)
+{
+	struct workspace *workspace = list_entry(ws, struct workspace, list);
+	ZSTD_CStream *stream;
+	int ret = 0;
+	int nr_pages = 0;
+	struct page *in_page = NULL;  /* The current page to read */
+	struct page *out_page = NULL; /* The current page to write to */
+	ZSTD_inBuffer in_buf = { NULL, 0, 0 };
+	ZSTD_outBuffer out_buf = { NULL, 0, 0 };
+	unsigned long tot_in = 0;
+	unsigned long tot_out = 0;
+	unsigned long len = *total_out;
+	const unsigned long nr_dest_pages = *out_pages;
+	unsigned long max_out = nr_dest_pages * PAGE_SIZE;
+	ZSTD_parameters params = zstd_get_btrfs_parameters(len);
+
+	*out_pages = 0;
+	*total_out = 0;
+	*total_in = 0;
+
+	/* Initialize the stream */
+	stream = ZSTD_initCStream(params, len, workspace->mem,
+			workspace->size);
+	if (!stream) {
+		pr_warn("BTRFS: ZSTD_initCStream failed\n");
+		ret = -EIO;
+		goto out;
+	}
+
+	/* map in the first page of input data */
+	in_page = find_get_page(mapping, start >> PAGE_SHIFT);
+	in_buf.src = kmap(in_page);
+	in_buf.pos = 0;
+	in_buf.size = min_t(size_t, len, PAGE_SIZE);
+
+
+	/* Allocate and map in the output buffer */
+	out_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+	if (out_page == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	pages[nr_pages++] = out_page;
+	out_buf.dst = kmap(out_page);
+	out_buf.pos = 0;
+	out_buf.size = min_t(size_t, max_out, PAGE_SIZE);
+
+	while (1) {
+		size_t ret2;
+
+		ret2 = ZSTD_compressStream(stream, &out_buf, &in_buf);
+		if (ZSTD_isError(ret2)) {
+			pr_debug("BTRFS: ZSTD_compressStream returned %d\n",
+					ZSTD_getErrorCode(ret2));
+			ret = -EIO;
+			goto out;
+		}
+
+		/* Check to see if we are making it bigger */
+		if (tot_in + in_buf.pos > 8192 &&
+				tot_in + in_buf.pos <
+				tot_out + out_buf.pos) {
+			ret = -E2BIG;
+			goto out;
+		}
+
+		/* We've reached the end of our output range */
+		if (out_buf.pos >= max_out) {
+			tot_out += out_buf.pos;
+			ret = -E2BIG;
+			goto out;
+		}
+
+		/* Check if we need more output space */
+		if (out_buf.pos == out_buf.size) {
+			tot_out += PAGE_SIZE;
+			max_out -= PAGE_SIZE;
+			kunmap(out_page);
+			if (nr_pages == nr_dest_pages) {
+				out_page = NULL;
+				ret = -E2BIG;
+				goto out;
+			}
+			out_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+			if (out_page == NULL) {
+				ret = -ENOMEM;
+				goto out;
+			}
+			pages[nr_pages++] = out_page;
+			out_buf.dst = kmap(out_page);
+			out_buf.pos = 0;
+			out_buf.size = min_t(size_t, max_out, PAGE_SIZE);
+		}
+
+		/* We've reached the end of the input */
+		if (in_buf.pos >= len) {
+			tot_in += in_buf.pos;
+			break;
+		}
+
+		/* Check if we need more input */
+		if (in_buf.pos == in_buf.size) {
+			tot_in += PAGE_SIZE;
+			kunmap(in_page);
+			put_page(in_page);
+
+			start += PAGE_SIZE;
+			len -= PAGE_SIZE;
+			in_page = find_get_page(mapping, start >> PAGE_SHIFT);
+			in_buf.src = kmap(in_page);
+			in_buf.pos = 0;
+			in_buf.size = min_t(size_t, len, PAGE_SIZE);
+		}
+	}
+	while (1) {
+		size_t ret2;
+
+		ret2 = ZSTD_endStream(stream, &out_buf);
+		if (ZSTD_isError(ret2)) {
+			pr_debug("BTRFS: ZSTD_endStream returned %d\n",
+					ZSTD_getErrorCode(ret2));
+			ret = -EIO;
+			goto out;
+		}
+		if (ret2 == 0) {
+			tot_out += out_buf.pos;
+			break;
+		}
+		if (out_buf.pos >= max_out) {
+			tot_out += out_buf.pos;
+			ret = -E2BIG;
+			goto out;
+		}
+
+		tot_out += PAGE_SIZE;
+		max_out -= PAGE_SIZE;
+		kunmap(out_page);
+		if (nr_pages == nr_dest_pages) {
+			out_page = NULL;
+			ret = -E2BIG;
+			goto out;
+		}
+		out_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+		if (out_page == NULL) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		pages[nr_pages++] = out_page;
+		out_buf.dst = kmap(out_page);
+		out_buf.pos = 0;
+		out_buf.size = min_t(size_t, max_out, PAGE_SIZE);
+	}
+
+	if (tot_out >= tot_in) {
+		ret = -E2BIG;
+		goto out;
+	}
+
+	ret = 0;
+	*total_in = tot_in;
+	*total_out = tot_out;
+out:
+	*out_pages = nr_pages;
+	/* Cleanup */
+	if (in_page) {
+		kunmap(in_page);
+		put_page(in_page);
+	}
+	if (out_page)
+		kunmap(out_page);
+	return ret;
+}
+
+static int zstd_decompress_bio(struct list_head *ws, struct page **pages_in,
+		u64 disk_start,
+		struct bio *orig_bio,
+		size_t srclen)
+{
+	struct workspace *workspace = list_entry(ws, struct workspace, list);
+	ZSTD_DStream *stream;
+	int ret = 0;
+	unsigned long page_in_index = 0;
+	unsigned long total_pages_in = DIV_ROUND_UP(srclen, PAGE_SIZE);
+	unsigned long buf_start;
+	unsigned long total_out = 0;
+	ZSTD_inBuffer in_buf = { NULL, 0, 0 };
+	ZSTD_outBuffer out_buf = { NULL, 0, 0 };
+
+	stream = ZSTD_initDStream(
+			ZSTD_BTRFS_MAX_INPUT, workspace->mem, workspace->size);
+	if (!stream) {
+		pr_debug("BTRFS: ZSTD_initDStream failed\n");
+		ret = -EIO;
+		goto done;
+	}
+
+	in_buf.src = kmap(pages_in[page_in_index]);
+	in_buf.pos = 0;
+	in_buf.size = min_t(size_t, srclen, PAGE_SIZE);
+
+	out_buf.dst = workspace->buf;
+	out_buf.pos = 0;
+	out_buf.size = PAGE_SIZE;
+
+	while (1) {
+		size_t ret2;
+
+		ret2 = ZSTD_decompressStream(stream, &out_buf, &in_buf);
+		if (ZSTD_isError(ret2)) {
+			pr_debug("BTRFS: ZSTD_decompressStream returned %d\n",
+					ZSTD_getErrorCode(ret2));
+			ret = -EIO;
+			goto done;
+		}
+		buf_start = total_out;
+		total_out += out_buf.pos;
+		out_buf.pos = 0;
+
+		ret = btrfs_decompress_buf2page(out_buf.dst, buf_start,
+				total_out, disk_start, orig_bio);
+		if (ret == 0)
+			break;
+
+		if (in_buf.pos >= srclen)
+			break;
+
+		/* Check if we've hit the end of a frame */
+		if (ret2 == 0)
+			break;
+
+		if (in_buf.pos == in_buf.size) {
+			kunmap(pages_in[page_in_index++]);
+			if (page_in_index >= total_pages_in) {
+				in_buf.src = NULL;
+				ret = -EIO;
+				goto done;
+			}
+			srclen -= PAGE_SIZE;
+			in_buf.src = kmap(pages_in[page_in_index]);
+			in_buf.pos = 0;
+			in_buf.size = min_t(size_t, srclen, PAGE_SIZE);
+		}
+	}
+	ret = 0;
+	zero_fill_bio(orig_bio);
+done:
+	if (in_buf.src)
+		kunmap(pages_in[page_in_index]);
+	return ret;
+}
+
+static int zstd_decompress(struct list_head *ws, unsigned char *data_in,
+		struct page *dest_page,
+		unsigned long start_byte,
+		size_t srclen, size_t destlen)
+{
+	struct workspace *workspace = list_entry(ws, struct workspace, list);
+	ZSTD_DStream *stream;
+	int ret = 0;
+	size_t ret2;
+	ZSTD_inBuffer in_buf = { NULL, 0, 0 };
+	ZSTD_outBuffer out_buf = { NULL, 0, 0 };
+	unsigned long total_out = 0;
+	unsigned long pg_offset = 0;
+	char *kaddr;
+
+	stream = ZSTD_initDStream(
+			ZSTD_BTRFS_MAX_INPUT, workspace->mem, workspace->size);
+	if (!stream) {
+		pr_warn("BTRFS: ZSTD_initDStream failed\n");
+		ret = -EIO;
+		goto finish;
+	}
+
+	destlen = min_t(size_t, destlen, PAGE_SIZE);
+
+	in_buf.src = data_in;
+	in_buf.pos = 0;
+	in_buf.size = srclen;
+
+	out_buf.dst = workspace->buf;
+	out_buf.pos = 0;
+	out_buf.size = PAGE_SIZE;
+
+	ret2 = 1;
+	while (pg_offset < destlen && in_buf.pos < in_buf.size) {
+		unsigned long buf_start;
+		unsigned long buf_offset;
+		unsigned long bytes;
+
+		/* Check if the frame is over and we still need more input */
+		if (ret2 == 0) {
+			pr_debug("BTRFS: ZSTD_decompressStream ended early\n");
+			ret = -EIO;
+			goto finish;
+		}
+		ret2 = ZSTD_decompressStream(stream, &out_buf, &in_buf);
+		if (ZSTD_isError(ret2)) {
+			pr_debug("BTRFS: ZSTD_decompressStream returned %d\n",
+					ZSTD_getErrorCode(ret2));
+			ret = -EIO;
+			goto finish;
+		}
+
+		buf_start = total_out;
+		total_out += out_buf.pos;
+		out_buf.pos = 0;
+
+		if (total_out <= start_byte)
+			continue;
+
+		if (total_out > start_byte && buf_start < start_byte)
+			buf_offset = start_byte - buf_start;
+		else
+			buf_offset = 0;
+
+		bytes = min_t(unsigned long, destlen - pg_offset,
+				out_buf.size - buf_offset);
+
+		kaddr = kmap_atomic(dest_page);
+		memcpy(kaddr + pg_offset, out_buf.dst + buf_offset, bytes);
+		kunmap_atomic(kaddr);
+
+		pg_offset += bytes;
+	}
+	ret = 0;
+finish:
+	if (pg_offset < destlen) {
+		kaddr = kmap_atomic(dest_page);
+		memset(kaddr + pg_offset, 0, destlen - pg_offset);
+		kunmap_atomic(kaddr);
+	}
+	return ret;
+}
+
+const struct btrfs_compress_op btrfs_zstd_compress = {
+	.alloc_workspace = zstd_alloc_workspace,
+	.free_workspace = zstd_free_workspace,
+	.compress_pages = zstd_compress_pages,
+	.decompress_bio = zstd_decompress_bio,
+	.decompress = zstd_decompress,
+};
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index a456e53..992c150 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -255,13 +255,7 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
 #define BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS	(1ULL << 2)
 #define BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO	(1ULL << 3)
-/*
- * some patches floated around with a second compression method
- * lets save that incompat here for when they do get in
- * Note we don't actually support it, we're just reserving the
- * number
- */
-#define BTRFS_FEATURE_INCOMPAT_COMPRESS_LZOv2	(1ULL << 4)
+#define BTRFS_FEATURE_INCOMPAT_COMPRESS_ZSTD	(1ULL << 4)

 /*
  * older kernels tried to do bigger metadata blocks, but the
--
2.9.3


* [PATCH 4/4] squashfs: Add zstd support
  2017-06-22 22:01 [PATCH 1/4] lib: Add xxhash module Nick Terrell
  2017-06-22 22:01 ` [PATCH 3/4] btrfs: Add zstd support Nick Terrell
@ 2017-06-22 22:01 ` Nick Terrell
  1 sibling, 0 replies; 18+ messages in thread
From: Nick Terrell @ 2017-06-22 22:01 UTC (permalink / raw)
  To: Nick Terrell
  Cc: kernel-team, Chris Mason, Yann Collet, squashfs-devel,
	linux-btrfs, linux-kernel, Sean Purcell

Add zstd compression and decompression support to SquashFS. zstd is a
great fit for SquashFS because it can compress at ratios approaching xz,
while decompressing twice as fast as zlib. For SquashFS in particular,
it can decompress as fast as lzo and lz4. It also has the flexibility
to turn down the compression ratio for faster compression times.

The compression benchmark is run on the file tree from the SquashFS archive
found in ubuntu-16.10-desktop-amd64.iso [1]. It uses `mksquashfs` with the
default block size (128 KB) and various compression algorithms/levels.
xz and zstd are also benchmarked with 256 KB blocks. The decompression
benchmark times how long it takes to `tar` the file tree into `/dev/null`.
See the benchmark file in the upstream zstd source repository located under
`contrib/linux-kernel/squashfs-benchmark.sh` [2] for details.
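
The kernel patch only adds read support; building the benchmarked images
requires a squashfs-tools with matching zstd support. Assuming such a
build, the invocations look roughly like:

    mksquashfs rootdir img.squashfs -comp zstd
    mksquashfs rootdir img-256k.squashfs -comp zstd -b 262144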

I ran the benchmarks on an Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
16 GB of RAM, and an SSD.

| Method         | Ratio | Compression MB/s | Decompression MB/s |
|----------------|-------|------------------|--------------------|
| gzip           |  2.92 |               15 |                128 |
| lzo            |  2.64 |              9.5 |                217 |
| lz4            |  2.12 |               94 |                218 |
| xz             |  3.43 |              5.5 |                 35 |
| xz 256 KB      |  3.53 |              5.4 |                 40 |
| zstd 1         |  2.71 |               96 |                210 |
| zstd 5         |  2.93 |               69 |                198 |
| zstd 10        |  3.01 |               41 |                225 |
| zstd 15        |  3.13 |             11.4 |                224 |
| zstd 16 256 KB |  3.24 |              8.1 |                210 |

This patch was written by Sean Purcell <me@seanp.xyz>, but I will be
taking over the submission process.

[1] http://releases.ubuntu.com/16.10/
[2] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/squashfs-benchmark.sh

zstd source repository: https://github.com/facebook/zstd

Cc: Sean Purcell <me@seanp.xyz>
Signed-off-by: Nick Terrell <terrelln@fb.com>
---
 fs/squashfs/Kconfig        |  14 +++++
 fs/squashfs/Makefile       |   1 +
 fs/squashfs/decompressor.c |   7 +++
 fs/squashfs/decompressor.h |   4 ++
 fs/squashfs/squashfs_fs.h  |   1 +
 fs/squashfs/zstd_wrapper.c | 150 +++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 177 insertions(+)
 create mode 100644 fs/squashfs/zstd_wrapper.c

diff --git a/fs/squashfs/Kconfig b/fs/squashfs/Kconfig
index ffb093e..1adb334 100644
--- a/fs/squashfs/Kconfig
+++ b/fs/squashfs/Kconfig
@@ -165,6 +165,20 @@ config SQUASHFS_XZ

 	  If unsure, say N.

+config SQUASHFS_ZSTD
+	bool "Include support for ZSTD compressed file systems"
+	depends on SQUASHFS
+	select ZSTD_DECOMPRESS
+	help
+	  Saying Y here includes support for reading Squashfs file systems
+	  compressed with ZSTD compression.  ZSTD gives better compression than
+	  the default ZLIB compression, while using less CPU.
+
+	  ZSTD is not the standard compression used in Squashfs and so most
+	  file systems will be readable without selecting this option.
+
+	  If unsure, say N.
+
 config SQUASHFS_4K_DEVBLK_SIZE
 	bool "Use 4K device block size?"
 	depends on SQUASHFS
diff --git a/fs/squashfs/Makefile b/fs/squashfs/Makefile
index 246a6f3..6655631 100644
--- a/fs/squashfs/Makefile
+++ b/fs/squashfs/Makefile
@@ -15,3 +15,4 @@ squashfs-$(CONFIG_SQUASHFS_LZ4) += lz4_wrapper.o
 squashfs-$(CONFIG_SQUASHFS_LZO) += lzo_wrapper.o
 squashfs-$(CONFIG_SQUASHFS_XZ) += xz_wrapper.o
 squashfs-$(CONFIG_SQUASHFS_ZLIB) += zlib_wrapper.o
+squashfs-$(CONFIG_SQUASHFS_ZSTD) += zstd_wrapper.o
diff --git a/fs/squashfs/decompressor.c b/fs/squashfs/decompressor.c
index d2bc136..8366398 100644
--- a/fs/squashfs/decompressor.c
+++ b/fs/squashfs/decompressor.c
@@ -65,6 +65,12 @@ static const struct squashfs_decompressor squashfs_zlib_comp_ops = {
 };
 #endif

+#ifndef CONFIG_SQUASHFS_ZSTD
+static const struct squashfs_decompressor squashfs_zstd_comp_ops = {
+	NULL, NULL, NULL, NULL, ZSTD_COMPRESSION, "zstd", 0
+};
+#endif
+
 static const struct squashfs_decompressor squashfs_unknown_comp_ops = {
 	NULL, NULL, NULL, NULL, 0, "unknown", 0
 };
@@ -75,6 +81,7 @@ static const struct squashfs_decompressor *decompressor[] = {
 	&squashfs_lzo_comp_ops,
 	&squashfs_xz_comp_ops,
 	&squashfs_lzma_unsupported_comp_ops,
+	&squashfs_zstd_comp_ops,
 	&squashfs_unknown_comp_ops
 };

diff --git a/fs/squashfs/decompressor.h b/fs/squashfs/decompressor.h
index a25713c..0f5a8e4 100644
--- a/fs/squashfs/decompressor.h
+++ b/fs/squashfs/decompressor.h
@@ -58,4 +58,8 @@ extern const struct squashfs_decompressor squashfs_lzo_comp_ops;
 extern const struct squashfs_decompressor squashfs_zlib_comp_ops;
 #endif

+#ifdef CONFIG_SQUASHFS_ZSTD
+extern const struct squashfs_decompressor squashfs_zstd_comp_ops;
+#endif
+
 #endif
diff --git a/fs/squashfs/squashfs_fs.h b/fs/squashfs/squashfs_fs.h
index 506f4ba..24d12fd 100644
--- a/fs/squashfs/squashfs_fs.h
+++ b/fs/squashfs/squashfs_fs.h
@@ -241,6 +241,7 @@ struct meta_index {
 #define LZO_COMPRESSION		3
 #define XZ_COMPRESSION		4
 #define LZ4_COMPRESSION		5
+#define ZSTD_COMPRESSION	6

 struct squashfs_super_block {
 	__le32			s_magic;
diff --git a/fs/squashfs/zstd_wrapper.c b/fs/squashfs/zstd_wrapper.c
new file mode 100644
index 0000000..8cb7c76
--- /dev/null
+++ b/fs/squashfs/zstd_wrapper.c
@@ -0,0 +1,150 @@
+/*
+ * Squashfs - a compressed read only filesystem for Linux
+ *
+ * Copyright (c) 2016-present, Facebook, Inc.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2,
+ * or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ *
+ * zstd_wrapper.c
+ */
+
+#include <linux/mutex.h>
+#include <linux/buffer_head.h>
+#include <linux/slab.h>
+#include <linux/zstd.h>
+#include <linux/vmalloc.h>
+
+#include "squashfs_fs.h"
+#include "squashfs_fs_sb.h"
+#include "squashfs.h"
+#include "decompressor.h"
+#include "page_actor.h"
+
+struct workspace {
+	void *mem;
+	size_t mem_size;
+};
+
+static void *zstd_init(struct squashfs_sb_info *msblk, void *buff)
+{
+	struct workspace *wksp = kmalloc(sizeof(*wksp), GFP_KERNEL);
+	if (wksp == NULL)
+		goto failed;
+	wksp->mem_size = ZSTD_DStreamWorkspaceBound(max_t(size_t,
+				msblk->block_size, SQUASHFS_METADATA_SIZE));
+	wksp->mem = vmalloc(wksp->mem_size);
+	if (wksp->mem == NULL)
+		goto failed;
+
+	return wksp;
+
+failed:
+	ERROR("Failed to allocate zstd workspace\n");
+	kfree(wksp);
+	return ERR_PTR(-ENOMEM);
+}
+
+
+static void zstd_free(void *strm)
+{
+	struct workspace *wksp = strm;
+
+	if (wksp)
+		vfree(wksp->mem);
+	kfree(wksp);
+}
+
+
+static int zstd_uncompress(struct squashfs_sb_info *msblk, void *strm,
+	struct buffer_head **bh, int b, int offset, int length,
+	struct squashfs_page_actor *output)
+{
+	struct workspace *wksp = strm;
+	ZSTD_DStream *stream;
+	size_t total_out = 0;
+	size_t zstd_err;
+	int k = 0;
+	ZSTD_inBuffer in_buf = { NULL, 0, 0 };
+	ZSTD_outBuffer out_buf = { NULL, 0, 0 };
+
+	stream = ZSTD_initDStream(wksp->mem_size, wksp->mem, wksp->mem_size);
+
+	if (!stream) {
+		ERROR("Failed to initialize zstd decompressor\n");
+		goto out;
+	}
+
+	out_buf.size = PAGE_SIZE;
+	out_buf.dst = squashfs_first_page(output);
+
+	do {
+		if (in_buf.pos == in_buf.size && k < b) {
+			int avail = min(length, msblk->devblksize - offset);
+			length -= avail;
+			in_buf.src = bh[k]->b_data + offset;
+			in_buf.size = avail;
+			in_buf.pos = 0;
+			offset = 0;
+		}
+
+		if (out_buf.pos == out_buf.size) {
+			out_buf.dst = squashfs_next_page(output);
+			if (out_buf.dst == NULL) {
+				/* shouldn't run out of pages before stream is
+				 * done */
+				squashfs_finish_page(output);
+				goto out;
+			}
+			out_buf.pos = 0;
+			out_buf.size = PAGE_SIZE;
+		}
+
+		total_out -= out_buf.pos;
+		zstd_err = ZSTD_decompressStream(stream, &out_buf, &in_buf);
+		total_out += out_buf.pos; /* add the additional data produced */
+
+		if (in_buf.pos == in_buf.size && k < b)
+			put_bh(bh[k++]);
+	} while (zstd_err != 0 && !ZSTD_isError(zstd_err));
+
+	squashfs_finish_page(output);
+
+	if (ZSTD_isError(zstd_err)) {
+		ERROR("zstd decompression error: %d\n",
+				(int)ZSTD_getErrorCode(zstd_err));
+		goto out;
+	}
+
+	if (k < b)
+		goto out;
+
+	return (int)total_out;
+
+out:
+	for (; k < b; k++)
+		put_bh(bh[k]);
+
+	return -EIO;
+}
+
+const struct squashfs_decompressor squashfs_zstd_comp_ops = {
+	.init = zstd_init,
+	.free = zstd_free,
+	.decompress = zstd_uncompress,
+	.id = ZSTD_COMPRESSION,
+	.name = "zstd",
+	.supported = 1
+};
--
2.9.3

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/4] btrfs: Add zstd support
  2017-06-22 22:01 ` [PATCH 3/4] btrfs: Add zstd support Nick Terrell
@ 2017-06-25 15:02   ` kbuild test robot
  2017-06-25 19:03   ` kbuild test robot
  1 sibling, 0 replies; 18+ messages in thread
From: kbuild test robot @ 2017-06-25 15:02 UTC (permalink / raw)
  To: Nick Terrell
  Cc: kbuild-all, Nick Terrell, kernel-team, Chris Mason, Yann Collet,
	squashfs-devel, linux-btrfs, linux-kernel

Hi Nick,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.12-rc6 next-20170623]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Nick-Terrell/lib-Add-xxhash-module/20170625-214344
config: blackfin-allyesconfig (attached as .config)
compiler: bfin-uclinux-gcc (GCC) 6.2.0
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=blackfin 

All error/warnings (new ones prefixed by >>):

   lib//zstd/fse_compress.c: In function 'FSE_buildCTable_wksp':
>> lib//zstd/fse_compress.c:181:1: warning: the frame size of 1036 bytes is larger than 1024 bytes [-Wframe-larger-than=]
    }
    ^
   lib//zstd/fse_compress.c: In function 'FSE_compress_wksp':
   lib//zstd/fse_compress.c:857:1: warning: the frame size of 1552 bytes is larger than 1024 bytes [-Wframe-larger-than=]
    }
    ^
--
   lib//zstd/compress.c: In function 'ZSTD_compressBlock_lazy':
>> lib//zstd/compress.c:2036:1: error: unable to find a register to spill in class 'CCREGS'
    static void ZSTD_compressBlock_lazy(ZSTD_CCtx *ctx, const void *src, size_t srcSize) { ZSTD_compressBlock_lazy_generic(ctx, src, srcSize, 0, 1); }
    ^~~~~~
>> lib//zstd/compress.c:2036:1: error: this is the insn:
   (insn 213 11 1172 9 (set (reg:BI 1429)
           (eq:BI (reg/v:SI 62 [ mls ])
               (const_int 5 [0x5]))) lib//zstd/compress.c:1855 118 {compare_eq}
        (nil))
   lib//zstd/compress.c:2036: confused by earlier errors, bailing out
--
   lib//zstd/huf_decompress.c: In function 'HUF_readDTableX4':
>> lib//zstd/huf_decompress.c:556:1: warning: the frame size of 1636 bytes is larger than 1024 bytes [-Wframe-larger-than=]
    }
    ^

vim +/CCREGS +2036 lib//zstd/compress.c

87a5643e Nick Terrell 2017-06-22  2020  	/* Save reps for next block */
87a5643e Nick Terrell 2017-06-22  2021  	ctx->repToConfirm[0] = offset_1 ? offset_1 : savedOffset;
87a5643e Nick Terrell 2017-06-22  2022  	ctx->repToConfirm[1] = offset_2 ? offset_2 : savedOffset;
87a5643e Nick Terrell 2017-06-22  2023  
87a5643e Nick Terrell 2017-06-22  2024  	/* Last Literals */
87a5643e Nick Terrell 2017-06-22  2025  	{
87a5643e Nick Terrell 2017-06-22  2026  		size_t const lastLLSize = iend - anchor;
87a5643e Nick Terrell 2017-06-22  2027  		memcpy(seqStorePtr->lit, anchor, lastLLSize);
87a5643e Nick Terrell 2017-06-22  2028  		seqStorePtr->lit += lastLLSize;
87a5643e Nick Terrell 2017-06-22  2029  	}
87a5643e Nick Terrell 2017-06-22  2030  }
87a5643e Nick Terrell 2017-06-22  2031  
87a5643e Nick Terrell 2017-06-22  2032  static void ZSTD_compressBlock_btlazy2(ZSTD_CCtx *ctx, const void *src, size_t srcSize) { ZSTD_compressBlock_lazy_generic(ctx, src, srcSize, 1, 2); }
87a5643e Nick Terrell 2017-06-22  2033  
87a5643e Nick Terrell 2017-06-22  2034  static void ZSTD_compressBlock_lazy2(ZSTD_CCtx *ctx, const void *src, size_t srcSize) { ZSTD_compressBlock_lazy_generic(ctx, src, srcSize, 0, 2); }
87a5643e Nick Terrell 2017-06-22  2035  
87a5643e Nick Terrell 2017-06-22 @2036  static void ZSTD_compressBlock_lazy(ZSTD_CCtx *ctx, const void *src, size_t srcSize) { ZSTD_compressBlock_lazy_generic(ctx, src, srcSize, 0, 1); }
87a5643e Nick Terrell 2017-06-22  2037  
87a5643e Nick Terrell 2017-06-22  2038  static void ZSTD_compressBlock_greedy(ZSTD_CCtx *ctx, const void *src, size_t srcSize) { ZSTD_compressBlock_lazy_generic(ctx, src, srcSize, 0, 0); }
87a5643e Nick Terrell 2017-06-22  2039  
87a5643e Nick Terrell 2017-06-22  2040  FORCE_INLINE
87a5643e Nick Terrell 2017-06-22  2041  void ZSTD_compressBlock_lazy_extDict_generic(ZSTD_CCtx *ctx, const void *src, size_t srcSize, const U32 searchMethod, const U32 depth)
87a5643e Nick Terrell 2017-06-22  2042  {
87a5643e Nick Terrell 2017-06-22  2043  	seqStore_t *seqStorePtr = &(ctx->seqStore);
87a5643e Nick Terrell 2017-06-22  2044  	const BYTE *const istart = (const BYTE *)src;

:::::: The code at line 2036 was first introduced by commit
:::::: 87a5643e3b02e4cb9fb83bf8f6da13be18677883 lib: Add zstd modules

:::::: TO: Nick Terrell <terrelln@fb.com>
:::::: CC: 0day robot <fengguang.wu@intel.com>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/4] btrfs: Add zstd support
  2017-06-22 22:01 ` [PATCH 3/4] btrfs: Add zstd support Nick Terrell
  2017-06-25 15:02   ` kbuild test robot
@ 2017-06-25 19:03   ` kbuild test robot
  2017-06-25 21:30     ` Adam Borowski
  1 sibling, 1 reply; 18+ messages in thread
From: kbuild test robot @ 2017-06-25 19:03 UTC (permalink / raw)
  To: Nick Terrell
  Cc: kbuild-all, Nick Terrell, kernel-team, Chris Mason, Yann Collet,
	squashfs-devel, linux-btrfs, linux-kernel

Hi Nick,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.12-rc6]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Nick-Terrell/lib-Add-xxhash-module/20170625-214344
config: i386-allmodconfig (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

>> ERROR: "__udivdi3" [lib/zstd/zstd_compress.ko] undefined!
   ERROR: "__udivdi3" [fs/ufs/ufs.ko] undefined!

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/4] btrfs: Add zstd support
  2017-06-25 19:03   ` kbuild test robot
@ 2017-06-25 21:30     ` Adam Borowski
  2017-06-26 12:12       ` David Sterba
  0 siblings, 1 reply; 18+ messages in thread
From: Adam Borowski @ 2017-06-25 21:30 UTC (permalink / raw)
  To: Nick Terrell
  Cc: kernel-team, Chris Mason, Yann Collet, squashfs-devel,
	linux-btrfs, linux-kernel

On Mon, Jun 26, 2017 at 03:03:17AM +0800, kbuild test robot wrote:
> Hi Nick,
> 
> url:    https://github.com/0day-ci/linux/commits/Nick-Terrell/lib-Add-xxhash-module/20170625-214344
> config: i386-allmodconfig (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386 
> 
> All errors (new ones prefixed by >>):
> 
> >> ERROR: "__udivdi3" [lib/zstd/zstd_compress.ko] undefined!
>    ERROR: "__udivdi3" [fs/ufs/ufs.ko] undefined!

Just to save you time to figure it out:
for division when one or both arguments are longer than the architecture's
word, gcc uses helper functions that are included when compiling in a hosted
environment -- but not in freestanding.

Thus, you want do_div() instead of /; do check widths and signedness of
arguments.
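
As a minimal sketch of the pattern (a hypothetical helper, not code from
this patch):

    #include <linux/types.h>
    #include <asm/div64.h>

    static u64 blocks_for(u64 nbytes, u32 blksz)
    {
            u32 rem;

            /* On 32-bit, a plain 'nbytes / blksz' makes gcc emit a call
             * to the libgcc helper __udivdi3, which the freestanding
             * kernel build does not provide -- hence the link error. */

            /* do_div() divides in place: 'nbytes' becomes the quotient
             * and the macro itself returns the remainder. */
            rem = do_div(nbytes, blksz);
            (void)rem;

            return nbytes;
    }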


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄⠀⠀⠀⠀ A master species delegates.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/4] btrfs: Add zstd support
  2017-06-25 21:30     ` Adam Borowski
@ 2017-06-26 12:12       ` David Sterba
  2017-06-26 16:54         ` Nick Terrell
  2017-06-27  4:18         ` [PATCH] lib/zstd: use div_u64() to let it build on 32-bit Adam Borowski
  0 siblings, 2 replies; 18+ messages in thread
From: David Sterba @ 2017-06-26 12:12 UTC (permalink / raw)
  To: Adam Borowski
  Cc: Nick Terrell, kernel-team, Chris Mason, Yann Collet,
	squashfs-devel, linux-btrfs, linux-kernel

On Sun, Jun 25, 2017 at 11:30:22PM +0200, Adam Borowski wrote:
> On Mon, Jun 26, 2017 at 03:03:17AM +0800, kbuild test robot wrote:
> > Hi Nick,
> > 
> > url:    https://github.com/0day-ci/linux/commits/Nick-Terrell/lib-Add-xxhash-module/20170625-214344
> > config: i386-allmodconfig (attached as .config)
> > compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> > reproduce:
> >         # save the attached .config to linux build tree
> >         make ARCH=i386 
> > 
> > All errors (new ones prefixed by >>):
> > 
> > >> ERROR: "__udivdi3" [lib/zstd/zstd_compress.ko] undefined!
> >    ERROR: "__udivdi3" [fs/ufs/ufs.ko] undefined!
> 
> Just to save you time to figure it out:
> for division when one or both arguments are longer than the architecture's
> word, gcc uses helper functions that are included when compiling in a hosted
> environment -- but not in freestanding.
> 
> Thus, you want do_div() instead of /; do check widths and signedness of
> arguments.

No do_div please, div_u64 or div64_u64.
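
A minimal sketch of the suggested interfaces (a hypothetical helper with
illustrative values; the signatures live in <linux/math64.h>):

    #include <linux/math64.h>

    static u64 math64_demo(u64 total)
    {
            u64 q;

            q = div_u64(total, 4096);   /* u64 / u32, returns the quotient */
            q = div64_u64(q, 3ULL);     /* u64 / u64, returns the quotient */

            /* For comparison, do_div() overwrites its first argument
             * with the quotient and returns the remainder, which is
             * easy to misuse. */
            return q;
    }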

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/4] btrfs: Add zstd support
  2017-06-26 12:12       ` David Sterba
@ 2017-06-26 16:54         ` Nick Terrell
  2017-06-27  4:18         ` [PATCH] lib/zstd: use div_u64() to let it build on 32-bit Adam Borowski
  1 sibling, 0 replies; 18+ messages in thread
From: Nick Terrell @ 2017-06-26 16:54 UTC (permalink / raw)
  To: dsterba@suse.cz, Adam Borowski
  Cc: Kernel Team, Chris Mason, Yann Collet,
	squashfs-devel@lists.sourceforge.net, linux-btrfs@vger.kernel.org,
	linux-kernel@vger.kernel.org

Thanks for the clarification! I will fix the divisions.

On 6/26/17, 5:12 AM, "David Sterba" <dsterba@suse.cz> wrote:

    On Sun, Jun 25, 2017 at 11:30:22PM +0200, Adam Borowski wrote:
    > On Mon, Jun 26, 2017 at 03:03:17AM +0800, kbuild test robot wrote:
    > > Hi Nick,
    > > 
    > > url:    https://github.com/0day-ci/linux/commits/Nick-Terrell/lib-Add-xxhash-module/20170625-214344
    > > config: i386-allmodconfig (attached as .config)
    > > compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
    > > reproduce:
    > >         # save the attached .config to linux build tree
    > >         make ARCH=i386 
    > > 
    > > All errors (new ones prefixed by >>):
    > > 
    > > >> ERROR: "__udivdi3" [lib/zstd/zstd_compress.ko] undefined!
    > >    ERROR: "__udivdi3" [fs/ufs/ufs.ko] undefined!
    > 
    > Just to save you time to figure it out:
    > for division when one or both arguments are longer than the architecture's
    > word, gcc uses helper functions that are included when compiling in a hosted
    > environment -- but not in freestanding.
    > 
    > Thus, you want do_div() instead of /; do check widths and signedness of
    > arguments.
    
    No do_div please, div_u64 or div64_u64.
    



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH] lib/zstd: use div_u64() to let it build on 32-bit
  2017-06-26 12:12       ` David Sterba
  2017-06-26 16:54         ` Nick Terrell
@ 2017-06-27  4:18         ` Adam Borowski
  2017-06-27  5:27           ` Nick Terrell
  1 sibling, 1 reply; 18+ messages in thread
From: Adam Borowski @ 2017-06-27  4:18 UTC (permalink / raw)
  To: dsterba, Nick Terrell, kernel-team, Chris Mason, Yann Collet,
	squashfs-devel, linux-btrfs, linux-kernel
  Cc: Adam Borowski

David Sterba wrote:
> > Thus, you want do_div() instead of /; do check widths and signedness of
> > arguments.
>
> No do_div please, div_u64 or div64_u64.

Good to know, the interface of do_div() is indeed weird.

I guess Nick has found and fixed the offending divisions in his tree
already, but this patch I'm sending is what I'm testing.

One thing to note is that it divides u64 by size_t, so the actual operation
differs on 32 vs 64-bit.  Yet the code fails to handle compressing pieces
bigger than 4GB in other places -- so use of size_t is misleading.  Perhaps
u32 would better convey this limitation?

Anyway, that this code didn't even compile on 32-bit also means it hasn't
been tested.  I just happen to have such an ARM machine doing Debian archive
rebuilds; I've rewritten the chroots with compress=zstd; this should be a
nice non-artificial test.  The load consists of snapshot+dpkg+gcc/etc+
assorted testsuites, two sbuild instances.  Seems to work fine for a whole
hour (yay!) already, let's see if there'll be any explosions.


-- >8 ---- >8 ---- >8 ---- >8 ---- >8 ---- >8 ---- >8 ---- >8 ---- >8 --
Note that "total" is limited to 2³²-1 elsewhere despite being declared
as size_t, so it's ok to use 64/32 -- it's much faster on eg. x86-32
than 64/64.

Signed-off-by: Adam Borowski <kilobyte@angband.pl>
---
 lib/zstd/fse_compress.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/lib/zstd/fse_compress.c b/lib/zstd/fse_compress.c
index e016bb177833..f59f9ebfe9c0 100644
--- a/lib/zstd/fse_compress.c
+++ b/lib/zstd/fse_compress.c
@@ -49,6 +49,7 @@
 #include "fse.h"
 #include <linux/compiler.h>
 #include <linux/string.h> /* memcpy, memset */
+#include <linux/math64.h>
 
 /* **************************************************************
 *  Error Management
@@ -575,7 +576,7 @@ static size_t FSE_normalizeM2(short *norm, U32 tableLog, const unsigned *count,
 	{
 		U64 const vStepLog = 62 - tableLog;
 		U64 const mid = (1ULL << (vStepLog - 1)) - 1;
-		U64 const rStep = ((((U64)1 << vStepLog) * ToDistribute) + mid) / total; /* scale on remaining */
+		U64 const rStep = div_u64((((U64)1 << vStepLog) * ToDistribute) + mid, total); /* scale on remaining */
 		U64 tmpTotal = mid;
 		for (s = 0; s <= maxSymbolValue; s++) {
 			if (norm[s] == NOT_YET_ASSIGNED) {
@@ -609,7 +610,7 @@ size_t FSE_normalizeCount(short *normalizedCounter, unsigned tableLog, const uns
 	{
 		U32 const rtbTable[] = {0, 473195, 504333, 520860, 550000, 700000, 750000, 830000};
 		U64 const scale = 62 - tableLog;
-		U64 const step = ((U64)1 << 62) / total; /* <== here, one division ! */
+		U64 const step = div_u64((U64)1 << 62, total); /* <== here, one division ! */
 		U64 const vStep = 1ULL << (scale - 20);
 		int stillToDistribute = 1 << tableLog;
 		unsigned s;
-- 
2.13.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] lib/zstd: use div_u64() to let it build on 32-bit
  2017-06-27  4:18         ` [PATCH] lib/zstd: use div_u64() to let it build on 32-bit Adam Borowski
@ 2017-06-27  5:27           ` Nick Terrell
  2017-06-27 12:57             ` David Sterba
  2017-06-29  1:02             ` Adam Borowski
  0 siblings, 2 replies; 18+ messages in thread
From: Nick Terrell @ 2017-06-27  5:27 UTC (permalink / raw)
  To: Adam Borowski, dsterba@suse.cz, Kernel Team, Chris Mason,
	Yann Collet, squashfs-devel@lists.sourceforge.net,
	linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org

Adam, I’ve applied the same patch in my tree. I’ll send out the update [1]
once it's reviewed, since I also reduced the stack usage of functions
using over 1 KB of stack space.

You’re right that div_u64() will work, since the FSE functions are only
called on blocks of at most 128 KB at a time. Perhaps a u32 would be
clearer, but I would prefer to leave the signatures as is, to stay closer
to upstream. Upstream FSE should work with sizes larger than 4 GB, but
since it can't happen in zstd, it isn't a priority.

I have userland tests set up mocking the linux kernel headers, and tested
32-bit mode there, but neglected to test the kernel on a 32-bit VM, which
I’ve now corrected. Thanks for testing the patch on your ARM machine!

[1] https://github.com/facebook/zstd/pull/738/files

On 6/26/17, 9:18 PM, "Adam Borowski" <kilobyte@angband.pl> wrote:

    David Sterba wrote:
    > > Thus, you want do_div() instead of /; do check widths and signedness of
    > > arguments.
    >
    > No do_div please, div_u64 or div64_u64.
    
    Good to know, the interface of do_div() is indeed weird.
    
    I guess Nick has found and fixed the offending divisions in his tree
    already, but this patch I'm sending is what I'm testing.
    
    One thing to note is that it divides u64 by size_t, so the actual operation
    differs on 32 vs 64-bit.  Yet the code fails to handle compressing pieces
    bigger than 4GB in other places -- so use of size_t is misleading.  Perhaps
    u32 would better convey this limitation?
    
    Anyway, that this code didn't even compile on 32-bit also means it hasn't
    been tested.  I just happen to have such an ARM machine doing Debian archive
    rebuilds; I've rewritten the chroots with compress=zstd; this should be a
    nice non-artificial test.  The load consists of snapshot+dpkg+gcc/etc+
    assorted testsuites, two sbuild instances.  Seems to work fine for a whole
    hour (yay!) already, let's see if there'll be any explosions.
    
    
    -- >8 ---- >8 ---- >8 ---- >8 ---- >8 ---- >8 ---- >8 ---- >8 ---- >8 --
    Note that "total" is limited to 2³²-1 elsewhere despite being declared
    as size_t, so it's ok to use 64/32 -- it's much faster on eg. x86-32
    than 64/64.
    
    Signed-off-by: Adam Borowski <kilobyte@angband.pl>
    ---
     lib/zstd/fse_compress.c | 5 +++--
     1 file changed, 3 insertions(+), 2 deletions(-)
    
    diff --git a/lib/zstd/fse_compress.c b/lib/zstd/fse_compress.c
    index e016bb177833..f59f9ebfe9c0 100644
    --- a/lib/zstd/fse_compress.c
    +++ b/lib/zstd/fse_compress.c
    @@ -49,6 +49,7 @@
     #include "fse.h"
     #include <linux/compiler.h>
     #include <linux/string.h> /* memcpy, memset */
    +#include <linux/math64.h>
     
     /* **************************************************************
     *  Error Management
    @@ -575,7 +576,7 @@ static size_t FSE_normalizeM2(short *norm, U32 tableLog, const unsigned *count,
     	{
     		U64 const vStepLog = 62 - tableLog;
     		U64 const mid = (1ULL << (vStepLog - 1)) - 1;
    -		U64 const rStep = ((((U64)1 << vStepLog) * ToDistribute) + mid) / total; /* scale on remaining */
    +		U64 const rStep = div_u64((((U64)1 << vStepLog) * ToDistribute) + mid, total); /* scale on remaining */
     		U64 tmpTotal = mid;
     		for (s = 0; s <= maxSymbolValue; s++) {
     			if (norm[s] == NOT_YET_ASSIGNED) {
    @@ -609,7 +610,7 @@ size_t FSE_normalizeCount(short *normalizedCounter, unsigned tableLog, const uns
     	{
     		U32 const rtbTable[] = {0, 473195, 504333, 520860, 550000, 700000, 750000, 830000};
     		U64 const scale = 62 - tableLog;
    -		U64 const step = ((U64)1 << 62) / total; /* <== here, one division ! */
    +		U64 const step = div_u64((U64)1 << 62, total); /* <== here, one division ! */
     		U64 const vStep = 1ULL << (scale - 20);
     		int stillToDistribute = 1 << tableLog;
     		unsigned s;
    -- 
    2.13.1
    
    


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] lib/zstd: use div_u64() to let it build on 32-bit
  2017-06-27  5:27           ` Nick Terrell
@ 2017-06-27 12:57             ` David Sterba
  2017-06-28  5:29               ` Nick Terrell
  2017-06-29  1:02             ` Adam Borowski
  1 sibling, 1 reply; 18+ messages in thread
From: David Sterba @ 2017-06-27 12:57 UTC (permalink / raw)
  To: Nick Terrell
  Cc: Adam Borowski, Kernel Team, Chris Mason, Yann Collet,
	squashfs-devel@lists.sourceforge.net, linux-btrfs@vger.kernel.org,
	linux-kernel@vger.kernel.org

Please don't top post.

On Tue, Jun 27, 2017 at 05:27:51AM +0000, Nick Terrell wrote:
> Adam, I’ve applied the same patch in my tree. I’ll send out the update [1]
> once it's reviewed, since I also reduced the stack usage of functions
> using over 1 KB of stack space.

Which function needs 1KB of stack space? That's quite a lot.

I can see in [1] that there are some on-stack buffers replaced by
pointers to the workspace. That's good, but I would like to know if
there's any hidden gem that grabs the precious stack space.

> You’re right that div_u64() will work, since the FSE functions are only
> called on blocks of at most 128 KB at a time. Perhaps a u32 would be
> clearer, but I would prefer to leave the signatures as is, to stay closer
> to upstream. Upstream FSE should work with sizes larger than 4 GB, but
> since it can't happen in zstd, it isn't a priority.

Hm, I'd suggest creating a version optimized for the kernel, e.g. expecting
that a 4+ GB buffer will never be used, so you can use the most fitting
type. This should affect only the function signatures, not the
algorithm implementation, so porting future zstd changes should be
straightforward.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] lib/zstd: use div_u64() to let it build on 32-bit
  2017-06-27 12:57             ` David Sterba
@ 2017-06-28  5:29               ` Nick Terrell
  0 siblings, 0 replies; 18+ messages in thread
From: Nick Terrell @ 2017-06-28  5:29 UTC (permalink / raw)
  To: dsterba@suse.cz
  Cc: Adam Borowski, Kernel Team, Chris Mason, Yann Collet,
	squashfs-devel@lists.sourceforge.net, linux-btrfs@vger.kernel.org,
	linux-kernel@vger.kernel.org

> Please don't top post.

Sorry about that.

> Which function needs 1KB of stack space? That's quite a lot.

FSE_buildCTable_wksp(), FSE_compress_wksp(), and HUF_readDTableX4()
required over 1 KB of stack space.

> I can see in [1] that there are some on-stack buffers replaced by
> pointers to the workspace. That's good, but I would like to know if
> there's any hidden gem that grabs the precious stack space.

I've been hunting down functions that use up the most stack space and
replacing buffers with pointers to the workspace. I compiled the code
with -Wframe-larger-than=512 and reduced the stack usage of all offending
functions. In the next version of the patch, no function uses more than
400 B of stack space. We'll be porting the changes back upstream as well.
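
Schematically, the change looks like this (illustrative names and sizes;
the real functions are the FSE/HUF *_wksp variants):

    /* Before: ~1.5 KB of tables land on the kernel stack. */
    static size_t build_table(void)
    {
            u32 count[256];
            short norm[256];
            /* ... fill and use the tables ... */
            return 0;
    }

    /* After: the caller hands in a workspace and the tables live there,
     * leaving only a few pointers on the stack. */
    static size_t build_table_wksp(void *workspace, size_t wksp_size)
    {
            u32 *count = workspace;
            short *norm = (short *)(count + 256);

            if (wksp_size < 256 * sizeof(*count) + 256 * sizeof(*norm))
                    return (size_t)-1; /* undersized workspace */
            /* ... same logic as before ... */
            return 0;
    }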

> Hm, I'd suggest creating a version optimized for the kernel, e.g. expecting
> that a 4+ GB buffer will never be used, so you can use the most fitting
> type. This should affect only the function signatures, not the
> algorithm implementation, so porting future zstd changes should be
> straightforward.

If the functions were exposed, then I would agree 100%. However, since
these are internal functions, and the rest of zstd uses size_t to represent
buffer sizes, I think it would be awkward to change just FSE/HUF functions.
I also prefer size_t because it is friendlier to the optimizer, especially
the loop optimizer, since the compiler doesn't have to worry about unsigned
overflow.

On a related note, zstd performs automatic optimizations to improve
compression speed and reduce memory usage when given small sources, which
is the common case in the kernel.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] lib/zstd: use div_u64() to let it build on 32-bit
  2017-06-27  5:27           ` Nick Terrell
  2017-06-27 12:57             ` David Sterba
@ 2017-06-29  1:02             ` Adam Borowski
  2017-06-29  3:01               ` [PATCH] btrfs: Keep one more workspace around Nick Terrell
  1 sibling, 1 reply; 18+ messages in thread
From: Adam Borowski @ 2017-06-29  1:02 UTC (permalink / raw)
  To: Nick Terrell
  Cc: dsterba@suse.cz, Kernel Team, Chris Mason, Yann Collet,
	linux-btrfs@vger.kernel.org, linux-kernel@vger.kernel.org

On Tue, Jun 27, 2017 at 05:27:51AM +0000, Nick Terrell wrote:
> Adam, I’ve applied the same patch in my tree. I’ll send out the update [1]
> once it's reviewed, since I also reduced the stack usage of functions
> using over 1 KB of stack space.
> 
> I have userland tests set up mocking the linux kernel headers, and tested
> 32-bit mode there, but neglected to test the kernel on a 32-bit VM, which
> I’ve now corrected. Thanks for testing the patch on your ARM machine!

Is there a version I should be testing?

I got a bunch of those:
[10170.448783] kworker/u8:6: page allocation stalls for 60720ms, order:0, mode:0x14000c2(GFP_KERNEL|__GFP_HIGHMEM), nodemask=(null)
[10170.448819] kworker/u8:6 cpuset=/ mems_allowed=0
[10170.448842] CPU: 3 PID: 13430 Comm: kworker/u8:6 Not tainted 4.12.0-rc7-00034-gdff47ed160bb #1
[10170.448846] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
[10170.448872] Workqueue: btrfs-endio btrfs_endio_helper
[10170.448910] [<c010de1c>] (unwind_backtrace) from [<c010adb8>] (show_stack+0x10/0x14)
[10170.448925] [<c010adb8>] (show_stack) from [<c0442b00>] (dump_stack+0x78/0x8c)
[10170.448942] [<c0442b00>] (dump_stack) from [<c01b0178>] (warn_alloc+0xc0/0x170)
[10170.448952] [<c01b0178>] (warn_alloc) from [<c01b0c3c>] (__alloc_pages_nodemask+0x97c/0xe30)
[10170.448964] [<c01b0c3c>] (__alloc_pages_nodemask) from [<c01e217c>] (__vmalloc_node_range+0x144/0x27c)
[10170.448976] [<c01e217c>] (__vmalloc_node_range) from [<c01e2550>] (__vmalloc_node.constprop.10+0x48/0x50)
[10170.448982] [<c01e2550>] (__vmalloc_node.constprop.10) from [<c01e25ec>] (vmalloc+0x2c/0x34)
[10170.448990] [<c01e25ec>] (vmalloc) from [<c038f7cc>] (zstd_alloc_workspace+0x6c/0xb8)
[10170.448997] [<c038f7cc>] (zstd_alloc_workspace) from [<c038fcb8>] (find_workspace+0x120/0x1f4)
[10170.449002] [<c038fcb8>] (find_workspace) from [<c038ff60>] (end_compressed_bio_read+0x1d4/0x3b0)
[10170.449016] [<c038ff60>] (end_compressed_bio_read) from [<c0130e14>] (process_one_work+0x1d8/0x3f0)
[10170.449026] [<c0130e14>] (process_one_work) from [<c0131a18>] (worker_thread+0x38/0x558)
[10170.449035] [<c0131a18>] (worker_thread) from [<c0136854>] (kthread+0x124/0x154)
[10170.449042] [<c0136854>] (kthread) from [<c01076f8>] (ret_from_fork+0x14/0x3c)

which never happened with compress=lzo, and a 2GB RAM machine that runs 4
threads of various builds runs into memory pressure quite often.  On the
other hand, I used 4.11 for lzo so this needs more testing before I can
blame the zstd code.

Also, I had network problems all day today so the machine was mostly idle
instead of doing further tests -- not quite going to pull sources to build
over a phone connection.

I'm on linus:4.12-rc7 with only a handful of btrfs patches (v3 of Qu's chunk
check, some misc crap) -- I guess I should use at least btrfs-for-4.13.  Or
would you prefer full-blown next?


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄⠀⠀⠀⠀ A master species delegates.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH] btrfs: Keep one more workspace around
  2017-06-29  1:02             ` Adam Borowski
@ 2017-06-29  3:01               ` Nick Terrell
  2017-06-29 13:59                 ` David Sterba
  0 siblings, 1 reply; 18+ messages in thread
From: Nick Terrell @ 2017-06-29  3:01 UTC (permalink / raw)
  To: Nick Terrell
  Cc: kernel-team, Chris Mason, Yann Collet, squashfs-devel,
	linux-btrfs, linux-kernel, Adam Borowski, David Sterba

> Is there a version I should be testing?

Not yet, I'm working on v2 of the patch set, which will be ready soon.

> I got a bunch of those:
> [10170.448783] kworker/u8:6: page allocation stalls for 60720ms, order:0, mode:0x14000c2(GFP_KERNEL|__GFP_HIGHMEM), nodemask=(null)
> [10170.448819] kworker/u8:6 cpuset=/ mems_allowed=0
> [10170.448842] CPU: 3 PID: 13430 Comm: kworker/u8:6 Not tainted 4.12.0-rc7-00034-gdff47ed160bb #1
> [10170.448846] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
> [10170.448872] Workqueue: btrfs-endio btrfs_endio_helper
> [10170.448910] [<c010de1c>] (unwind_backtrace) from [<c010adb8>] (show_stack+0x10/0x14)
> [10170.448925] [<c010adb8>] (show_stack) from [<c0442b00>] (dump_stack+0x78/0x8c)
> [10170.448942] [<c0442b00>] (dump_stack) from [<c01b0178>] (warn_alloc+0xc0/0x170)
> [10170.448952] [<c01b0178>] (warn_alloc) from [<c01b0c3c>] (__alloc_pages_nodemask+0x97c/0xe30)
> [10170.448964] [<c01b0c3c>] (__alloc_pages_nodemask) from [<c01e217c>] (__vmalloc_node_range+0x144/0x27c)
> [10170.448976] [<c01e217c>] (__vmalloc_node_range) from [<c01e2550>] (__vmalloc_node.constprop.10+0x48/0x50)
> [10170.448982] [<c01e2550>] (__vmalloc_node.constprop.10) from [<c01e25ec>] (vmalloc+0x2c/0x34)
> [10170.448990] [<c01e25ec>] (vmalloc) from [<c038f7cc>] (zstd_alloc_workspace+0x6c/0xb8)
> [10170.448997] [<c038f7cc>] (zstd_alloc_workspace) from [<c038fcb8>] (find_workspace+0x120/0x1f4)
> [10170.449002] [<c038fcb8>] (find_workspace) from [<c038ff60>] (end_compressed_bio_read+0x1d4/0x3b0)
> [10170.449016] [<c038ff60>] (end_compressed_bio_read) from [<c0130e14>] (process_one_work+0x1d8/0x3f0)
> [10170.449026] [<c0130e14>] (process_one_work) from [<c0131a18>] (worker_thread+0x38/0x558)
> [10170.449035] [<c0131a18>] (worker_thread) from [<c0136854>] (kthread+0x124/0x154)
> [10170.449042] [<c0136854>] (kthread) from [<c01076f8>] (ret_from_fork+0x14/0x3c)
>
> which never happened with compress=lzo, and a 2GB RAM machine that runs 4
> threads of various builds runs into memory pressure quite often.  On the
> other hand, I used 4.11 for lzo so this needs more testing before I can
> blame the zstd code.

I'm not sure what is causing the symptom of stalls in vmalloc(), but I
think I know what is causing vmalloc() to be called so often. It's probably
showing up for zstd and not lzo because it requires more memory.

find_workspace() allocates up to num_online_cpus() + 1 workspaces.
free_workspace() will only keep num_online_cpus() workspaces. When
(de)compressing we will allocate num_online_cpus() + 1 workspaces, then
free one, and repeat. Instead, we can just keep num_online_cpus() + 1
workspaces around, and never have to allocate/free another workspace in the
common case.
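
A minimal model of that logic (simplified, with hypothetical helpers; the
real code is find_workspace()/free_workspace() in fs/btrfs/compression.c):

    #include <linux/cpumask.h>
    #include <linux/list.h>
    #include <linux/spinlock.h>
    #include <linux/vmalloc.h>

    static LIST_HEAD(idle_ws);
    static DEFINE_SPINLOCK(ws_lock);
    static int free_ws; /* idle workspaces currently cached */

    static void release_ws(struct list_head *ws)
    {
            spin_lock(&ws_lock);
            /* With '<', at most num_online_cpus() workspaces stay cached,
             * so when num_online_cpus() + 1 are in flight the last one
             * returned is always freed and must be vmalloc()ed again on
             * the next burst.  '<=' lets the pool keep all of them. */
            if (free_ws <= num_online_cpus()) {
                    list_add(ws, &idle_ws);
                    free_ws++;
                    spin_unlock(&ws_lock);
                    return;
            }
            spin_unlock(&ws_lock);
            vfree(ws); /* the model vmalloc()s each workspace */
    }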

I tested on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM. I mounted a
BtrFS partition with -o compress-force={lzo,zlib,zstd} and logged whenever
a workspace was allocated or freed. Then I copied vmlinux (527 MB) to the
partition. Before the patch, during the copy it would allocate and free 5-6
workspaces. After, it only allocated the initial 3. This held true for lzo,
zlib, and zstd.

> I'm on linus:4.12-rc7 with only a handful of btrfs patches (v3 of Qu's chunk
> check, some misc crap) -- I guess I should use at least btrfs-for-4.13.  Or
> would you prefer full-blown next?

Whatever is convenient for you. The relevant code in BtrFS hasn't changed
for a few months, so it shouldn't matter too much.

Signed-off-by: Nick Terrell <terrelln@fb.com>
---
 fs/btrfs/compression.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 3beb0d0..1a0ef55 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -874,7 +874,7 @@ static void free_workspace(int type, struct list_head *workspace)
 	int *free_ws			= &btrfs_comp_ws[idx].free_ws;

 	spin_lock(ws_lock);
-	if (*free_ws < num_online_cpus()) {
+	if (*free_ws <= num_online_cpus()) {
 		list_add(workspace, idle_ws);
 		(*free_ws)++;
 		spin_unlock(ws_lock);
--
2.9.3

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] btrfs: Keep one more workspace around
  2017-06-29  3:01               ` [PATCH] btrfs: Keep one more workspace around Nick Terrell
@ 2017-06-29 13:59                 ` David Sterba
  0 siblings, 0 replies; 18+ messages in thread
From: David Sterba @ 2017-06-29 13:59 UTC (permalink / raw)
  To: Nick Terrell
  Cc: kernel-team, Chris Mason, Yann Collet, squashfs-devel,
	linux-btrfs, linux-kernel, Adam Borowski

On Wed, Jun 28, 2017 at 08:01:51PM -0700, Nick Terrell wrote:
> > Is there a version I should be testing?
> 
> Not yet, I'm working on v2 of the patch set, which will be ready soon.
> 
> > I got a bunch of those:
> > [10170.448783] kworker/u8:6: page allocation stalls for 60720ms, order:0, mode:0x14000c2(GFP_KERNEL|__GFP_HIGHMEM), nodemask=(null)
> > [10170.448819] kworker/u8:6 cpuset=/ mems_allowed=0
> > [10170.448842] CPU: 3 PID: 13430 Comm: kworker/u8:6 Not tainted 4.12.0-rc7-00034-gdff47ed160bb #1
> > [10170.448846] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
> > [10170.448872] Workqueue: btrfs-endio btrfs_endio_helper
> > [10170.448910] [<c010de1c>] (unwind_backtrace) from [<c010adb8>] (show_stack+0x10/0x14)
> > [10170.448925] [<c010adb8>] (show_stack) from [<c0442b00>] (dump_stack+0x78/0x8c)
> > [10170.448942] [<c0442b00>] (dump_stack) from [<c01b0178>] (warn_alloc+0xc0/0x170)
> > [10170.448952] [<c01b0178>] (warn_alloc) from [<c01b0c3c>] (__alloc_pages_nodemask+0x97c/0xe30)
> > [10170.448964] [<c01b0c3c>] (__alloc_pages_nodemask) from [<c01e217c>] (__vmalloc_node_range+0x144/0x27c)
> > [10170.448976] [<c01e217c>] (__vmalloc_node_range) from [<c01e2550>] (__vmalloc_node.constprop.10+0x48/0x50)
> > [10170.448982] [<c01e2550>] (__vmalloc_node.constprop.10) from [<c01e25ec>] (vmalloc+0x2c/0x34)
> > [10170.448990] [<c01e25ec>] (vmalloc) from [<c038f7cc>] (zstd_alloc_workspace+0x6c/0xb8)
> > [10170.448997] [<c038f7cc>] (zstd_alloc_workspace) from [<c038fcb8>] (find_workspace+0x120/0x1f4)
> > [10170.449002] [<c038fcb8>] (find_workspace) from [<c038ff60>] (end_compressed_bio_read+0x1d4/0x3b0)
> > [10170.449016] [<c038ff60>] (end_compressed_bio_read) from [<c0130e14>] (process_one_work+0x1d8/0x3f0)
> > [10170.449026] [<c0130e14>] (process_one_work) from [<c0131a18>] (worker_thread+0x38/0x558)
> > [10170.449035] [<c0131a18>] (worker_thread) from [<c0136854>] (kthread+0x124/0x154)
> > [10170.449042] [<c0136854>] (kthread) from [<c01076f8>] (ret_from_fork+0x14/0x3c)
> >
> > which never happened with compress=lzo, and a 2GB RAM machine that runs 4
> > threads of various builds runs into memory pressure quite often.  On the
> > other hand, I used 4.11 for lzo so this needs more testing before I can
> > blame the zstd code.
> 
> I'm not sure what is causing the symptom of stalls in vmalloc(), but I
> think I know what is causing vmalloc() to be called so often. Its probably
> showing up for zstd and not lzo because it requires more memory.
> 
> find_workspace() allocates up to num_online_cpus() + 1 workspaces.
> free_workspace() will only keep num_online_cpus() workspaces. When
> (de)compressing we will allocate num_online_cpus() + 1 workspaces, then
> free one, and repeat. Instead, we can just keep num_online_cpus() + 1
> workspaces around, and never have to allocate/free another workspace in the
> common case.

That would be much better and probably was the original intention. And I
guess it improves performance when we don't have to do the extra alloc/free
rounds.

> I tested on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM. I mounted a
> BtrFS partition with -o compress-force={lzo,zlib,zstd} and logged whenever
> a workspace was allocated of freed. Then I copied vmlinux (527 MB) to the
> partition. Before the patch, during the copy it would allocate and free 5-6
> workspaces. After, it only allocated the initial 3. This held true for lzo,
> zlib, and zstd.
> 
> > I'm on linus:4.12-rc7 with only a handful of btrfs patches (v3 of Qu's chunk
> > check, some misc crap) -- I guess I should use at least btrfs-for-4.13.  Or
> > would you prefer full-blown next?
> 
> Whatever is convenient for you. The relevant code in BtrFS hasn't changed
> for a few months, so it shouldn't matter too much.
> 
> Signed-off-by: Nick Terrell <terrelln@fb.com>
> ---
>  fs/btrfs/compression.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 3beb0d0..1a0ef55 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -874,7 +874,7 @@ static void free_workspace(int type, struct list_head *workspace)
>  	int *free_ws			= &btrfs_comp_ws[idx].free_ws;
> 
>  	spin_lock(ws_lock);
> -	if (*free_ws < num_online_cpus()) {
> +	if (*free_ws <= num_online_cpus()) {
>  		list_add(workspace, idle_ws);
>  		(*free_ws)++;
>  		spin_unlock(ws_lock);

Please send it as a proper patch, thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH] btrfs: Keep one more workspace around
@ 2017-06-29 17:57 Nick Terrell
  2017-06-30  0:14 ` Adam Borowski
  2017-06-30  0:49 ` Omar Sandoval
  0 siblings, 2 replies; 18+ messages in thread
From: Nick Terrell @ 2017-06-29 17:57 UTC (permalink / raw)
  To: Nick Terrell
  Cc: kernel-team, Chris Mason, Yann Collet, linux-btrfs, Adam Borowski,
	David Sterba

find_workspace() allocates up to num_online_cpus() + 1 workspaces.
free_workspace() will only keep num_online_cpus() workspaces. When
(de)compressing we will allocate num_online_cpus() + 1 workspaces, then
free one, and repeat. Instead, we can just keep num_online_cpus() + 1
workspaces around, and never have to allocate/free another workspace in the
common case.

I tested on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM. I mounted a
BtrFS partition with -o compress-force={lzo,zlib,zstd} and logged whenever
a workspace was allocated or freed. Then I copied vmlinux (527 MB) to the
partition. Before the patch, during the copy it would allocate and free 5-6
workspaces. After, it only allocated the initial 3. This held true for lzo,
zlib, and zstd. The time it took to execute cp vmlinux /mnt/btrfs && sync
dropped from 1.70s to 1.44s with lzo compression, and from 2.04s to 1.80s
for zstd compression.

Signed-off-by: Nick Terrell <terrelln@fb.com>
---
 fs/btrfs/compression.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 3beb0d0..1a0ef55 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -874,7 +874,7 @@ static void free_workspace(int type, struct list_head *workspace)
 	int *free_ws			= &btrfs_comp_ws[idx].free_ws;

 	spin_lock(ws_lock);
-	if (*free_ws < num_online_cpus()) {
+	if (*free_ws <= num_online_cpus()) {
 		list_add(workspace, idle_ws);
 		(*free_ws)++;
 		spin_unlock(ws_lock);
--
2.9.3

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] btrfs: Keep one more workspace around
  2017-06-29 17:57 [PATCH] btrfs: Keep one more workspace around Nick Terrell
@ 2017-06-30  0:14 ` Adam Borowski
  2017-06-30  0:49 ` Omar Sandoval
  1 sibling, 0 replies; 18+ messages in thread
From: Adam Borowski @ 2017-06-30  0:14 UTC (permalink / raw)
  To: Nick Terrell
  Cc: kernel-team, Chris Mason, Yann Collet, linux-btrfs, David Sterba

On Thu, Jun 29, 2017 at 10:57:26AM -0700, Nick Terrell wrote:
> The time it took to execute cp vmlinux /mnt/btrfs && sync
> dropped from 1.70s to 1.44s with lzo compression, and from 2.04s to 1.80s
> for zstd compression.

> -	if (*free_ws < num_online_cpus()) {
> +	if (*free_ws <= num_online_cpus()) {

A simple, self-contained, one-character fix that gives a nice speed-up.
What about getting this for 4.13 or perhaps even 4.9?

Workspace flickering is not a very serious bug, but if we can restore wasted
write speed _this_ easily...


-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄⠀⠀⠀⠀ A master species delegates.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] btrfs: Keep one more workspace around
  2017-06-29 17:57 [PATCH] btrfs: Keep one more workspace around Nick Terrell
  2017-06-30  0:14 ` Adam Borowski
@ 2017-06-30  0:49 ` Omar Sandoval
  1 sibling, 0 replies; 18+ messages in thread
From: Omar Sandoval @ 2017-06-30  0:49 UTC (permalink / raw)
  To: Nick Terrell
  Cc: kernel-team, Chris Mason, Yann Collet, linux-btrfs, Adam Borowski,
	David Sterba

On Thu, Jun 29, 2017 at 10:57:26AM -0700, Nick Terrell wrote:
> find_workspace() allocates up to num_online_cpus() + 1 workspaces.
> free_workspace() will only keep num_online_cpus() workspaces. When
> (de)compressing we will allocate num_online_cpus() + 1 workspaces, then
> free one, and repeat. Instead, we can just keep num_online_cpus() + 1
> workspaces around, and never have to allocate/free another workspace in the
> common case.
> 
> I tested on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM. I mounted a
> BtrFS partition with -o compress-force={lzo,zlib,zstd} and logged whenever
> a workspace was allocated or freed. Then I copied vmlinux (527 MB) to the
> partition. Before the patch, during the copy it would allocate and free 5-6
> workspaces. After, it only allocated the initial 3. This held true for lzo,
> zlib, and zstd. The time it took to execute cp vmlinux /mnt/btrfs && sync
> dropped from 1.70s to 1.44s with lzo compression, and from 2.04s to 1.80s
> for zstd compression.

Good catch! It seems to me like it might be easier to just allocate them
all upfront anyways, but that's a battle for another day.

Reviewed-by: Omar Sandoval <osandov@fb.com>

> Signed-off-by: Nick Terrell <terrelln@fb.com>
> ---
>  fs/btrfs/compression.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 3beb0d0..1a0ef55 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -874,7 +874,7 @@ static void free_workspace(int type, struct list_head *workspace)
>  	int *free_ws			= &btrfs_comp_ws[idx].free_ws;
> 
>  	spin_lock(ws_lock);
> -	if (*free_ws < num_online_cpus()) {
> +	if (*free_ws <= num_online_cpus()) {
>  		list_add(workspace, idle_ws);
>  		(*free_ws)++;
>  		spin_unlock(ws_lock);
> --
> 2.9.3
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-06-30  0:49 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-22 22:01 [PATCH 1/4] lib: Add xxhash module Nick Terrell
2017-06-22 22:01 ` [PATCH 3/4] btrfs: Add zstd support Nick Terrell
2017-06-25 15:02   ` kbuild test robot
2017-06-25 19:03   ` kbuild test robot
2017-06-25 21:30     ` Adam Borowski
2017-06-26 12:12       ` David Sterba
2017-06-26 16:54         ` Nick Terrell
2017-06-27  4:18         ` [PATCH] lib/zstd: use div_u64() to let it build on 32-bit Adam Borowski
2017-06-27  5:27           ` Nick Terrell
2017-06-27 12:57             ` David Sterba
2017-06-28  5:29               ` Nick Terrell
2017-06-29  1:02             ` Adam Borowski
2017-06-29  3:01               ` [PATCH] btrfs: Keep one more workspace around Nick Terrell
2017-06-29 13:59                 ` David Sterba
2017-06-22 22:01 ` [PATCH 4/4] squashfs: Add zstd support Nick Terrell
  -- strict thread matches above, loose matches on Subject: below --
2017-06-29 17:57 [PATCH] btrfs: Keep one more workspace around Nick Terrell
2017-06-30  0:14 ` Adam Borowski
2017-06-30  0:49 ` Omar Sandoval
