From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
stable@vger.kernel.org, Nikos Tsironis <ntsironis@arrikto.com>,
Ilias Tsitsimpis <iliastsi@arrikto.com>,
Mike Snitzer <snitzer@redhat.com>,
Sasha Levin <sashal@kernel.org>
Subject: [PATCH 4.14 45/63] dm snapshot: Fix excessive memory usage and workqueue stalls
Date: Thu, 24 Jan 2019 20:20:34 +0100 [thread overview]
Message-ID: <20190124190200.637308047@linuxfoundation.org> (raw)
In-Reply-To: <20190124190155.176570028@linuxfoundation.org>
4.14-stable review patch. If anyone has any objections, please let me know.
------------------
[ Upstream commit 721b1d98fb517ae99ab3b757021cf81db41e67be ]
kcopyd has no upper limit to the number of jobs one can allocate and
issue. Under certain workloads this can lead to excessive memory usage
and workqueue stalls. For example, when creating multiple dm-snapshot
targets with a 4K chunk size and then writing to the origin through the
page cache. Syncing the page cache causes a large number of BIOs to be
issued to the dm-snapshot origin target, which itself issues an even
larger (because of the BIO splitting taking place) number of kcopyd
jobs.
Running the following test, from the device mapper test suite [1],
dmtest run --suite snapshot -n many_snapshots_of_same_volume_N
, with 8 active snapshots, results in the kcopyd job slab cache growing
to 10G. Depending on the available system RAM this can lead to the OOM
killer killing user processes:
[463.492878] kthreadd invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP),
nodemask=(null), order=1, oom_score_adj=0
[463.492894] kthreadd cpuset=/ mems_allowed=0
[463.492948] CPU: 7 PID: 2 Comm: kthreadd Not tainted 4.19.0-rc7 #3
[463.492950] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[463.492952] Call Trace:
[463.492964] dump_stack+0x7d/0xbb
[463.492973] dump_header+0x6b/0x2fc
[463.492987] ? lockdep_hardirqs_on+0xee/0x190
[463.493012] oom_kill_process+0x302/0x370
[463.493021] out_of_memory+0x113/0x560
[463.493030] __alloc_pages_slowpath+0xf40/0x1020
[463.493055] __alloc_pages_nodemask+0x348/0x3c0
[463.493067] cache_grow_begin+0x81/0x8b0
[463.493072] ? cache_grow_begin+0x874/0x8b0
[463.493078] fallback_alloc+0x1e4/0x280
[463.493092] kmem_cache_alloc_node+0xd6/0x370
[463.493098] ? copy_process.part.31+0x1c5/0x20d0
[463.493105] copy_process.part.31+0x1c5/0x20d0
[463.493115] ? __lock_acquire+0x3cc/0x1550
[463.493121] ? __switch_to_asm+0x34/0x70
[463.493129] ? kthread_create_worker_on_cpu+0x70/0x70
[463.493135] ? finish_task_switch+0x90/0x280
[463.493165] _do_fork+0xe0/0x6d0
[463.493191] ? kthreadd+0x19f/0x220
[463.493233] kernel_thread+0x25/0x30
[463.493235] kthreadd+0x1bf/0x220
[463.493242] ? kthread_create_on_cpu+0x90/0x90
[463.493248] ret_from_fork+0x3a/0x50
[463.493279] Mem-Info:
[463.493285] active_anon:20631 inactive_anon:4831 isolated_anon:0
[463.493285] active_file:80216 inactive_file:80107 isolated_file:435
[463.493285] unevictable:0 dirty:51266 writeback:109372 unstable:0
[463.493285] slab_reclaimable:31191 slab_unreclaimable:3483521
[463.493285] mapped:526 shmem:4903 pagetables:1759 bounce:0
[463.493285] free:33623 free_pcp:2392 free_cma:0
...
[463.493489] Unreclaimable slab info:
[463.493513] Name Used Total
[463.493522] bio-6 1028KB 1028KB
[463.493525] bio-5 1028KB 1028KB
[463.493528] dm_snap_pending_exception 236783KB 243789KB
[463.493531] dm_exception 41KB 42KB
[463.493534] bio-4 1216KB 1216KB
[463.493537] bio-3 439396KB 439396KB
[463.493539] kcopyd_job 6973427KB 6973427KB
...
[463.494340] Out of memory: Kill process 1298 (ruby2.3) score 1 or sacrifice child
[463.494673] Killed process 1298 (ruby2.3) total-vm:435740kB, anon-rss:20180kB, file-rss:4kB, shmem-rss:0kB
[463.506437] oom_reaper: reaped process 1298 (ruby2.3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Moreover, issuing a large number of kcopyd jobs results in kcopyd
hogging the CPU, while processing them. As a result, processing of work
items, queued for execution on the same CPU as the currently running
kcopyd thread, is stalled for long periods of time, hurting performance.
Running the aforementioned test we get, in dmesg, messages like the
following:
[67501.194592] BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 27s!
[67501.195586] Showing busy workqueues and worker pools:
[67501.195591] workqueue events: flags=0x0
[67501.195597] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195611] pending: cache_reap
[67501.195641] workqueue mm_percpu_wq: flags=0x8
[67501.195645] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195656] pending: vmstat_update
[67501.195682] workqueue kblockd: flags=0x18
[67501.195687] pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256
[67501.195698] pending: blk_timeout_work
[67501.195753] workqueue kcopyd: flags=0x8
[67501.195757] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195768] pending: do_work [dm_mod]
[67501.195802] workqueue kcopyd: flags=0x8
[67501.195806] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195817] pending: do_work [dm_mod]
[67501.195834] workqueue kcopyd: flags=0x8
[67501.195838] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195848] pending: do_work [dm_mod]
[67501.195881] workqueue kcopyd: flags=0x8
[67501.195885] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195896] pending: do_work [dm_mod]
[67501.195920] workqueue kcopyd: flags=0x8
[67501.195924] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=2/256
[67501.195935] in-flight: 67:do_work [dm_mod]
[67501.195945] pending: do_work [dm_mod]
[67501.195961] pool 8: cpus=4 node=0 flags=0x0 nice=0 hung=27s workers=3 idle: 129 23765
The root cause for these issues is the way dm-snapshot uses kcopyd. In
particular, the lack of an explicit or implicit limit to the maximum
number of in-flight COW jobs. The merging path is not affected because
it implicitly limits the in-flight kcopyd jobs to one.
Fix these issues by using a semaphore to limit the maximum number of
in-flight kcopyd jobs. We grab the semaphore before allocating a new
kcopyd job in start_copy() and start_full_bio() and release it after the
job finishes in copy_callback().
The initial semaphore value is configurable through a module parameter,
to allow fine tuning the maximum number of in-flight COW jobs. Setting
this parameter to zero initializes the semaphore to INT_MAX.
A default value of 2048 maximum in-flight kcopyd jobs was chosen. This
value was decided experimentally as a trade-off between memory
consumption, stalling the kernel's workqueues and maintaining a high
enough throughput.
Re-running the aforementioned test:
* Workqueue stalls are eliminated
* kcopyd's job slab cache uses a maximum of 130MB
* The time taken by the test to write to the snapshot-origin target is
reduced from 05m20.48s to 03m26.38s
[1] https://github.com/jthornber/device-mapper-test-suite
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/md/dm-snap.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index a0613bd8ed00..b502debc6df3 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -19,6 +19,7 @@
#include <linux/vmalloc.h>
#include <linux/log2.h>
#include <linux/dm-kcopyd.h>
+#include <linux/semaphore.h>
#include "dm.h"
@@ -105,6 +106,9 @@ struct dm_snapshot {
/* The on disk metadata handler */
struct dm_exception_store *store;
+ /* Maximum number of in-flight COW jobs. */
+ struct semaphore cow_count;
+
struct dm_kcopyd_client *kcopyd_client;
/* Wait for events based on state_bits */
@@ -145,6 +149,19 @@ struct dm_snapshot {
#define RUNNING_MERGE 0
#define SHUTDOWN_MERGE 1
+/*
+ * Maximum number of chunks being copied on write.
+ *
+ * The value was decided experimentally as a trade-off between memory
+ * consumption, stalling the kernel's workqueues and maintaining a high enough
+ * throughput.
+ */
+#define DEFAULT_COW_THRESHOLD 2048
+
+static int cow_threshold = DEFAULT_COW_THRESHOLD;
+module_param_named(snapshot_cow_threshold, cow_threshold, int, 0644);
+MODULE_PARM_DESC(snapshot_cow_threshold, "Maximum number of chunks being copied on write");
+
DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(snapshot_copy_throttle,
"A percentage of time allocated for copy on write");
@@ -1189,6 +1206,8 @@ static int snapshot_ctr(struct dm_target *ti, unsigned int argc, char **argv)
goto bad_hash_tables;
}
+ sema_init(&s->cow_count, (cow_threshold > 0) ? cow_threshold : INT_MAX);
+
s->kcopyd_client = dm_kcopyd_client_create(&dm_kcopyd_throttle);
if (IS_ERR(s->kcopyd_client)) {
r = PTR_ERR(s->kcopyd_client);
@@ -1560,6 +1579,7 @@ static void copy_callback(int read_err, unsigned long write_err, void *context)
}
list_add(&pe->out_of_order_entry, lh);
}
+ up(&s->cow_count);
}
/*
@@ -1583,6 +1603,7 @@ static void start_copy(struct dm_snap_pending_exception *pe)
dest.count = src.count;
/* Hand over to kcopyd */
+ down(&s->cow_count);
dm_kcopyd_copy(s->kcopyd_client, &src, 1, &dest, 0, copy_callback, pe);
}
@@ -1602,6 +1623,7 @@ static void start_full_bio(struct dm_snap_pending_exception *pe,
pe->full_bio = bio;
pe->full_bio_end_io = bio->bi_end_io;
+ down(&s->cow_count);
callback_data = dm_kcopyd_prepare_callback(s->kcopyd_client,
copy_callback, pe);
--
2.19.1
next prev parent reply other threads:[~2019-01-24 19:59 UTC|newest]
Thread overview: 68+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-01-24 19:19 [PATCH 4.14 00/63] 4.14.96-stable review Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 4.14 01/63] ipv6: Consider sk_bound_dev_if when binding a socket to a v4 mapped address Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 4.14 02/63] mlxsw: spectrum: Disable lag port TX before removing it Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 4.14 03/63] mlxsw: spectrum_switchdev: Set PVID correctly during VLAN deletion Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 4.14 04/63] net, skbuff: do not prefer skb allocation fails early Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 4.14 05/63] qmi_wwan: add MTU default to qmap network interface Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 4.14 06/63] r8169: Add support for new Realtek Ethernet Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 4.14 07/63] ipv6: Take rcu_read_lock in __inet6_bind for mapped addresses Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 4.14 08/63] net: dsa: mv88x6xxx: mv88e6390 errata Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 4.14 09/63] gpio: pl061: Move irq_chip definition inside struct pl061 Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 4.14 10/63] platform/x86: asus-wmi: Tell the EC the OS will handle the display off hotkey Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 11/63] e1000e: allow non-monotonic SYSTIM readings Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 12/63] writeback: dont decrement wb->refcnt if !wb->bdi Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 13/63] serial: set suppress_bind_attrs flag only if builtin Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 14/63] ALSA: oxfw: add support for APOGEE duet FireWire Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 15/63] x86/mce: Fix -Wmissing-prototypes warnings Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 16/63] MIPS: SiByte: Enable swiotlb for SWARM, LittleSur and BigSur Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 17/63] arm64: perf: set suppress_bind_attrs flag to true Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 18/63] usb: gadget: udc: renesas_usb3: add a safety connection way for forced_b_device Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 19/63] selinux: always allow mounting submounts Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 20/63] rxe: IB_WR_REG_MR does not capture MRs iova field Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 21/63] jffs2: Fix use of uninitialized delayed_work, lockdep breakage Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 22/63] clk: imx: make mux parent strings const Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 23/63] pstore/ram: Do not treat empty buffers as valid Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 24/63] powerpc/xmon: Fix invocation inside lock region Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 25/63] powerpc/pseries/cpuidle: Fix preempt warning Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 26/63] media: firewire: Fix app_info parameter type in avc_ca{,_app}_info Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 27/63] media: venus: core: Set dma maximum segment size Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 28/63] net: call sk_dst_reset when set SO_DONTROUTE Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 29/63] scsi: target: use consistent left-aligned ASCII INQUIRY data Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 30/63] selftests: do not macro-expand failed assertion expressions Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 31/63] clk: imx6q: reset exclusive gates on init Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 32/63] arm64: Fix minor issues with the dcache_by_line_op macro Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 33/63] kconfig: fix file name and line number of warn_ignored_character() Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 34/63] kconfig: fix memory leak when EOF is encountered in quotation Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 35/63] mmc: atmel-mci: do not assume idle after atmci_request_end Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 36/63] btrfs: improve error handling of btrfs_add_link Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 37/63] tty/serial: do not free trasnmit buffer page under port lock Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 38/63] perf intel-pt: Fix error with config term "pt=0" Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 39/63] perf svghelper: Fix unchecked usage of strncpy() Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 40/63] perf parse-events: " Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 41/63] netfilter: ipt_CLUSTERIP: check MAC address when duplicate config is set Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 42/63] dm crypt: use u64 instead of sector_t to store iv_offset Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 43/63] dm kcopyd: Fix bug causing workqueue stalls Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 44/63] tools lib subcmd: Dont add the kernel sources to the include path Greg Kroah-Hartman
2019-01-24 19:20 ` Greg Kroah-Hartman [this message]
2019-01-24 19:20 ` [PATCH 4.14 46/63] quota: Lock s_umount in exclusive mode for Q_XQUOTA{ON,OFF} quotactls Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 47/63] clocksource/drivers/integrator-ap: Add missing of_node_put() Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 48/63] ALSA: bebob: fix model-id of unit for Apogee Ensemble Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 49/63] sysfs: Disable lockdep for driver bind/unbind files Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 50/63] IB/usnic: Fix potential deadlock Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 51/63] scsi: smartpqi: correct lun reset issues Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 52/63] scsi: smartpqi: call pqi_free_interrupts() in pqi_shutdown() Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 53/63] scsi: megaraid: fix out-of-bound array accesses Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 54/63] ocfs2: fix panic due to unrecovered local alloc Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 55/63] mm/page-writeback.c: dont break integrity writeback on ->writepage() error Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 56/63] mm/swap: use nr_node_ids for avail_lists in swap_info_struct Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 57/63] mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 58/63] nfs: fix a deadlock in nfs client initialization Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 59/63] ipmi:pci: Blacklist a Realtek "IPMI" device Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 60/63] cifs: allow disabling insecure dialects in the config Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 61/63] drm/i915/gvt: Fix mmap range check Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 62/63] PCI: dwc: Move interrupt acking into the proper callback Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 4.14 63/63] ipmi:ssif: Fix handling of multi-part return messages Greg Kroah-Hartman
2019-01-25 14:46 ` [PATCH 4.14 00/63] 4.14.96-stable review shuah
2019-01-25 16:25 ` Naresh Kamboju
2019-01-25 23:18 ` Guenter Roeck
2019-01-26 12:08 ` Jon Hunter
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190124190200.637308047@linuxfoundation.org \
--to=gregkh@linuxfoundation.org \
--cc=iliastsi@arrikto.com \
--cc=linux-kernel@vger.kernel.org \
--cc=ntsironis@arrikto.com \
--cc=sashal@kernel.org \
--cc=snitzer@redhat.com \
--cc=stable@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).