All of lore.kernel.org
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Nikos Tsironis <ntsironis@arrikto.com>,
	Ilias Tsitsimpis <iliastsi@arrikto.com>,
	Mike Snitzer <snitzer@redhat.com>,
	Sasha Levin <sashal@kernel.org>
Subject: [PATCH 3.18 48/52] dm snapshot: Fix excessive memory usage and workqueue stalls
Date: Thu, 24 Jan 2019 20:20:12 +0100	[thread overview]
Message-ID: <20190124190150.637871706@linuxfoundation.org> (raw)
In-Reply-To: <20190124190140.879495253@linuxfoundation.org>

3.18-stable review patch.  If anyone has any objections, please let me know.

------------------

[ Upstream commit 721b1d98fb517ae99ab3b757021cf81db41e67be ]

kcopyd has no upper limit to the number of jobs one can allocate and
issue. Under certain workloads this can lead to excessive memory usage
and workqueue stalls. For example, when creating multiple dm-snapshot
targets with a 4K chunk size and then writing to the origin through the
page cache. Syncing the page cache causes a large number of BIOs to be
issued to the dm-snapshot origin target, which itself issues an even
larger (because of the BIO splitting taking place) number of kcopyd
jobs.

Running the following test, from the device mapper test suite [1],

  dmtest run --suite snapshot -n many_snapshots_of_same_volume_N

, with 8 active snapshots, results in the kcopyd job slab cache growing
to 10G. Depending on the available system RAM this can lead to the OOM
killer killing user processes:

[463.492878] kthreadd invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP),
              nodemask=(null), order=1, oom_score_adj=0
[463.492894] kthreadd cpuset=/ mems_allowed=0
[463.492948] CPU: 7 PID: 2 Comm: kthreadd Not tainted 4.19.0-rc7 #3
[463.492950] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[463.492952] Call Trace:
[463.492964]  dump_stack+0x7d/0xbb
[463.492973]  dump_header+0x6b/0x2fc
[463.492987]  ? lockdep_hardirqs_on+0xee/0x190
[463.493012]  oom_kill_process+0x302/0x370
[463.493021]  out_of_memory+0x113/0x560
[463.493030]  __alloc_pages_slowpath+0xf40/0x1020
[463.493055]  __alloc_pages_nodemask+0x348/0x3c0
[463.493067]  cache_grow_begin+0x81/0x8b0
[463.493072]  ? cache_grow_begin+0x874/0x8b0
[463.493078]  fallback_alloc+0x1e4/0x280
[463.493092]  kmem_cache_alloc_node+0xd6/0x370
[463.493098]  ? copy_process.part.31+0x1c5/0x20d0
[463.493105]  copy_process.part.31+0x1c5/0x20d0
[463.493115]  ? __lock_acquire+0x3cc/0x1550
[463.493121]  ? __switch_to_asm+0x34/0x70
[463.493129]  ? kthread_create_worker_on_cpu+0x70/0x70
[463.493135]  ? finish_task_switch+0x90/0x280
[463.493165]  _do_fork+0xe0/0x6d0
[463.493191]  ? kthreadd+0x19f/0x220
[463.493233]  kernel_thread+0x25/0x30
[463.493235]  kthreadd+0x1bf/0x220
[463.493242]  ? kthread_create_on_cpu+0x90/0x90
[463.493248]  ret_from_fork+0x3a/0x50
[463.493279] Mem-Info:
[463.493285] active_anon:20631 inactive_anon:4831 isolated_anon:0
[463.493285]  active_file:80216 inactive_file:80107 isolated_file:435
[463.493285]  unevictable:0 dirty:51266 writeback:109372 unstable:0
[463.493285]  slab_reclaimable:31191 slab_unreclaimable:3483521
[463.493285]  mapped:526 shmem:4903 pagetables:1759 bounce:0
[463.493285]  free:33623 free_pcp:2392 free_cma:0
...
[463.493489] Unreclaimable slab info:
[463.493513] Name                      Used          Total
[463.493522] bio-6                   1028KB       1028KB
[463.493525] bio-5                   1028KB       1028KB
[463.493528] dm_snap_pending_exception     236783KB     243789KB
[463.493531] dm_exception              41KB         42KB
[463.493534] bio-4                   1216KB       1216KB
[463.493537] bio-3                 439396KB     439396KB
[463.493539] kcopyd_job           6973427KB    6973427KB
...
[463.494340] Out of memory: Kill process 1298 (ruby2.3) score 1 or sacrifice child
[463.494673] Killed process 1298 (ruby2.3) total-vm:435740kB, anon-rss:20180kB, file-rss:4kB, shmem-rss:0kB
[463.506437] oom_reaper: reaped process 1298 (ruby2.3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Moreover, issuing a large number of kcopyd jobs results in kcopyd
hogging the CPU, while processing them. As a result, processing of work
items, queued for execution on the same CPU as the currently running
kcopyd thread, is stalled for long periods of time, hurting performance.
Running the aforementioned test we get, in dmesg, messages like the
following:

[67501.194592] BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 27s!
[67501.195586] Showing busy workqueues and worker pools:
[67501.195591] workqueue events: flags=0x0
[67501.195597]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195611]     pending: cache_reap
[67501.195641] workqueue mm_percpu_wq: flags=0x8
[67501.195645]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195656]     pending: vmstat_update
[67501.195682] workqueue kblockd: flags=0x18
[67501.195687]   pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256
[67501.195698]     pending: blk_timeout_work
[67501.195753] workqueue kcopyd: flags=0x8
[67501.195757]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195768]     pending: do_work [dm_mod]
[67501.195802] workqueue kcopyd: flags=0x8
[67501.195806]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195817]     pending: do_work [dm_mod]
[67501.195834] workqueue kcopyd: flags=0x8
[67501.195838]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195848]     pending: do_work [dm_mod]
[67501.195881] workqueue kcopyd: flags=0x8
[67501.195885]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
[67501.195896]     pending: do_work [dm_mod]
[67501.195920] workqueue kcopyd: flags=0x8
[67501.195924]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=2/256
[67501.195935]     in-flight: 67:do_work [dm_mod]
[67501.195945]     pending: do_work [dm_mod]
[67501.195961] pool 8: cpus=4 node=0 flags=0x0 nice=0 hung=27s workers=3 idle: 129 23765

The root cause for these issues is the way dm-snapshot uses kcopyd. In
particular, the lack of an explicit or implicit limit to the maximum
number of in-flight COW jobs. The merging path is not affected because
it implicitly limits the in-flight kcopyd jobs to one.

Fix these issues by using a semaphore to limit the maximum number of
in-flight kcopyd jobs. We grab the semaphore before allocating a new
kcopyd job in start_copy() and start_full_bio() and release it after the
job finishes in copy_callback().

The initial semaphore value is configurable through a module parameter,
to allow fine tuning the maximum number of in-flight COW jobs. Setting
this parameter to zero initializes the semaphore to INT_MAX.

A default value of 2048 maximum in-flight kcopyd jobs was chosen. This
value was decided experimentally as a trade-off between memory
consumption, stalling the kernel's workqueues and maintaining a high
enough throughput.

Re-running the aforementioned test:

  * Workqueue stalls are eliminated
  * kcopyd's job slab cache uses a maximum of 130MB
  * The time taken by the test to write to the snapshot-origin target is
    reduced from 05m20.48s to 03m26.38s

[1] https://github.com/jthornber/device-mapper-test-suite

Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/md/dm-snap.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index b91b39caf7a9..0a304e7de260 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -19,6 +19,7 @@
 #include <linux/vmalloc.h>
 #include <linux/log2.h>
 #include <linux/dm-kcopyd.h>
+#include <linux/semaphore.h>
 
 #include "dm.h"
 
@@ -98,6 +99,9 @@ struct dm_snapshot {
 	/* The on disk metadata handler */
 	struct dm_exception_store *store;
 
+	/* Maximum number of in-flight COW jobs. */
+	struct semaphore cow_count;
+
 	struct dm_kcopyd_client *kcopyd_client;
 
 	/* Wait for events based on state_bits */
@@ -138,6 +142,19 @@ struct dm_snapshot {
 #define RUNNING_MERGE          0
 #define SHUTDOWN_MERGE         1
 
+/*
+ * Maximum number of chunks being copied on write.
+ *
+ * The value was decided experimentally as a trade-off between memory
+ * consumption, stalling the kernel's workqueues and maintaining a high enough
+ * throughput.
+ */
+#define DEFAULT_COW_THRESHOLD 2048
+
+static int cow_threshold = DEFAULT_COW_THRESHOLD;
+module_param_named(snapshot_cow_threshold, cow_threshold, int, 0644);
+MODULE_PARM_DESC(snapshot_cow_threshold, "Maximum number of chunks being copied on write");
+
 DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(snapshot_copy_throttle,
 		"A percentage of time allocated for copy on write");
 
@@ -1173,6 +1190,8 @@ static int snapshot_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 		goto bad_hash_tables;
 	}
 
+	sema_init(&s->cow_count, (cow_threshold > 0) ? cow_threshold : INT_MAX);
+
 	s->kcopyd_client = dm_kcopyd_client_create(&dm_kcopyd_throttle);
 	if (IS_ERR(s->kcopyd_client)) {
 		r = PTR_ERR(s->kcopyd_client);
@@ -1545,6 +1564,7 @@ static void copy_callback(int read_err, unsigned long write_err, void *context)
 		}
 		list_add(&pe->out_of_order_entry, lh);
 	}
+	up(&s->cow_count);
 }
 
 /*
@@ -1568,6 +1588,7 @@ static void start_copy(struct dm_snap_pending_exception *pe)
 	dest.count = src.count;
 
 	/* Hand over to kcopyd */
+	down(&s->cow_count);
 	dm_kcopyd_copy(s->kcopyd_client, &src, 1, &dest, 0, copy_callback, pe);
 }
 
@@ -1588,6 +1609,7 @@ static void start_full_bio(struct dm_snap_pending_exception *pe,
 	pe->full_bio_end_io = bio->bi_end_io;
 	pe->full_bio_private = bio->bi_private;
 
+	down(&s->cow_count);
 	callback_data = dm_kcopyd_prepare_callback(s->kcopyd_client,
 						   copy_callback, pe);
 
-- 
2.19.1




  parent reply	other threads:[~2019-01-24 20:17 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-01-24 19:19 [PATCH 3.18 00/52] 3.18.133-stable review Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 01/52] sparc32: Fix inverted invalid_frame_pointer checks on sigreturns Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 02/52] CIFS: Do not hide EINTR after sending network packets Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 03/52] cifs: Fix potential OOB access of lock element array Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 04/52] usb: cdc-acm: send ZLP for Telit 3G Intel based modems Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 05/52] USB: storage: dont insert sane sense for SPC3+ when bad sense specified Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 06/52] USB: storage: add quirk for SMI SM3350 Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 07/52] slab: alien caches must not be initialized if the allocation of the alien cache failed Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 08/52] ACPI: power: Skip duplicate power resource references in _PRx Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 09/52] i2c: dev: prevent adapter retries and timeout being set as minus value Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 10/52] crypto: cts - fix crash on short inputs Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 11/52] sunrpc: use-after-free in svc_process_common() Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 12/52] tty/ldsem: Wake up readers after timed out down_write() Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 13/52] can: gw: ensure DLC boundaries after CAN frame modification Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 14/52] media: em28xx: Fix misplaced reset of dev->v4l::field_count Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 15/52] ipv6: fix kernel-infoleak in ipv6_local_error() Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 16/52] packet: Do not leak dev refcounts on error exit Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 17/52] net: bridge: fix a bug on using a neighbour cache entry without checking its state Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 18/52] crypto: authenc - fix parsing key with misaligned rta_len Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 19/52] btrfs: wait on ordered extents on abort cleanup Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 20/52] Yama: Check for pid death before checking ancestry Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 21/52] scsi: sd: Fix cache_type_store() Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 22/52] mfd: tps6586x: Handle interrupts on suspend Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 23/52] Disable MSI also when pcie-octeon.pcie_disable on Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 24/52] omap2fb: Fix stack memory disclosure Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 25/52] media: vivid: fix error handling of kthread_run Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 26/52] media: vivid: set min width/height to a value > 0 Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 27/52] media: vb2: vb2_mmap: move lock up Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 28/52] sunrpc: handle ENOMEM in rpcb_getport_async Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 29/52] selinux: fix GPF on invalid policy Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 30/52] sctp: allocate sctp_sockaddr_entry with kzalloc Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 31/52] block/loop: Use global lock for ioctl() operation Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 32/52] drm/fb-helper: Ignore the value of fb_var_screeninfo.pixclock Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 33/52] media: vb2: be sure to unlock mutex on errors Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 34/52] r8169: Add support for new Realtek Ethernet Greg Kroah-Hartman
2019-01-24 19:19 ` [PATCH 3.18 35/52] MIPS: SiByte: Enable swiotlb for SWARM, LittleSur and BigSur Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 3.18 36/52] jffs2: Fix use of uninitialized delayed_work, lockdep breakage Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 3.18 37/52] pstore/ram: Do not treat empty buffers as valid Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 3.18 38/52] powerpc/pseries/cpuidle: Fix preempt warning Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 3.18 39/52] media: firewire: Fix app_info parameter type in avc_ca{,_app}_info Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 3.18 40/52] net: call sk_dst_reset when set SO_DONTROUTE Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 3.18 41/52] scsi: target: use consistent left-aligned ASCII INQUIRY data Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 3.18 42/52] clk: imx6q: reset exclusive gates on init Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 3.18 43/52] kconfig: fix memory leak when EOF is encountered in quotation Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 3.18 44/52] mmc: atmel-mci: do not assume idle after atmci_request_end Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 3.18 45/52] perf svghelper: Fix unchecked usage of strncpy() Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 3.18 46/52] perf parse-events: " Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 3.18 47/52] dm kcopyd: Fix bug causing workqueue stalls Greg Kroah-Hartman
2019-01-24 19:20 ` Greg Kroah-Hartman [this message]
2019-01-24 19:20 ` [PATCH 3.18 49/52] ALSA: bebob: fix model-id of unit for Apogee Ensemble Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 3.18 50/52] sysfs: Disable lockdep for driver bind/unbind files Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 3.18 51/52] ocfs2: fix panic due to unrecovered local alloc Greg Kroah-Hartman
2019-01-24 19:20 ` [PATCH 3.18 52/52] mm, proc: be more verbose about unstable VMA flags in /proc/<pid>/smaps Greg Kroah-Hartman
2019-01-25 14:39 ` [PATCH 3.18 00/52] 3.18.133-stable review shuah
2019-01-25 23:16 ` Guenter Roeck

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190124190150.637871706@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=iliastsi@arrikto.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=ntsironis@arrikto.com \
    --cc=sashal@kernel.org \
    --cc=snitzer@redhat.com \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.