From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Maxim Patlasov <mpatlasov@virtuozzo.com>,
	Chris Mason <clm@fb.com>
Subject: [PATCH 4.9 02/83] btrfs: limit async_work allocation and worker func duration
Date: Wed,  4 Jan 2017 21:05:54 +0100
Message-ID: <20170104200446.656204806@linuxfoundation.org>
In-Reply-To: <20170104200446.541604386@linuxfoundation.org>

4.9-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Maxim Patlasov <mpatlasov@virtuozzo.com>

commit 2939e1a86f758b55cdba73e29397dd3d94df13bc upstream.

Problem statement: an unprivileged user who has read-write access to more
than one btrfs subvolume can easily consume all kernel memory (eventually
triggering the oom-killer).

Reproducer (./mkrmdir below essentially loops over mkdir/rmdir):

[root@kteam1 ~]# cat prep.sh

DEV=/dev/sdb
mkfs.btrfs -f $DEV
mount $DEV /mnt
for i in `seq 1 16`
do
	mkdir /mnt/$i
	btrfs subvolume create /mnt/SV_$i
	ID=`btrfs subvolume list /mnt |grep "SV_$i$" |cut -d ' ' -f 2`
	mount -t btrfs -o subvolid=$ID $DEV /mnt/$i
	chmod a+rwx /mnt/$i
done

[root@kteam1 ~]# sh prep.sh

[maxim@kteam1 ~]$ for i in `seq 1 16`; do ./mkrmdir /mnt/$i 2000 2000 & done

[root@kteam1 ~]# for i in `seq 1 4`; do grep "kmalloc-128" /proc/slabinfo | grep -v dma; sleep 60; done
kmalloc-128        10144  10144    128   32    1 : tunables    0    0    0 : slabdata    317    317      0
kmalloc-128       9992352 9992352    128   32    1 : tunables    0    0    0 : slabdata 312261 312261      0
kmalloc-128       24226752 24226752    128   32    1 : tunables    0    0    0 : slabdata 757086 757086      0
kmalloc-128       42754240 42754240    128   32    1 : tunables    0    0    0 : slabdata 1336070 1336070      0
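
The source of ./mkrmdir is not part of this message. As a minimal sketch
only, a helper along the following lines would generate the same kind of
load. The meaning of the two numeric arguments (taken here as "directories
per pass" and "number of passes"), the function name mkrmdir_loop and the
directory naming scheme are all assumptions, not the author's code.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for the reproducer's ./mkrmdir helper.
 * Creates ndirs directories under base, removes them again, and repeats
 * that iters times. Names include the pid so that concurrent instances
 * do not collide. Returns the number of mkdir+rmdir pairs completed,
 * or -1 on the first error.
 */
static long mkrmdir_loop(const char *base, int ndirs, int iters)
{
	char path[4096];
	long pairs = 0;
	long pid = (long)getpid();

	assert(base != NULL);

	for (int it = 0; it < iters; it++) {
		/* Creation pass: each mkdir adds delayed items in btrfs. */
		for (int d = 0; d < ndirs; d++) {
			int n = snprintf(path, sizeof(path),
					 "%s/mkrmdir.%ld.%d", base, pid, d);
			if (n < 0 || (size_t)n >= sizeof(path))
				return -1;
			if (mkdir(path, 0755) != 0 && errno != EEXIST)
				return -1;
		}
		/* Removal pass: same names, so the lengths are already checked. */
		for (int d = 0; d < ndirs; d++) {
			snprintf(path, sizeof(path),
				 "%s/mkrmdir.%ld.%d", base, pid, d);
			if (rmdir(path) != 0)
				return -1;
			pairs++;
		}
	}
	return pairs;
}
```

A small main() that parses argv and calls mkrmdir_loop(argv[1], ...) would
give the "./mkrmdir /mnt/$i 2000 2000" command line used above, assuming
that argument order.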

The huge kmalloc-128 counts above (over 42 million 128-byte objects, roughly
5 GiB, by the fourth sample) come from the enormous number of async_works
allocated and queued by btrfs_wq_run_delayed_node().

The problem is caused by btrfs_wq_run_delayed_node() queuing more and more
works once the number of delayed items reaches BTRFS_DELAYED_BACKGROUND. The
worker func (btrfs_async_run_delayed_root) processes at least
BTRFS_DELAYED_BATCH items (if they are present in the list). So the machinery
works as expected while the list is almost empty. As soon as the list grows,
the worker func starts to process more than one item at a time and takes
longer, so the chance of queuing more async_works than needed gets higher.

The problem above is worsened by another flaw in the delayed-inode
implementation: if an async_work was queued in the throttling branch (number
of items >= BTRFS_DELAYED_WRITEBACK), the corresponding worker func won't
quit until the number of items drops below BTRFS_DELAYED_BACKGROUND / 2. So
the func may occupy a CPU indefinitely (up to 30 seconds in my experiments):
while it is trying to drain the list, user activity may keep adding more and
more items to it.

The patch fixes both problems in a straightforward way: refuse to queue too
many works in btrfs_wq_run_delayed_node() and bail out of the worker func
once at least BTRFS_DELAYED_WRITEBACK items have been processed.
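
To see the effect of the two changes outside the kernel, here is a small
userspace model, offered as a sketch rather than as kernel code. The
constant values (512, 128, -1) are what I recall from
fs/btrfs/delayed-inode.c and fs/btrfs/async-thread.c and should be treated
as assumptions; struct model_wq, the model_* functions and the "refill"
parameter (items added by user activity per processed item) are
hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values; verify against the kernel tree before relying on them. */
#define DELAYED_WRITEBACK	512
#define DELAYED_BACKGROUND	128
#define NO_THRESHOLD		(-1)

/* Hypothetical stand-in for the "normal" queue of a btrfs_workqueue. */
struct model_wq {
	int pending;	/* works queued but not yet executed */
	int thresh;	/* queue threshold, or NO_THRESHOLD */
};

/* Same test as btrfs_workqueue_normal_congested() in the diff below. */
static bool model_wq_congested(const struct model_wq *wq)
{
	if (wq->thresh == NO_THRESHOLD)
		return false;
	return wq->pending > wq->thresh * 2;
}

/* Would btrfs_wq_run_delayed_node() allocate and queue another work? */
static bool model_should_queue(int items, const struct model_wq *wq,
			       bool patched)
{
	if (items < DELAYED_BACKGROUND)
		return false;
	if (patched && model_wq_congested(wq))
		return false;
	return true;
}

/*
 * Model of the worker loop. Each step retires one item while user
 * activity adds `refill` new ones. nr == 0 models the throttling branch.
 * Returns the number of items processed; max_steps caps the simulation
 * so that the unpatched, never-ending case still returns.
 */
static long model_worker(int items, int nr, int refill, bool patched,
			 long max_steps)
{
	long total_done = 0;

	assert(max_steps > 0);

	while (total_done < max_steps) {
		if (items < DELAYED_BACKGROUND / 2)
			break;		/* list drained far enough */
		items += refill - 1;
		total_done++;
		if (patched) {
			if (!((nr == 0 && total_done < DELAYED_WRITEBACK) ||
			      total_done < nr))
				break;
		} else {
			if (!(nr == 0 || total_done < nr))
				break;
		}
	}
	return total_done;
}
```

With refill == 1 the list never shrinks: the unpatched model runs until the
step cap, while the patched one stops after DELAYED_WRITEBACK items.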

Changed in v2: remove support of thresh == NO_THRESHOLD.

Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 fs/btrfs/async-thread.c  |   14 ++++++++++++++
 fs/btrfs/async-thread.h  |    1 +
 fs/btrfs/delayed-inode.c |    6 ++++--
 3 files changed, 19 insertions(+), 2 deletions(-)

--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -86,6 +86,20 @@ btrfs_work_owner(struct btrfs_work *work
 	return work->wq->fs_info;
 }
 
+bool btrfs_workqueue_normal_congested(struct btrfs_workqueue *wq)
+{
+	/*
+	 * We could compare wq->normal->pending with num_online_cpus()
+	 * to support "thresh == NO_THRESHOLD" case, but it requires
+	 * moving up atomic_inc/dec in thresh_queue/exec_hook. Let's
+	 * postpone it until someone needs the support of that case.
+	 */
+	if (wq->normal->thresh == NO_THRESHOLD)
+		return false;
+
+	return atomic_read(&wq->normal->pending) > wq->normal->thresh * 2;
+}
+
 BTRFS_WORK_HELPER(worker_helper);
 BTRFS_WORK_HELPER(delalloc_helper);
 BTRFS_WORK_HELPER(flush_delalloc_helper);
--- a/fs/btrfs/async-thread.h
+++ b/fs/btrfs/async-thread.h
@@ -84,4 +84,5 @@ void btrfs_workqueue_set_max(struct btrf
 void btrfs_set_work_high_priority(struct btrfs_work *work);
 struct btrfs_fs_info *btrfs_work_owner(struct btrfs_work *work);
 struct btrfs_fs_info *btrfs_workqueue_owner(struct __btrfs_workqueue *wq);
+bool btrfs_workqueue_normal_congested(struct btrfs_workqueue *wq);
 #endif
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1353,7 +1353,8 @@ release_path:
 	total_done++;
 
 	btrfs_release_prepared_delayed_node(delayed_node);
-	if (async_work->nr == 0 || total_done < async_work->nr)
+	if ((async_work->nr == 0 && total_done < BTRFS_DELAYED_WRITEBACK) ||
+	    total_done < async_work->nr)
 		goto again;
 
 free_path:
@@ -1369,7 +1370,8 @@ static int btrfs_wq_run_delayed_node(str
 {
 	struct btrfs_async_delayed_work *async_work;
 
-	if (atomic_read(&delayed_root->items) < BTRFS_DELAYED_BACKGROUND)
+	if (atomic_read(&delayed_root->items) < BTRFS_DELAYED_BACKGROUND ||
+	    btrfs_workqueue_normal_congested(fs_info->delayed_workers))
 		return 0;
 
 	async_work = kmalloc(sizeof(*async_work), GFP_NOFS);


