From: Mike Christie <mchristi@redhat.com>
To: Shakeel Butt <shakeelb@google.com>,
Andrew Morton <akpm@linux-foundation.org>
Cc: linux-api@vger.kernel.org, idryomov@gmail.com,
Michal Hocko <mhocko@kernel.org>,
david@fromorbit.com, Linux MM <linux-mm@kvack.org>,
LKML <linux-kernel@vger.kernel.org>,
linux-scsi@vger.kernel.org,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
linux-block@vger.kernel.org, martin@urbackup.org,
Damien.LeMoal@wdc.com, Michal Hocko <mhocko@suse.com>,
Masato Suzuki <masato.suzuki@wdc.com>
Subject: Re: [PATCH] Add prctl support for controlling mem reclaim V4
Date: Fri, 24 Jan 2020 10:22:33 -0600
Message-ID: <5E2B19C9.6080907@redhat.com>
In-Reply-To: <CALvZod47XyD2x8TuZcb9PgeVY14JBwNhsUpN3RAeAt+RJJC=hg@mail.gmail.com>
On 12/05/2019 04:43 PM, Shakeel Butt wrote:
> On Mon, Nov 11, 2019 at 4:19 PM Mike Christie <mchristi@redhat.com> wrote:
>>
>> There are several storage drivers like dm-multipath, iscsi, tcmu-runner,
>> and nbd that have userspace components that can run in the IO path. For
>> example, iscsi and nbd's userspace daemons may need to recreate a socket
>> and/or send IO on it, and dm-multipath's daemon multipathd may need to
>> send SG IO or read/write IO to figure out the state of paths and re-set
>> them up.
>>
>> In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
>> memalloc_*_save/restore functions to control the allocation behavior,
>> but for userspace we would end up hitting an allocation that ended up
>> writing data back to the same device we are trying to allocate for.
>> The device is then in a state of deadlock, because to execute IO the
>> device needs to allocate memory, but to allocate memory the memory
>> layers want to execute IO to the device.
>>
>> Here is an example with nbd using a local userspace daemon that performs
>> network IO to a remote server. We are using XFS on top of the nbd device,
>> but it can happen with any FS or other modules layered on top of the nbd
>> device that can write out data to free memory. Here a nbd daemon helper
>> thread, msgr-worker-1, is performing a write/sendmsg on a socket to execute
>> a request. This kicks off a reclaim operation which results in a WRITE to
>> the nbd device and the nbd thread calling back into the mm layer.
>>
>> [ 1626.609191] msgr-worker-1 D 0 1026 1 0x00004000
>> [ 1626.609193] Call Trace:
>> [ 1626.609195] ? __schedule+0x29b/0x630
>> [ 1626.609197] ? wait_for_completion+0xe0/0x170
>> [ 1626.609198] schedule+0x30/0xb0
>> [ 1626.609200] schedule_timeout+0x1f6/0x2f0
>> [ 1626.609202] ? blk_finish_plug+0x21/0x2e
>> [ 1626.609204] ? _xfs_buf_ioapply+0x2e6/0x410
>> [ 1626.609206] ? wait_for_completion+0xe0/0x170
>> [ 1626.609208] wait_for_completion+0x108/0x170
>> [ 1626.609210] ? wake_up_q+0x70/0x70
>> [ 1626.609212] ? __xfs_buf_submit+0x12e/0x250
>> [ 1626.609214] ? xfs_bwrite+0x25/0x60
>> [ 1626.609215] xfs_buf_iowait+0x22/0xf0
>> [ 1626.609218] __xfs_buf_submit+0x12e/0x250
>> [ 1626.609220] xfs_bwrite+0x25/0x60
>> [ 1626.609222] xfs_reclaim_inode+0x2e8/0x310
>> [ 1626.609224] xfs_reclaim_inodes_ag+0x1b6/0x300
>> [ 1626.609227] xfs_reclaim_inodes_nr+0x31/0x40
>> [ 1626.609228] super_cache_scan+0x152/0x1a0
>> [ 1626.609231] do_shrink_slab+0x12c/0x2d0
>> [ 1626.609233] shrink_slab+0x9c/0x2a0
>> [ 1626.609235] shrink_node+0xd7/0x470
>> [ 1626.609237] do_try_to_free_pages+0xbf/0x380
>> [ 1626.609240] try_to_free_pages+0xd9/0x1f0
>> [ 1626.609245] __alloc_pages_slowpath+0x3a4/0xd30
>> [ 1626.609251] ? ___slab_alloc+0x238/0x560
>> [ 1626.609254] __alloc_pages_nodemask+0x30c/0x350
>> [ 1626.609259] skb_page_frag_refill+0x97/0xd0
>> [ 1626.609274] sk_page_frag_refill+0x1d/0x80
>> [ 1626.609279] tcp_sendmsg_locked+0x2bb/0xdd0
>> [ 1626.609304] tcp_sendmsg+0x27/0x40
>> [ 1626.609307] sock_sendmsg+0x54/0x60
>> [ 1626.609308] ___sys_sendmsg+0x29f/0x320
>> [ 1626.609313] ? sock_poll+0x66/0xb0
>> [ 1626.609318] ? ep_item_poll.isra.15+0x40/0xc0
>> [ 1626.609320] ? ep_send_events_proc+0xe6/0x230
>> [ 1626.609322] ? hrtimer_try_to_cancel+0x54/0xf0
>> [ 1626.609324] ? ep_read_events_proc+0xc0/0xc0
>> [ 1626.609326] ? _raw_write_unlock_irq+0xa/0x20
>> [ 1626.609327] ? ep_scan_ready_list.constprop.19+0x218/0x230
>> [ 1626.609329] ? __hrtimer_init+0xb0/0xb0
>> [ 1626.609331] ? _raw_spin_unlock_irq+0xa/0x20
>> [ 1626.609334] ? ep_poll+0x26c/0x4a0
>> [ 1626.609337] ? tcp_tsq_write.part.54+0xa0/0xa0
>> [ 1626.609339] ? release_sock+0x43/0x90
>> [ 1626.609341] ? _raw_spin_unlock_bh+0xa/0x20
>> [ 1626.609342] __sys_sendmsg+0x47/0x80
>> [ 1626.609347] do_syscall_64+0x5f/0x1c0
>> [ 1626.609349] ? prepare_exit_to_usermode+0x75/0xa0
>> [ 1626.609351] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> This patch adds a new prctl command that daemons can use after they have
>> done their initial setup, and before they start to do allocations that
>> are in the IO path. It sets the PF_MEMALLOC_NOIO and PF_LESS_THROTTLE
>> flags so both userspace block and FS threads can use it to avoid the
>> allocation recursion and try to avoid being throttled while
>> writing out data to free up memory.
>>
>> Signed-off-by: Mike Christie <mchristi@redhat.com>
>> Acked-by: Michal Hocko <mhocko@suse.com>
>> Tested-by: Masato Suzuki <masato.suzuki@wdc.com>
>> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
>
> I suppose this patch should be routed through MM tree, so, CCing Andrew.
>
Andrew and other mm/storage developers,
Do I need to handle anything else for this patch, or are there any other
concerns? Is this maybe something we want to talk about at a quick LSF
session?
I have retested it with Linus's current tree. It still applies cleanly
(just some offsets), and fixes the problem described above that we have
been hitting.
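
For illustration only, here is a minimal sketch of how a userspace daemon
might issue such a prctl after its initial setup and before it starts
allocating in the IO path. The quoted description does not name the
command, so the names PR_SET_IO_FLUSHER / PR_GET_IO_FLUSHER and the values
57 / 58 are an assumption taken from later mainline kernels, and
mark_io_flusher() is a hypothetical helper. In mainline the call needs
CAP_SYS_RESOURCE, so the helper reports failure instead of aborting.

```c
#include <errno.h>
#include <sys/prctl.h>

/*
 * Assumption: command names and values as found in later mainline
 * <linux/prctl.h>; the quoted patch description does not name them.
 */
#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57
#endif
#ifndef PR_GET_IO_FLUSHER
#define PR_GET_IO_FLUSHER 58
#endif

/*
 * Hypothetical helper: ask the kernel to treat the calling thread as
 * part of the IO path. Returns 0 on success or a negative errno value
 * (for example -EPERM without the capability, -EINVAL on kernels that
 * do not know the command).
 */
static int mark_io_flusher(void)
{
	/* Unused arguments are passed as explicit unsigned long zeros. */
	if (prctl(PR_SET_IO_FLUSHER, 1UL, 0UL, 0UL, 0UL) == -1)
		return -errno;
	return 0;
}

/*
 * Hypothetical helper: query the current state. Returns 1 if the flag
 * is set, 0 if it is not, or a negative errno value on failure.
 */
static int io_flusher_state(void)
{
	int ret = prctl(PR_GET_IO_FLUSHER, 0UL, 0UL, 0UL, 0UL);

	return ret == -1 ? -errno : ret;
}
```

A daemon would call mark_io_flusher() once in each thread that services
IO for the device, log a failure, and then enter its normal request loop.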