From: Andrew Morton <akpm@linux-foundation.org>
To: Ming Lei <ming.lei@canonical.com>
Cc: linux-kernel@vger.kernel.org,
Alan Stern <stern@rowland.harvard.edu>,
Oliver Neukum <oneukum@suse.de>, Minchan Kim <minchan@kernel.org>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
"Rafael J. Wysocki" <rjw@sisk.pl>, Jens Axboe <axboe@kernel.dk>,
"David S. Miller" <davem@davemloft.net>,
netdev@vger.kernel.org, linux-usb@vger.kernel.org,
linux-pm@vger.kernel.org, linux-mm@kvack.org,
Jiri Kosina <jiri.kosina@suse.com>, Mel Gorman <mel@csn.ul.ie>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
Michal Hocko <mhocko@suse.cz>, Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH v4 1/6] mm: teach mm by current context info to not do I/O during memory allocation
Date: Tue, 6 Nov 2012 15:23:54 -0800 [thread overview]
Message-ID: <20121106152354.90150a3b.akpm@linux-foundation.org> (raw)
In-Reply-To: <1351931714-11689-2-git-send-email-ming.lei@canonical.com>
On Sat, 3 Nov 2012 16:35:09 +0800
Ming Lei <ming.lei@canonical.com> wrote:
> This patch introduces PF_MEMALLOC_NOIO on process flag('flags' field of
> 'struct task_struct'), so that the flag can be set by one task
> to avoid doing I/O inside memory allocation in the task's context.
>
> The patch trys to solve one deadlock problem caused by block device,
> and the problem may happen at least in the below situations:
>
> - during block device runtime resume, if memory allocation with
> GFP_KERNEL is called inside runtime resume callback of any one
> of its ancestors(or the block device itself), the deadlock may be
> triggered inside the memory allocation since it might not complete
> until the block device becomes active and the involed page I/O finishes.
> The situation is pointed out first by Alan Stern. It is not a good
> approach to convert all GFP_KERNEL[1] in the path into GFP_NOIO because
> several subsystems may be involved(for example, PCI, USB and SCSI may
> be involved for usb mass stoarage device, network devices involved too
> in the iSCSI case)
>
> - during block device runtime suspend, because runtime resume need
> to wait for completion of concurrent runtime suspend.
>
> - during error handling of usb mass storage deivce, USB bus reset
> will be put on the device, so there shouldn't have any
> memory allocation with GFP_KERNEL during USB bus reset, otherwise
> the deadlock similar with above may be triggered. Unfortunately, any
> usb device may include one mass storage interface in theory, so it
> requires all usb interface drivers to handle the situation. In fact,
> most usb drivers don't know how to handle bus reset on the device
> and don't provide .pre_set() and .post_reset() callback at all, so
> USB core has to unbind and bind driver for these devices. So it
> is still not practical to resort to GFP_NOIO for solving the problem.
>
> Also the introduced solution can be used by block subsystem or block
> drivers too, for example, set the PF_MEMALLOC_NOIO flag before doing
> actual I/O transfer.
>
> It is not a good idea to convert all these GFP_KERNEL in the
> affected path into GFP_NOIO because these functions doing that may be
> implemented as library and will be called in many other contexts.
>
> In fact, memalloc_noio() can convert some of current static GFP_NOIO
> allocation into GFP_KERNEL back in other non-affected contexts, at least
> almost all GFP_NOIO in USB subsystem can be converted into GFP_KERNEL
> after applying the approach and make allocation with GFP_IO
> only happen in runtime resume/bus reset/block I/O transfer contexts
> generally.
It's unclear from the description why we're also clearing __GFP_FS in
this situation.
If we can avoid doing this then there will be a very small gain: there
are some situations in which a filesystem can clean pagecache without
performing I/O.
It doesn't appear that the patch will add overhead to the alloc/free
hotpaths, which is good.
>
> ...
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1805,6 +1805,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
> #define PF_FROZEN 0x00010000 /* frozen for system suspend */
> #define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */
> #define PF_KSWAPD 0x00040000 /* I am kswapd */
> +#define PF_MEMALLOC_NOIO 0x00080000 /* Allocating memory without IO involved */
> #define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */
> #define PF_KTHREAD 0x00200000 /* I am a kernel thread */
> #define PF_RANDOMIZE 0x00400000 /* randomize virtual address space */
> @@ -1842,6 +1843,15 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
> #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
> #define used_math() tsk_used_math(current)
>
> +#define memalloc_noio() (current->flags & PF_MEMALLOC_NOIO)
> +#define memalloc_noio_save(flag) do { \
> + (flag) = current->flags & PF_MEMALLOC_NOIO; \
> + current->flags |= PF_MEMALLOC_NOIO; \
> +} while (0)
> +#define memalloc_noio_restore(flag) do { \
> + current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flag; \
> +} while (0)
> +
Again with the ghastly macros. Please, do this properly in regular old
C, as previously discussed. It really doesn't matter what daft things
local_irq_save() did 20 years ago. Just do it right!
Also, you can probably put the unlikely() inside memalloc_noio() and
avoid repeating it at all the callsites.
And it might be neater to do:
/*
* Nice comment goes here
*/
static inline gfp_t memalloc_noio_flags(gfp_t flags)
{
if (unlikely(current->flags & PF_MEMALLOC_NOIO))
flags &= ~GFP_IOFS;
return flags;
}
> * task->jobctl flags
> */
>
> ...
>
> @@ -2304,6 +2304,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> .gfp_mask = sc.gfp_mask,
> };
>
> + if (unlikely(memalloc_noio())) {
> + gfp_mask &= ~GFP_IOFS;
> + sc.gfp_mask = gfp_mask;
> + shrink.gfp_mask = sc.gfp_mask;
> + }
We can avoid writing to shrink.gfp_mask twice. And maybe sc.gfp_mask
as well. Unclear, I didn't think about it too hard ;)
> throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
>
> /*
>
> ...
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
WARNING: multiple messages have this Message-ID (diff)
From: Andrew Morton <akpm@linux-foundation.org>
To: Ming Lei <ming.lei@canonical.com>
Cc: linux-kernel@vger.kernel.org,
Alan Stern <stern@rowland.harvard.edu>,
Oliver Neukum <oneukum@suse.de>, Minchan Kim <minchan@kernel.org>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
"Rafael J. Wysocki" <rjw@sisk.pl>, Jens Axboe <axboe@kernel.dk>,
"David S. Miller" <davem@davemloft.net>,
netdev@vger.kernel.org, linux-usb@vger.kernel.org,
linux-pm@vger.kernel.org, linux-mm@kvack.org,
Jiri Kosina <jiri.kosina@suse.com>, Mel Gorman <mel@csn.ul.ie>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
Michal Hocko <mhocko@suse.cz>, Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH v4 1/6] mm: teach mm by current context info to not do I/O during memory allocation
Date: Tue, 6 Nov 2012 15:23:54 -0800 [thread overview]
Message-ID: <20121106152354.90150a3b.akpm@linux-foundation.org> (raw)
In-Reply-To: <1351931714-11689-2-git-send-email-ming.lei@canonical.com>
On Sat, 3 Nov 2012 16:35:09 +0800
Ming Lei <ming.lei@canonical.com> wrote:
> This patch introduces PF_MEMALLOC_NOIO on process flag('flags' field of
> 'struct task_struct'), so that the flag can be set by one task
> to avoid doing I/O inside memory allocation in the task's context.
>
> The patch trys to solve one deadlock problem caused by block device,
> and the problem may happen at least in the below situations:
>
> - during block device runtime resume, if memory allocation with
> GFP_KERNEL is called inside runtime resume callback of any one
> of its ancestors(or the block device itself), the deadlock may be
> triggered inside the memory allocation since it might not complete
> until the block device becomes active and the involed page I/O finishes.
> The situation is pointed out first by Alan Stern. It is not a good
> approach to convert all GFP_KERNEL[1] in the path into GFP_NOIO because
> several subsystems may be involved(for example, PCI, USB and SCSI may
> be involved for usb mass stoarage device, network devices involved too
> in the iSCSI case)
>
> - during block device runtime suspend, because runtime resume need
> to wait for completion of concurrent runtime suspend.
>
> - during error handling of usb mass storage deivce, USB bus reset
> will be put on the device, so there shouldn't have any
> memory allocation with GFP_KERNEL during USB bus reset, otherwise
> the deadlock similar with above may be triggered. Unfortunately, any
> usb device may include one mass storage interface in theory, so it
> requires all usb interface drivers to handle the situation. In fact,
> most usb drivers don't know how to handle bus reset on the device
> and don't provide .pre_set() and .post_reset() callback at all, so
> USB core has to unbind and bind driver for these devices. So it
> is still not practical to resort to GFP_NOIO for solving the problem.
>
> Also the introduced solution can be used by block subsystem or block
> drivers too, for example, set the PF_MEMALLOC_NOIO flag before doing
> actual I/O transfer.
>
> It is not a good idea to convert all these GFP_KERNEL in the
> affected path into GFP_NOIO because these functions doing that may be
> implemented as library and will be called in many other contexts.
>
> In fact, memalloc_noio() can convert some of current static GFP_NOIO
> allocation into GFP_KERNEL back in other non-affected contexts, at least
> almost all GFP_NOIO in USB subsystem can be converted into GFP_KERNEL
> after applying the approach and make allocation with GFP_IO
> only happen in runtime resume/bus reset/block I/O transfer contexts
> generally.
It's unclear from the description why we're also clearing __GFP_FS in
this situation.
If we can avoid doing this then there will be a very small gain: there
are some situations in which a filesystem can clean pagecache without
performing I/O.
It doesn't appear that the patch will add overhead to the alloc/free
hotpaths, which is good.
>
> ...
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1805,6 +1805,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
> #define PF_FROZEN 0x00010000 /* frozen for system suspend */
> #define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */
> #define PF_KSWAPD 0x00040000 /* I am kswapd */
> +#define PF_MEMALLOC_NOIO 0x00080000 /* Allocating memory without IO involved */
> #define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */
> #define PF_KTHREAD 0x00200000 /* I am a kernel thread */
> #define PF_RANDOMIZE 0x00400000 /* randomize virtual address space */
> @@ -1842,6 +1843,15 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
> #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
> #define used_math() tsk_used_math(current)
>
> +#define memalloc_noio() (current->flags & PF_MEMALLOC_NOIO)
> +#define memalloc_noio_save(flag) do { \
> + (flag) = current->flags & PF_MEMALLOC_NOIO; \
> + current->flags |= PF_MEMALLOC_NOIO; \
> +} while (0)
> +#define memalloc_noio_restore(flag) do { \
> + current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flag; \
> +} while (0)
> +
Again with the ghastly macros. Please, do this properly in regular old
C, as previously discussed. It really doesn't matter what daft things
local_irq_save() did 20 years ago. Just do it right!
Also, you can probably put the unlikely() inside memalloc_noio() and
avoid repeating it at all the callsites.
And it might be neater to do:
/*
* Nice comment goes here
*/
static inline gfp_t memalloc_noio_flags(gfp_t flags)
{
if (unlikely(current->flags & PF_MEMALLOC_NOIO))
flags &= ~GFP_IOFS;
return flags;
}
> * task->jobctl flags
> */
>
> ...
>
> @@ -2304,6 +2304,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> .gfp_mask = sc.gfp_mask,
> };
>
> + if (unlikely(memalloc_noio())) {
> + gfp_mask &= ~GFP_IOFS;
> + sc.gfp_mask = gfp_mask;
> + shrink.gfp_mask = sc.gfp_mask;
> + }
We can avoid writing to shrink.gfp_mask twice. And maybe sc.gfp_mask
as well. Unclear, I didn't think about it too hard ;)
> throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
>
> /*
>
> ...
>
next prev parent reply other threads:[~2012-11-06 23:23 UTC|newest]
Thread overview: 34+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-11-03 8:35 [PATCH v4 0/6] solve deadlock caused by memory allocation with I/O Ming Lei
2012-11-03 8:35 ` Ming Lei
2012-11-03 8:35 ` [PATCH v4 1/6] mm: teach mm by current context info to not do I/O during memory allocation Ming Lei
2012-11-03 8:35 ` Ming Lei
2012-11-06 23:23 ` Andrew Morton [this message]
2012-11-06 23:23 ` Andrew Morton
2012-11-07 3:11 ` Ming Lei
2012-11-07 3:11 ` Ming Lei
2012-11-07 3:48 ` Andrew Morton
2012-11-07 3:48 ` Andrew Morton
2012-11-07 4:35 ` Ming Lei
2012-11-07 4:35 ` Ming Lei
2012-11-03 8:35 ` [PATCH v4 2/6] PM / Runtime: introduce pm_runtime_set_memalloc_noio() Ming Lei
2012-11-03 8:35 ` Ming Lei
[not found] ` <1351931714-11689-3-git-send-email-ming.lei-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>
2012-11-06 23:24 ` Andrew Morton
2012-11-06 23:24 ` Andrew Morton
2012-11-06 23:24 ` Andrew Morton
2012-11-07 3:32 ` Ming Lei
2012-11-07 3:32 ` Ming Lei
2012-11-03 8:35 ` [PATCH v4 3/6] block/genhd.c: apply pm_runtime_set_memalloc_noio on block devices Ming Lei
2012-11-03 8:35 ` Ming Lei
2012-11-06 23:24 ` Andrew Morton
2012-11-06 23:24 ` Andrew Morton
2012-11-03 8:35 ` [PATCH v4 4/6] net/core: apply pm_runtime_set_memalloc_noio on network devices Ming Lei
2012-11-03 8:35 ` Ming Lei
2012-11-03 8:35 ` [PATCH v4 5/6] PM / Runtime: force memory allocation with no I/O during Runtime PM callbcack Ming Lei
2012-11-03 8:35 ` Ming Lei
2012-11-03 8:35 ` [PATCH v4 6/6] USB: forbid memory allocation with I/O during bus reset Ming Lei
2012-11-03 8:35 ` Ming Lei
2012-11-06 23:23 ` [PATCH v4 0/6] solve deadlock caused by memory allocation with I/O Andrew Morton
2012-11-06 23:23 ` Andrew Morton
2012-11-06 23:23 ` Andrew Morton
2012-11-07 3:37 ` Ming Lei
2012-11-07 3:37 ` Ming Lei
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20121106152354.90150a3b.akpm@linux-foundation.org \
--to=akpm@linux-foundation.org \
--cc=axboe@kernel.dk \
--cc=davem@davemloft.net \
--cc=gregkh@linuxfoundation.org \
--cc=jiri.kosina@suse.com \
--cc=kamezawa.hiroyu@jp.fujitsu.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-pm@vger.kernel.org \
--cc=linux-usb@vger.kernel.org \
--cc=mel@csn.ul.ie \
--cc=mhocko@suse.cz \
--cc=minchan@kernel.org \
--cc=ming.lei@canonical.com \
--cc=mingo@redhat.com \
--cc=netdev@vger.kernel.org \
--cc=oneukum@suse.de \
--cc=peterz@infradead.org \
--cc=rjw@sisk.pl \
--cc=stern@rowland.harvard.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.