* [PATCH v3 0/4] improve memalloc scope APIs usage
@ 2020-04-09 14:17 colyli
2020-04-09 14:17 ` [PATCH v3 1/4] md: use memalloc scope APIs in mddev_suspend()/mddev_resume() colyli
` (3 more replies)
0 siblings, 4 replies; 7+ messages in thread
From: colyli @ 2020-04-09 14:17 UTC (permalink / raw)
To: songliubraving, linux-raid
Cc: mhocko, kent.overstreet, guoqing.jiang, Coly Li
From: Coly Li <colyli@suse.de>
Hi folks,
The motivation of this series is to fix the incorrect GFP_NOIO flag
usage in drivers/md/raid5.c:resize_chunks(). I take Michal Hocko's
suggestion to use the memalloc scope APIs in the unified entry points
mddev_suspend()/mddev_resume(). I also get rid of the incorrect
GFP_NOIO usage for scribble_alloc(), remove the redundant memalloc
scope API usage in mddev_create_serial_pool(), and, as Song Liu
suggested, update the code comment in the header of scribble_alloc().
Thank you in advance for the review and comments.
Coly Li
---
Changelog:
v3: Minor code cleanup.
v2: Add memalloc scope APIs in raid array suspend context.
v1: Original version to add memalloc scope APIs in resize_chunks().
Coly Li (4):
md: use memalloc scope APIs in mddev_suspend()/mddev_resume()
raid5: remove gfp flags from scribble_alloc()
raid5: update code comment of scribble_alloc()
md: remove redundant memalloc scope API usage
drivers/md/md.c | 12 ++++++++----
drivers/md/md.h | 1 +
drivers/md/raid5.c | 22 ++++++++++++++--------
3 files changed, 23 insertions(+), 12 deletions(-)
--
2.25.0
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH v3 1/4] md: use memalloc scope APIs in mddev_suspend()/mddev_resume()
2020-04-09 14:17 [PATCH v3 0/4] improve memalloc scope APIs usage colyli
@ 2020-04-09 14:17 ` colyli
2020-04-09 15:05 ` Michal Hocko
2020-04-09 14:17 ` [PATCH v3 2/4] raid5: remove gfp flags from scribble_alloc() colyli
` (2 subsequent siblings)
3 siblings, 1 reply; 7+ messages in thread
From: colyli @ 2020-04-09 14:17 UTC (permalink / raw)
To: songliubraving, linux-raid
Cc: mhocko, kent.overstreet, guoqing.jiang, Coly Li
From: Coly Li <colyli@suse.de>
In raid5.c:resize_chunks(), scribble_alloc() is called with the
GFP_NOIO flag, which is then passed to kvmalloc_array() inside
scribble_alloc().
The problem is that kvmalloc_array() eventually calls kvmalloc_node(),
which does not accept non-GFP_KERNEL-compatible flags like GFP_NOIO;
as a result only kmalloc_node() is called, to allocate physically
contiguous pages. When system memory is under heavy pressure and the
requested size is large, there is a high probability that allocating
contiguous pages will fail.
But simply calling kvmalloc_array() with the GFP_KERNEL flag is also
problematic. In the code path where scribble_alloc() is called, the
raid array is suspended; if kvmalloc_node() triggers memory reclaim
I/Os and such I/Os go back to the suspended raid array, a deadlock
will happen.
What is desired here is to allocate virtually (rather than physically)
contiguous pages and to avoid memory reclaim I/Os. Michal Hocko
suggested using the memalloc scope APIs to restrict memory reclaim I/O
in the allocating context, specifically calling memalloc_noio_save()
when suspending the raid array and memalloc_noio_restore() when
resuming it.
This patch adds the memalloc scope APIs in mddev_suspend() and
mddev_resume(), to restrict memory reclaim I/Os while the raid array
is suspended. The benefit of adding the memalloc scope APIs in the
unified entry points mddev_suspend()/mddev_resume() is that, no matter
which md raid array type (personality) is used, we are sure the
deadlock caused by recursive memory reclaim I/O won't happen in the
suspending context.
Please notice that the memalloc scope APIs only take effect in the
raid array suspending context; if the memory allocation happens in a
new kthread created after the raid array is suspended, the recursive
memory reclaim I/Os won't be restricted. The mddev_suspend()/
mddev_resume() entries are used for the critical section where the
raid metadata is being modified; creating a kthread to allocate memory
inside that critical section would be unusual and very probably buggy.
Fixes: b330e6a49dc3 ("md: convert to kvmalloc")
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
drivers/md/md.c | 4 ++++
drivers/md/md.h | 1 +
2 files changed, 5 insertions(+)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 271e8a587354..1a8e1bb3a7f4 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -527,11 +527,15 @@ void mddev_suspend(struct mddev *mddev)
wait_event(mddev->sb_wait, !test_bit(MD_UPDATING_SB, &mddev->flags));
del_timer_sync(&mddev->safemode_timer);
+ /* restrict memory reclaim I/O while the raid array is suspended */
+ mddev->noio_flag = memalloc_noio_save();
}
EXPORT_SYMBOL_GPL(mddev_suspend);
void mddev_resume(struct mddev *mddev)
{
+ /* entered the memalloc scope from mddev_suspend() */
+ memalloc_noio_restore(mddev->noio_flag);
lockdep_assert_held(&mddev->reconfig_mutex);
if (--mddev->suspended)
return;
diff --git a/drivers/md/md.h b/drivers/md/md.h
index acd681939112..612814d07d35 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -497,6 +497,7 @@ struct mddev {
void (*sync_super)(struct mddev *mddev, struct md_rdev *rdev);
struct md_cluster_info *cluster_info;
unsigned int good_device_nr; /* good device num within cluster raid */
+ unsigned int noio_flag; /* for memalloc scope API */
bool has_superblocks:1;
bool fail_last_dev:1;
--
2.25.0
* [PATCH v3 2/4] raid5: remove gfp flags from scribble_alloc()
2020-04-09 14:17 [PATCH v3 0/4] improve memalloc scope APIs usage colyli
2020-04-09 14:17 ` [PATCH v3 1/4] md: use memalloc scope APIs in mddev_suspend()/mddev_resume() colyli
@ 2020-04-09 14:17 ` colyli
2020-04-09 14:17 ` [PATCH v3 3/4] raid5: update code comment of scribble_alloc() colyli
2020-04-09 14:17 ` [PATCH v3 4/4] md: remove redundant memalloc scope API usage colyli
3 siblings, 0 replies; 7+ messages in thread
From: colyli @ 2020-04-09 14:17 UTC (permalink / raw)
To: songliubraving, linux-raid
Cc: mhocko, kent.overstreet, guoqing.jiang, Coly Li
From: Coly Li <colyli@suse.de>
Calling scribble_alloc() from resize_chunks() with the GFP_NOIO flag
does not have the expected behavior: kvmalloc_array() inside
scribble_alloc(), when given the GFP_NOIO flag, will eventually call
kmalloc_node() to allocate physically contiguous pages.
Now that we have the memalloc scope APIs in mddev_suspend()/
mddev_resume() to prevent memory reclaim I/Os in the raid array
suspend context, calling kvmalloc_array() with the GFP_KERNEL flag
avoids the deadlock of recursive I/O as expected.
This patch removes the now-useless gfp flags parameter from
scribble_alloc() and calls kvmalloc_array() with the GFP_KERNEL flag,
so the incorrect GFP_NOIO flag no longer exists.
Fixes: b330e6a49dc3 ("md: convert to kvmalloc")
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
drivers/md/raid5.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ba00e9877f02..190dd70db514 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2228,14 +2228,19 @@ static int grow_stripes(struct r5conf *conf, int num)
* of the P and Q blocks.
*/
static int scribble_alloc(struct raid5_percpu *percpu,
- int num, int cnt, gfp_t flags)
+ int num, int cnt)
{
size_t obj_size =
sizeof(struct page *) * (num+2) +
sizeof(addr_conv_t) * (num+2);
void *scribble;
- scribble = kvmalloc_array(cnt, obj_size, flags);
+ /*
+ * If this is called from the raid array suspend context, it is in
+ * the memalloc noio context as well, so there is no potential
+ * recursive memory reclaim I/O with the GFP_KERNEL flag.
+ */
+ scribble = kvmalloc_array(cnt, obj_size, GFP_KERNEL);
if (!scribble)
return -ENOMEM;
@@ -2267,8 +2272,7 @@ static int resize_chunks(struct r5conf *conf, int new_disks, int new_sectors)
percpu = per_cpu_ptr(conf->percpu, cpu);
err = scribble_alloc(percpu, new_disks,
- new_sectors / STRIPE_SECTORS,
- GFP_NOIO);
+ new_sectors / STRIPE_SECTORS);
if (err)
break;
}
@@ -6759,8 +6763,7 @@ static int alloc_scratch_buffer(struct r5conf *conf, struct raid5_percpu *percpu
conf->previous_raid_disks),
max(conf->chunk_sectors,
conf->prev_chunk_sectors)
- / STRIPE_SECTORS,
- GFP_KERNEL)) {
+ / STRIPE_SECTORS)) {
free_scratch_buffer(conf, percpu);
return -ENOMEM;
}
--
2.25.0
* [PATCH v3 3/4] raid5: update code comment of scribble_alloc()
2020-04-09 14:17 [PATCH v3 0/4] improve memalloc scope APIs usage colyli
2020-04-09 14:17 ` [PATCH v3 1/4] md: use memalloc scope APIs in mddev_suspend()/mddev_resume() colyli
2020-04-09 14:17 ` [PATCH v3 2/4] raid5: remove gfp flags from scribble_alloc() colyli
@ 2020-04-09 14:17 ` colyli
2020-04-09 14:17 ` [PATCH v3 4/4] md: remove redundant memalloc scope API usage colyli
3 siblings, 0 replies; 7+ messages in thread
From: colyli @ 2020-04-09 14:17 UTC (permalink / raw)
To: songliubraving, linux-raid
Cc: mhocko, kent.overstreet, guoqing.jiang, Coly Li
From: Coly Li <colyli@suse.de>
The code comment of scribble_alloc() has been outdated for a while.
This patch updates the comment in the function header to match the new
parameter list.
Suggested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
drivers/md/raid5.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 190dd70db514..ab8067f9ce8c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2215,10 +2215,13 @@ static int grow_stripes(struct r5conf *conf, int num)
}
/**
- * scribble_len - return the required size of the scribble region
+ * scribble_alloc - allocate percpu scribble buffer for required size
+ * of the scribble region
+ * @percpu - from for_each_present_cpu() of the caller
* @num - total number of disks in the array
+ * @cnt - scribble objs count for required size of the scribble region
*
- * The size must be enough to contain:
+ * The scribble buffer size must be enough to contain:
* 1/ a struct page pointer for each device in the array +2
* 2/ room to convert each entry in (1) to its corresponding dma
* (dma_map_page()) or page (page_address()) address.
--
2.25.0
* [PATCH v3 4/4] md: remove redundant memalloc scope API usage
2020-04-09 14:17 [PATCH v3 0/4] improve memalloc scope APIs usage colyli
` (2 preceding siblings ...)
2020-04-09 14:17 ` [PATCH v3 3/4] raid5: update code comment of scribble_alloc() colyli
@ 2020-04-09 14:17 ` colyli
3 siblings, 0 replies; 7+ messages in thread
From: colyli @ 2020-04-09 14:17 UTC (permalink / raw)
To: songliubraving, linux-raid
Cc: mhocko, kent.overstreet, guoqing.jiang, Coly Li
From: Coly Li <colyli@suse.de>
In mddev_create_serial_pool(), the memalloc scope APIs
memalloc_noio_save() and memalloc_noio_restore() are used around the
memory allocation by mempool_create_kmalloc_pool(). After adding the
memalloc scope APIs to the raid array suspend context, it is no longer
necessary to call them explicitly around
mempool_create_kmalloc_pool().
This patch removes the redundant memalloc scope APIs from
mddev_create_serial_pool().
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
---
drivers/md/md.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1a8e1bb3a7f4..76ecf7b8cfc8 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -227,13 +227,13 @@ void mddev_create_serial_pool(struct mddev *mddev, struct md_rdev *rdev,
goto abort;
if (mddev->serial_info_pool == NULL) {
- unsigned int noio_flag;
-
- noio_flag = memalloc_noio_save();
+ /*
+ * already in memalloc noio context by
+ * mddev_suspend()
+ */
mddev->serial_info_pool =
mempool_create_kmalloc_pool(NR_SERIAL_INFOS,
sizeof(struct serial_info));
- memalloc_noio_restore(noio_flag);
if (!mddev->serial_info_pool) {
rdevs_uninit_serial(mddev);
pr_err("can't alloc memory pool for serialization\n");
--
2.25.0
* Re: [PATCH v3 1/4] md: use memalloc scope APIs in mddev_suspend()/mddev_resume()
2020-04-09 14:17 ` [PATCH v3 1/4] md: use memalloc scope APIs in mddev_suspend()/mddev_resume() colyli
@ 2020-04-09 15:05 ` Michal Hocko
2020-04-09 15:16 ` Coly Li
0 siblings, 1 reply; 7+ messages in thread
From: Michal Hocko @ 2020-04-09 15:05 UTC (permalink / raw)
To: colyli; +Cc: songliubraving, linux-raid, kent.overstreet, guoqing.jiang
On Thu 09-04-20 22:17:20, colyli@suse.de wrote:
> From: Coly Li <colyli@suse.de>
>
> In raid5.c:resize_chunks(), scribble_alloc() is called with the
> GFP_NOIO flag, which is then passed to kvmalloc_array() inside
> scribble_alloc().
>
> The problem is that kvmalloc_array() eventually calls
> kvmalloc_node(), which does not accept non-GFP_KERNEL-compatible
> flags like GFP_NOIO; as a result only kmalloc_node() is called, to
> allocate physically contiguous pages. When system memory is under
> heavy pressure and the requested size is large, there is a high
> probability that allocating contiguous pages will fail.
>
> But simply calling kvmalloc_array() with the GFP_KERNEL flag is also
> problematic. In the code path where scribble_alloc() is called, the
> raid array is suspended; if kvmalloc_node() triggers memory reclaim
> I/Os and such I/Os go back to the suspended raid array, a deadlock
> will happen.
>
> What is desired here is to allocate virtually (rather than
> physically) contiguous pages and to avoid memory reclaim I/Os.
> Michal Hocko suggested using the memalloc scope APIs to restrict
> memory reclaim I/O in the allocating context, specifically calling
> memalloc_noio_save() when suspending the raid array and
> memalloc_noio_restore() when resuming it.
>
> This patch adds the memalloc scope APIs in mddev_suspend() and
> mddev_resume(), to restrict memory reclaim I/Os while the raid array
> is suspended. The benefit of adding the memalloc scope APIs in the
> unified entry points mddev_suspend()/mddev_resume() is that, no
> matter which md raid array type (personality) is used, we are sure
> the deadlock caused by recursive memory reclaim I/O won't happen in
> the suspending context.
I am not familiar with the mdraid code so I cannot really judge the
correctness here but if mddev_suspend really acts as a potential reclaim
recursion deadlock entry then this is the right way to use the API.
Essentially all the allocations in that scope will have an implicit NOIO
semantic.
A thing to be careful about is to make sure that mddev_suspend cannot
be nested, and also that there are no callers of scribble_alloc
outside of the mddev_suspend scope which would be prone to the reclaim
deadlock. If there are, their scope should be handled in a similar
way.
Thanks!
--
Michal Hocko
SUSE Labs
* Re: [PATCH v3 1/4] md: use memalloc scope APIs in mddev_suspend()/mddev_resume()
2020-04-09 15:05 ` Michal Hocko
@ 2020-04-09 15:16 ` Coly Li
0 siblings, 0 replies; 7+ messages in thread
From: Coly Li @ 2020-04-09 15:16 UTC (permalink / raw)
To: Michal Hocko; +Cc: songliubraving, linux-raid, kent.overstreet, guoqing.jiang
On 2020/4/9 11:05 PM, Michal Hocko wrote:
> On Thu 09-04-20 22:17:20, colyli@suse.de wrote:
>> From: Coly Li <colyli@suse.de>
>>
>> In raid5.c:resize_chunks(), scribble_alloc() is called with the
>> GFP_NOIO flag, which is then passed to kvmalloc_array() inside
>> scribble_alloc().
>>
>> The problem is that kvmalloc_array() eventually calls
>> kvmalloc_node(), which does not accept non-GFP_KERNEL-compatible
>> flags like GFP_NOIO; as a result only kmalloc_node() is called, to
>> allocate physically contiguous pages. When system memory is under
>> heavy pressure and the requested size is large, there is a high
>> probability that allocating contiguous pages will fail.
>>
>> But simply calling kvmalloc_array() with the GFP_KERNEL flag is
>> also problematic. In the code path where scribble_alloc() is
>> called, the raid array is suspended; if kvmalloc_node() triggers
>> memory reclaim I/Os and such I/Os go back to the suspended raid
>> array, a deadlock will happen.
>>
>> What is desired here is to allocate virtually (rather than
>> physically) contiguous pages and to avoid memory reclaim I/Os.
>> Michal Hocko suggested using the memalloc scope APIs to restrict
>> memory reclaim I/O in the allocating context, specifically calling
>> memalloc_noio_save() when suspending the raid array and
>> memalloc_noio_restore() when resuming it.
>>
>> This patch adds the memalloc scope APIs in mddev_suspend() and
>> mddev_resume(), to restrict memory reclaim I/Os while the raid
>> array is suspended. The benefit of adding the memalloc scope APIs
>> in the unified entry points mddev_suspend()/mddev_resume() is that,
>> no matter which md raid array type (personality) is used, we are
>> sure the deadlock caused by recursive memory reclaim I/O won't
>> happen in the suspending context.
>
> I am not familiar with the mdraid code so I cannot really judge the
> correctness here but if mddev_suspend really acts as a potential reclaim
> recursion deadlock entry then this is the right way to use the API.
> Essentially all the allocations in that scope will have an implicit NOIO
> semantic.
>
> Thing to be careful about is the make sure that mddev_suspend cannot
> be nested. And also that there are no callers of scribble_alloc outside
> of mddev_suspend scope which would be reclaim deadlock prone. If they
> are their scope should be handled in the similar way.
Thank you for the confirmation, and the always constructive discussion :-)
--
Coly Li