* [PATCH] ext4: fix memory leak in ext4_fill_super
@ 2021-04-28 22:19 Alexey Makhalov
2021-05-21 4:43 ` Theodore Y. Ts'o
0 siblings, 1 reply; 16+ messages in thread
From: Alexey Makhalov @ 2021-04-28 22:19 UTC (permalink / raw)
To: linux-ext4; +Cc: Alexey Makhalov, stable, Theodore Ts'o, Andreas Dilger
I've recently discovered that doing infinite loop of
systemctl start <ext4_on_lvm>.mount, and
systemctl stop <ext4_on_lvm>.mount
linearly increases percpu allocator memory consumption.
In several hours, it might lead to system instability by
consuming most of the memory.
Bug is not reproducible when the ext4 filesystem is on
physical partition, but it is persistent when ext4 is on
logical volume.
During debugging it was found that most of active percpu
allocations are from /system.slice/<ext4_on_lvm>.mount
memory cgroups (created by systemd for each mount). All
of these cgroups are in dying state with refcount equal
to 2. And most interesting that each mount/umount itera-
tion creates exactly one dying memory cgroup.
Tracking down the remaining refcounts showed that it was
charged from ext4_fill_super(). And the page is always
0 index in the page cache mapping.
The issue was hidden behind initial super block read using
logical blocksize from bdev and adjusting blocksize later
after reading actual block size from superblock.
If blocksizes differ, sb_set_blocksize() will kill current
buffers and page cache by using kill_bdev(). And then super
block will be reread again but using correct blocksize this
time. sb_set_blocksize() didn't fully free superblock page
and buffers as buffer pointed by bh variable remained busy.
So buffer and its page remains in the memory (leak). Super
block reread logic does not happen when ext4 filesystem is
on physical partition as blocksize is correct for initial
superblock read.
brelse(bh), where bh is a buffer head of superblock page,
must be called and bh references must be released before
kill_bdev(). kill_bdev() subfunctions (see callstack below)
won't be able to free not released buffer (even if it's
clean) and superblock page won't be freed as well.
callstack:
kill_bdev()
->truncate_inode_pages()
->truncate_inode_pages_range()
->truncate_cleanup_page()
->do_invalidatepage
->block_invalidatepage()
->try_to_release_page() == fail to release
->try_to_free_buffers() == fail to free
->drop_buffers()
->buffer_busy() == yes
Incorrect order of brelse() and kill_bdev() in ext4_fill_super()
was introduced by commit ce40733ce93d ("ext4: Check for return
value from sb_set_blocksize") 13 years ago! Thanks to memory
hungry percpu, it was easy to detect this issue now.
Fix this by moving the brelse() before sb_set_blocksize() and
add a comment about the dependency.
In addition, fix similar issue under failed_mount: label (in
the same function) about incorrect order of ext4_blkdev_remove()
vs brelse() introduced by commit ac27a0ec112a ("ext4: initial
copy of files from ext3")
Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
Cc: stable@vger.kernel.org
Fixes: ce40733ce93d ("ext4: Check for return value from sb_set_blocksize")
Fixes: ac27a0ec112a ("ext4: initial copy of files from ext3")
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
---
fs/ext4/super.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index b9693680463a..6c8f68309834 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4451,14 +4451,20 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
}
if (sb->s_blocksize != blocksize) {
+ /*
+ * bh must be released before kill_bdev(), otherwise
+ * it won't be freed and its page also. kill_bdev()
+ * is called by sb_set_blocksize().
+ */
+ brelse(bh);
/* Validate the filesystem blocksize */
if (!sb_set_blocksize(sb, blocksize)) {
ext4_msg(sb, KERN_ERR, "bad block size %d",
blocksize);
+ bh = NULL;
goto failed_mount;
}
- brelse(bh);
logical_sb_block = sb_block * EXT4_MIN_BLOCK_SIZE;
offset = do_div(logical_sb_block, blocksize);
bh = ext4_sb_bread_unmovable(sb, logical_sb_block);
@@ -5178,8 +5184,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
kfree(get_qf_name(sb, sbi, i));
#endif
fscrypt_free_dummy_policy(&sbi->s_dummy_enc_policy);
- ext4_blkdev_remove(sbi);
+ /* ext4_blkdev_remove() calls kill_bdev(), release bh before it. */
brelse(bh);
+ ext4_blkdev_remove(sbi);
out_fail:
sb->s_fs_info = NULL;
kfree(sbi->s_blockgroup_lock);
--
2.14.2
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH] ext4: fix memory leak in ext4_fill_super
2021-04-28 22:19 [PATCH] ext4: fix memory leak in ext4_fill_super Alexey Makhalov
@ 2021-05-21 4:43 ` Theodore Y. Ts'o
2021-05-21 7:43 ` Alexey Makhalov
0 siblings, 1 reply; 16+ messages in thread
From: Theodore Y. Ts'o @ 2021-05-21 4:43 UTC (permalink / raw)
To: Alexey Makhalov; +Cc: linux-ext4, stable, Andreas Dilger
On Wed, Apr 28, 2021 at 10:19:28PM +0000, Alexey Makhalov wrote:
> I've recently discovered that doing infinite loop of
> systemctl start <ext4_on_lvm>.mount, and
> systemctl stop <ext4_on_lvm>.mount
> linearly increases percpu allocator memory consumption.
> In several hours, it might lead to system instability by
> consuming most of the memory.
>
> Bug is not reproducible when the ext4 filesystem is on
> physical partition, but it is persistent when ext4 is on
> logical volume.
Why is this the case? It sounds like we're looking a buffer for each
mount where the block size is not 1k. It shouldn't matter whether it
is a physical partition or not.
- Ted
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] ext4: fix memory leak in ext4_fill_super
2021-05-21 4:43 ` Theodore Y. Ts'o
@ 2021-05-21 7:43 ` Alexey Makhalov
2021-05-21 7:55 ` [PATCH v2] " Alexey Makhalov
2021-05-21 14:29 ` [PATCH] " Theodore Y. Ts'o
0 siblings, 2 replies; 16+ messages in thread
From: Alexey Makhalov @ 2021-05-21 7:43 UTC (permalink / raw)
To: Theodore Y. Ts'o; +Cc: linux-ext4, stable, Andreas Dilger
[-- Attachment #1: Type: text/plain, Size: 958 bytes --]
Hi Ted,
Good point! This paragraph can be just dropped as the next one
describes the issue with superblock re-read. Will send v2 shortly.
Thanks,
—Alexey
> On May 20, 2021, at 9:43 PM, Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> On Wed, Apr 28, 2021 at 10:19:28PM +0000, Alexey Makhalov wrote:
>> I've recently discovered that doing infinite loop of
>> systemctl start <ext4_on_lvm>.mount, and
>> systemctl stop <ext4_on_lvm>.mount
>> linearly increases percpu allocator memory consumption.
>> In several hours, it might lead to system instability by
>> consuming most of the memory.
>>
>> Bug is not reproducible when the ext4 filesystem is on
>> physical partition, but it is persistent when ext4 is on
>> logical volume.
>
> Why is this the case? It sounds like we're looking a buffer for each
> mount where the block size is not 1k. It shouldn't matter whether it
> is a physical partition or not.
>
> - Ted
[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v2] ext4: fix memory leak in ext4_fill_super
2021-05-21 7:43 ` Alexey Makhalov
@ 2021-05-21 7:55 ` Alexey Makhalov
2021-05-21 14:29 ` [PATCH] " Theodore Y. Ts'o
1 sibling, 0 replies; 16+ messages in thread
From: Alexey Makhalov @ 2021-05-21 7:55 UTC (permalink / raw)
To: Theodore Y . Ts'o; +Cc: linux-ext4, stable, Andreas Dilger, Alexey Makhalov
I've recently discovered that doing infinite loop of
systemctl start <ext4_on_lvm>.mount, and
systemctl stop <ext4_on_lvm>.mount
linearly increases percpu allocator memory consumption.
In several hours, it might lead to system instability by
consuming most of the memory.
During debugging it was found that most of active percpu
allocations are from /system.slice/<ext4_on_lvm>.mount
memory cgroups (created by systemd for each mount). All
of these cgroups are in dying state with refcount equal
to 2. And most interesting that each mount/umount itera-
tion creates exactly one dying memory cgroup.
Tracking down the remaining refcounts showed that it was
charged from ext4_fill_super(). And the page is always
0 index in the page cache mapping.
The issue was hidden behind initial super block read using
logical blocksize from bdev and adjusting blocksize later
after reading actual block size from superblock.
If blocksizes differ, sb_set_blocksize() will kill current
buffers and page cache by using kill_bdev(). And then super
block will be reread again but using correct blocksize this
time. sb_set_blocksize() didn't fully free superblock page
and buffers as buffer pointed by bh variable remained busy.
So buffer and its page remains in the memory (leak). Super
block reread logic does not happen when ext4 filesystem is
on physical partition as blocksize is correct for initial
superblock read.
brelse(bh), where bh is a buffer head of superblock page,
must be called and bh references must be released before
kill_bdev(). kill_bdev() subfunctions (see callstack below)
won't be able to free not released buffer (even if it's
clean) and superblock page won't be freed as well.
callstack:
kill_bdev()
->truncate_inode_pages()
->truncate_inode_pages_range()
->truncate_cleanup_page()
->do_invalidatepage
->block_invalidatepage()
->try_to_release_page() == fail to release
->try_to_free_buffers() == fail to free
->drop_buffers()
->buffer_busy() == yes
Incorrect order of brelse() and kill_bdev() in ext4_fill_super()
was introduced by commit ce40733ce93d ("ext4: Check for return
value from sb_set_blocksize") 13 years ago! Thanks to memory
hungry percpu, it was easy to detect this issue now.
Fix this by moving the brelse() before sb_set_blocksize() and
add a comment about the dependency.
In addition, fix similar issue under failed_mount: label (in
the same function) about incorrect order of ext4_blkdev_remove()
vs brelse() introduced by commit ac27a0ec112a ("ext4: initial
copy of files from ext3")
Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
Cc: stable@vger.kernel.org
Fixes: ce40733ce93d ("ext4: Check for return value from sb_set_blocksize")
Fixes: ac27a0ec112a ("ext4: initial copy of files from ext3")
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
---
fs/ext4/super.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index b9693680463a..6c8f68309834 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4451,14 +4451,20 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
}
if (sb->s_blocksize != blocksize) {
+ /*
+ * bh must be released before kill_bdev(), otherwise
+ * it won't be freed and its page also. kill_bdev()
+ * is called by sb_set_blocksize().
+ */
+ brelse(bh);
/* Validate the filesystem blocksize */
if (!sb_set_blocksize(sb, blocksize)) {
ext4_msg(sb, KERN_ERR, "bad block size %d",
blocksize);
+ bh = NULL;
goto failed_mount;
}
- brelse(bh);
logical_sb_block = sb_block * EXT4_MIN_BLOCK_SIZE;
offset = do_div(logical_sb_block, blocksize);
bh = ext4_sb_bread_unmovable(sb, logical_sb_block);
@@ -5178,8 +5184,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
kfree(get_qf_name(sb, sbi, i));
#endif
fscrypt_free_dummy_policy(&sbi->s_dummy_enc_policy);
- ext4_blkdev_remove(sbi);
+ /* ext4_blkdev_remove() calls kill_bdev(), release bh before it. */
brelse(bh);
+ ext4_blkdev_remove(sbi);
out_fail:
sb->s_fs_info = NULL;
kfree(sbi->s_blockgroup_lock);
--
2.14.2
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH] ext4: fix memory leak in ext4_fill_super
2021-05-21 7:43 ` Alexey Makhalov
2021-05-21 7:55 ` [PATCH v2] " Alexey Makhalov
@ 2021-05-21 14:29 ` Theodore Y. Ts'o
2021-05-21 16:12 ` Alexey Makhalov
1 sibling, 1 reply; 16+ messages in thread
From: Theodore Y. Ts'o @ 2021-05-21 14:29 UTC (permalink / raw)
To: Alexey Makhalov; +Cc: linux-ext4, stable, Andreas Dilger
On Fri, May 21, 2021 at 12:43:46AM -0700, Alexey Makhalov wrote:
> Hi Ted,
>
> Good point! This paragraph can be just dropped as the next one
> describes the issue with superblock re-read. Will send v2 shortly.
Thanks; for what it's worth, I'm going to be editing the commit
description anyway; it's really helpful during the patch review to
understand how you found the problem, and how to reproduce it.
However, the commit description when it lands upstream will include a
link to this mail thread on lore.kernel.org, and so including a long
stack trace isn't really necessary.
I'm going to cut it down to something like this:
------------------------------
ext4: fix memory leak in ext4_fill_super
Buffer head references must be released before calling kill_bdev();
otherwise the buffer head (and its page referenced by b_data) will not
be freed by kill_bdev, and subsequently that bh will be leaked.
If blocksizes differ, sb_set_blocksize() will kill current buffers and
page cache by using kill_bdev(). And then super block will be reread
again but using correct blocksize this time. sb_set_blocksize() didn't
fully free superblock page and buffer head, and being busy, they were
not freed and instead leaked.
This can easily be reproduced by calling an infinite loop of:
systemctl start <ext4_on_lvm>.mount, and
systemctl stop <ext4_on_lvm>.mount
... since systemd creates a cgroup for each slice which it mounts, and
the bh leak get amplified by a dying memory cgroup that also never
gets freed, and memory consumption is much more easily noticed.
Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/459B4724-842E-4B47-B2E7-D29805431E69@vmware.com
Fixes: ce40733ce93d ("ext4: Check for return value from sb_set_blocksize")
Fixes: ac27a0ec112a ("ext4: initial copy of files from ext3")
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
------------------------------
The commit description above is 18 lines (exclusive of trailers and
headers) versus 71 lines, and is much more to the point for someone
who might be doing code archeology via "git log".
When submitting this as a patch, you can either use a separate cover
letter (git format-patch --cover-letter) or including the explanation
after the --- in the diff, so that it disappears before the commit is
added via "git am". But it's not that hard for me to rework a commit
description, so I'll take care of it for this patch; no need to send a
V3.
Cheers,
- Ted
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] ext4: fix memory leak in ext4_fill_super
2021-05-21 14:29 ` [PATCH] " Theodore Y. Ts'o
@ 2021-05-21 16:12 ` Alexey Makhalov
0 siblings, 0 replies; 16+ messages in thread
From: Alexey Makhalov @ 2021-05-21 16:12 UTC (permalink / raw)
To: Theodore Y. Ts'o; +Cc: linux-ext4, stable, Andreas Dilger
[-- Attachment #1: Type: text/plain, Size: 3243 bytes --]
Sounds good! Thanks for the guidance and v3) —Alexey
> On May 21, 2021, at 7:29 AM, Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> On Fri, May 21, 2021 at 12:43:46AM -0700, Alexey Makhalov wrote:
>> Hi Ted,
>>
>> Good point! This paragraph can be just dropped as the next one
>> describes the issue with superblock re-read. Will send v2 shortly.
>
> Thanks; for what it's worth, I'm going to be editing the commit
> description anyway; it's really helpful during the patch review to
> understand how you found the problem, and how to reproduce it.
> However, the commit description when it lands upstream will include a
> link to this mail thread on lore.kernel.org, and so including a long
> stack trace isn't really necessary.
>
> I'm going to cut it down to something like this:
>
> ------------------------------
> ext4: fix memory leak in ext4_fill_super
>
> Buffer head references must be released before calling kill_bdev();
> otherwise the buffer head (and its page referenced by b_data) will not
> be freed by kill_bdev, and subsequently that bh will be leaked.
>
> If blocksizes differ, sb_set_blocksize() will kill current buffers and
> page cache by using kill_bdev(). And then super block will be reread
> again but using correct blocksize this time. sb_set_blocksize() didn't
> fully free superblock page and buffer head, and being busy, they were
> not freed and instead leaked.
>
> This can easily be reproduced by calling an infinite loop of:
>
> systemctl start <ext4_on_lvm>.mount, and
> systemctl stop <ext4_on_lvm>.mount
>
> ... since systemd creates a cgroup for each slice which it mounts, and
> the bh leak get amplified by a dying memory cgroup that also never
> gets freed, and memory consumption is much more easily noticed.
>
> Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
> Cc: stable@vger.kernel.org
> Link: https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fr%2F459B4724-842E-4B47-B2E7-D29805431E69%40vmware.com&data=04%7C01%7Camakhalov%40vmware.com%7Ce5ae2270f1134d9a3cce08d91c64cb26%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637572041508854286%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=%2Fa%2FqTcBfL1tYkIq0byM46DXmpxTFOraAly2Ib9sbghE%3D&reserved=0
> Fixes: ce40733ce93d ("ext4: Check for return value from sb_set_blocksize")
> Fixes: ac27a0ec112a ("ext4: initial copy of files from ext3")
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> Cc: Andreas Dilger <adilger.kernel@dilger.ca>
> ------------------------------
>
> The commit description above is 18 lines (exclusive of trailers and
> headers) versus 71 lines, and is much more to the point for someone
> who might be doing code archeology via "git log".
>
> When submitting this as a patch, you can either use a separate cover
> letter (git format-patch --cover-letter) or including the explanation
> after the --- in the diff, so that it disappears before the commit is
> added via "git am". But it's not that hard for me to rework a commit
> description, so I'll take care of it for this patch; no need to send a
> V3.
>
> Cheers,
>
> - Ted
[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH] ext4: fix memory leak in ext4_fill_super
@ 2021-04-28 17:28 Pavel Skripkin
2021-04-29 10:01 ` Vegard Nossum
0 siblings, 1 reply; 16+ messages in thread
From: Pavel Skripkin @ 2021-04-28 17:28 UTC (permalink / raw)
To: tytso, adilger.kernel
Cc: linux-ext4, linux-kernel, Pavel Skripkin,
syzbot+d9e482e303930fa4f6ff
syzbot reported memory leak in ext4 subsyetem.
The problem appears, when thread_stop() call happens
before wake_up_process().
Normally, this data will be freed by
created thread, but if kthread_stop()
returned -EINTR, this data should be freed manually
Reported-by: syzbot+d9e482e303930fa4f6ff@syzkaller.appspotmail.com
Tested-by: syzbot+d9e482e303930fa4f6ff@syzkaller.appspotmail.com
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
---
fs/ext4/super.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index b9693680463a..9c33e97bd5c5 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5156,8 +5156,10 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
failed_mount3:
flush_work(&sbi->s_error_work);
del_timer_sync(&sbi->s_err_report);
- if (sbi->s_mmp_tsk)
- kthread_stop(sbi->s_mmp_tsk);
+ if (sbi->s_mmp_tsk) {
+ if (kthread_stop(sbi->s_mmp_tsk) == -EINTR)
+ kfree(kthread_data(sbi->s_mmp_tsk));
+ }
failed_mount2:
rcu_read_lock();
group_desc = rcu_dereference(sbi->s_group_desc);
--
2.31.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH] ext4: fix memory leak in ext4_fill_super
2021-04-28 17:28 Pavel Skripkin
@ 2021-04-29 10:01 ` Vegard Nossum
2021-04-29 11:08 ` Pavel Skripkin
2021-04-29 11:33 ` Pavel Skripkin
0 siblings, 2 replies; 16+ messages in thread
From: Vegard Nossum @ 2021-04-29 10:01 UTC (permalink / raw)
To: Pavel Skripkin, tytso, adilger.kernel
Cc: linux-ext4, linux-kernel, syzbot+d9e482e303930fa4f6ff
On 2021-04-28 19:28, Pavel Skripkin wrote:
> syzbot reported memory leak in ext4 subsyetem.
> The problem appears, when thread_stop() call happens
> before wake_up_process().
>
> Normally, this data will be freed by
> created thread, but if kthread_stop()
> returned -EINTR, this data should be freed manually
>
> Reported-by: syzbot+d9e482e303930fa4f6ff@syzkaller.appspotmail.com
> Tested-by: syzbot+d9e482e303930fa4f6ff@syzkaller.appspotmail.com
> Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
> ---
> fs/ext4/super.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index b9693680463a..9c33e97bd5c5 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -5156,8 +5156,10 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> failed_mount3:
> flush_work(&sbi->s_error_work);
> del_timer_sync(&sbi->s_err_report);
> - if (sbi->s_mmp_tsk)
> - kthread_stop(sbi->s_mmp_tsk);
> + if (sbi->s_mmp_tsk) {
> + if (kthread_stop(sbi->s_mmp_tsk) == -EINTR)
> + kfree(kthread_data(sbi->s_mmp_tsk));
> + }
> failed_mount2:
> rcu_read_lock();
> group_desc = rcu_dereference(sbi->s_group_desc);
>
So I've looked at this, and the puzzling thing is that ext4 uses
kthread_run() which immediately calls wake_up_process() -- according to
the kerneldoc for kthread_stop(), it shouldn't return -EINTR in this
case:
* Returns the result of threadfn(), or %-EINTR if wake_up_process()
* was never called.
*/
int kthread_stop(struct task_struct *k)
So it really looks like kthread_stop() can return -EINTR even when
wake_up_process() has been called but the thread hasn't had a chance to
run yet?
If this is true, then we either have to fix kthread_create() to make
sure it respects the behaviour that is claimed by the comment OR we have
to audit every single kthread_stop() in the kernel which does not check
for -EINTR.
Vegard
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] ext4: fix memory leak in ext4_fill_super
2021-04-29 10:01 ` Vegard Nossum
@ 2021-04-29 11:08 ` Pavel Skripkin
2021-04-29 11:33 ` Pavel Skripkin
1 sibling, 0 replies; 16+ messages in thread
From: Pavel Skripkin @ 2021-04-29 11:08 UTC (permalink / raw)
To: Vegard Nossum
Cc: tytso, adilger.kernel, linux-ext4, linux-kernel,
syzbot+d9e482e303930fa4f6ff
On Thu, 29 Apr 2021 12:01:46 +0200
Vegard Nossum <vegard.nossum@oracle.com> wrote:
>
> On 2021-04-28 19:28, Pavel Skripkin wrote:
> > syzbot reported memory leak in ext4 subsyetem.
> > The problem appears, when thread_stop() call happens
> > before wake_up_process().
> >
> > Normally, this data will be freed by
> > created thread, but if kthread_stop()
> > returned -EINTR, this data should be freed manually
> >
> > Reported-by: syzbot+d9e482e303930fa4f6ff@syzkaller.appspotmail.com
> > Tested-by: syzbot+d9e482e303930fa4f6ff@syzkaller.appspotmail.com
> > Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
> > ---
> > fs/ext4/super.c | 6 ++++--
> > 1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index b9693680463a..9c33e97bd5c5 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -5156,8 +5156,10 @@ static int ext4_fill_super(struct
> > super_block *sb, void *data, int silent) failed_mount3:
> > flush_work(&sbi->s_error_work);
> > del_timer_sync(&sbi->s_err_report);
> > - if (sbi->s_mmp_tsk)
> > - kthread_stop(sbi->s_mmp_tsk);
> > + if (sbi->s_mmp_tsk) {
> > + if (kthread_stop(sbi->s_mmp_tsk) == -EINTR)
> > + kfree(kthread_data(sbi->s_mmp_tsk));
> > + }
> > failed_mount2:
> > rcu_read_lock();
> > group_desc = rcu_dereference(sbi->s_group_desc);
> >
>
> So I've looked at this, and the puzzling thing is that ext4 uses
> kthread_run() which immediately calls wake_up_process() -- according
> to the kerneldoc for kthread_stop(), it shouldn't return -EINTR in
> this case:
>
> * Returns the result of threadfn(), or %-EINTR if wake_up_process()
> * was never called.
> */
> int kthread_stop(struct task_struct *k)
>
> So it really looks like kthread_stop() can return -EINTR even when
> wake_up_process() has been called but the thread hasn't had a chance
> to run yet?
>
> If this is true, then we either have to fix kthread_create() to make
> sure it respects the behaviour that is claimed by the comment OR we
> have to audit every single kthread_stop() in the kernel which does
> not check for -EINTR.
>
Me and Vegard found the root case of this bug:
static int kthread(void *_create)
{
....
ret = -EINTR;
if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) {
cgroup_kthread_ready();
__kthread_parkme(self);
ret = threadfn(data);
}
do_exit(ret);
}
There is a change, that kthread_stop() call will happen before this
. It means, that all kthread_stop() return value must be checked
everywhere
Vegard wrote code snippet, which reproduces this behavior:
#include <linux/printk.h>
#include <linux/proc_fs.h>
#include <linux/kthread.h>
static int test_thread(void *data)
{
printk(KERN_ERR "test_thread()\n");
return 0;
}
static int test_show(struct seq_file *seq, void *data)
{
struct task_struct *t = kthread_run(test_thread, NULL, "test");
if (!IS_ERR(t)) {
int ret = kthread_stop(t);
printk(KERN_ERR "kthread_stop() = %d\n", ret);
}
return 0;
}
static void __init init_test(void)
{
proc_create_single("test", 0444, NULL, &test_show);
}
late_initcall(init_test);
>
> Vegard
With regards,
Pavel Skripkin
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] ext4: fix memory leak in ext4_fill_super
2021-04-29 10:01 ` Vegard Nossum
2021-04-29 11:08 ` Pavel Skripkin
@ 2021-04-29 11:33 ` Pavel Skripkin
2021-04-29 17:05 ` Theodore Ts'o
1 sibling, 1 reply; 16+ messages in thread
From: Pavel Skripkin @ 2021-04-29 11:33 UTC (permalink / raw)
To: Vegard Nossum, akpm, peterz, axboe, pmladek
Cc: tytso, adilger.kernel, linux-ext4, linux-kernel,
syzbot+d9e482e303930fa4f6ff
On Thu, 29 Apr 2021 12:01:46 +0200
Vegard Nossum <vegard.nossum@oracle.com> wrote:
>
> On 2021-04-28 19:28, Pavel Skripkin wrote:
> > syzbot reported memory leak in ext4 subsyetem.
> > The problem appears, when thread_stop() call happens
> > before wake_up_process().
> >
> > Normally, this data will be freed by
> > created thread, but if kthread_stop()
> > returned -EINTR, this data should be freed manually
> >
> > Reported-by: syzbot+d9e482e303930fa4f6ff@syzkaller.appspotmail.com
> > Tested-by: syzbot+d9e482e303930fa4f6ff@syzkaller.appspotmail.com
> > Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
> > ---
> > fs/ext4/super.c | 6 ++++--
> > 1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index b9693680463a..9c33e97bd5c5 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -5156,8 +5156,10 @@ static int ext4_fill_super(struct
> > super_block *sb, void *data, int silent) failed_mount3:
> > flush_work(&sbi->s_error_work);
> > del_timer_sync(&sbi->s_err_report);
> > - if (sbi->s_mmp_tsk)
> > - kthread_stop(sbi->s_mmp_tsk);
> > + if (sbi->s_mmp_tsk) {
> > + if (kthread_stop(sbi->s_mmp_tsk) == -EINTR)
> > + kfree(kthread_data(sbi->s_mmp_tsk));
> > + }
> > failed_mount2:
> > rcu_read_lock();
> > group_desc = rcu_dereference(sbi->s_group_desc);
> >
>
> So I've looked at this, and the puzzling thing is that ext4 uses
> kthread_run() which immediately calls wake_up_process() -- according
> to the kerneldoc for kthread_stop(), it shouldn't return -EINTR in
> this case:
>
> * Returns the result of threadfn(), or %-EINTR if wake_up_process()
> * was never called.
> */
> int kthread_stop(struct task_struct *k)
>
> So it really looks like kthread_stop() can return -EINTR even when
> wake_up_process() has been called but the thread hasn't had a chance
> to run yet?
>
> If this is true, then we either have to fix kthread_create() to make
> sure it respects the behaviour that is claimed by the comment OR we
> have to audit every single kthread_stop() in the kernel which does
> not check for -EINTR.
>
>
> Vegard
I am sorry for my complitely broken mail client :(
Me and Vegard found the root case of this bug:
static int kthread(void *_create)
{
....
ret = -EINTR;
if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) {
cgroup_kthread_ready();
__kthread_parkme(self);
ret = threadfn(data);
}
do_exit(ret);
}
There is a chance, that kthread_stop() call will happen before
threadfn call. It means, that kthread_stop() return value must be checked everywhere,
isn't it? Otherwise, there are a lot of potential memory leaks,
because some developers rely on the fact, that data allocated for the thread will
be freed _inside_ thread function.
Vegard wrote the code snippet, which reproduces this behavior:
#include <linux/printk.h>
#include <linux/proc_fs.h>
#include <linux/kthread.h>
static int test_thread(void *data)
{
printk(KERN_ERR "test_thread()\n");
return 0;
}
static int test_show(struct seq_file *seq, void *data)
{
struct task_struct *t = kthread_run(test_thread, NULL, "test");
if (!IS_ERR(t)) {
int ret = kthread_stop(t);
printk(KERN_ERR "kthread_stop() = %d\n", ret);
}
return 0;
}
static void __init init_test(void)
{
proc_create_single("test", 0444, NULL, &test_show);
}
late_initcall(init_test);
So, is this behavior is expected or not? Should maintainers rewrite
code, which doesn't check kthread_stop() return value?
With regards,
Pavel Skripkin
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] ext4: fix memory leak in ext4_fill_super
2021-04-29 11:33 ` Pavel Skripkin
@ 2021-04-29 17:05 ` Theodore Ts'o
2021-04-29 19:20 ` Pavel Skripkin
2021-04-29 20:09 ` Pavel Skripkin
0 siblings, 2 replies; 16+ messages in thread
From: Theodore Ts'o @ 2021-04-29 17:05 UTC (permalink / raw)
To: Pavel Skripkin
Cc: Vegard Nossum, akpm, peterz, axboe, pmladek, adilger.kernel,
linux-ext4, linux-kernel, syzbot+d9e482e303930fa4f6ff
On Thu, Apr 29, 2021 at 02:33:54PM +0300, Pavel Skripkin wrote:
>
> There is a chance, that kthread_stop() call will happen before
> threadfn call. It means, that kthread_stop() return value must be checked everywhere,
> isn't it? Otherwise, there are a lot of potential memory leaks,
> because some developers rely on the fact, that data allocated for the thread will
> be freed _inside_ thread function.
That's not the only potential way that we could leak memory. Earlier
in kthread(), if this memory allocation fails,
self = kzalloc(sizeof(*self), GFP_KERNEL);
we will exit with -ENOMEM. So at the very least all callers of
kthread_stop() also need to check for -ENOMEM as well as -EINTR ---
or, be somehow sure that the thread function was successfully called
and started. In this particular case, the ext4 mount code had just
started the kmmpd thread, and then detected that something else had
gone wrong, and failed the mount before the kmmpd thread ever had a
chance to run.
I think if we want to fix this more generally across the whole kernel,
we would need to have a variant of kthread_run which supplies two
functions --- one which is the thread function, and the other which is
a cleanup function. The cleanup function could just be kfree, but
there will be other cases where the cleanup function will need to do
other work before freeing the data structure (e.g., brelse((struct
mmpd_data *)data->bh)).
Is it worth it to provide such a cleanup function, which if present
would be called any time the thread exits or is killed? I dunno.
It's probably simpler to just strongly recommend that the cleanup work
should never be done in the thread function, but after kthread_stop()
is called, whether it returns an error or not. That's probably the
right fix for ext4, I think.
(Although note that kthread_stop(sbi->s_mmp_task) is called in
multiple places in fs/ext4/super.c, not just in the single location
which this patch touches.)
- Ted
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] ext4: fix memory leak in ext4_fill_super
2021-04-29 17:05 ` Theodore Ts'o
@ 2021-04-29 19:20 ` Pavel Skripkin
2021-04-29 20:09 ` Pavel Skripkin
1 sibling, 0 replies; 16+ messages in thread
From: Pavel Skripkin @ 2021-04-29 19:20 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Vegard Nossum, akpm, peterz, axboe, pmladek, adilger.kernel,
linux-ext4, linux-kernel, syzbot+d9e482e303930fa4f6ff
Hi! Thanks for your reply.
On Thu, 29 Apr 2021 13:05:01 -0400
"Theodore Ts'o" <tytso@mit.edu> wrote:
> On Thu, Apr 29, 2021 at 02:33:54PM +0300, Pavel Skripkin wrote:
> >
> > There is a chance, that kthread_stop() call will happen before
> > threadfn call. It means, that kthread_stop() return value must be
> > checked everywhere, isn't it? Otherwise, there are a lot of
> > potential memory leaks, because some developers rely on the fact,
> > that data allocated for the thread will be freed _inside_ thread
> > function.
>
> That's not the only potential way that we could leak memory. Earlier
> in kthread(), if this memory allocation fails,
>
> self = kzalloc(sizeof(*self), GFP_KERNEL);
>
> we will exit with -ENOMEM. So at the very least all callers of
> kthread_stop() also need to check for -ENOMEM as well as -EINTR ---
> or, be somehow sure that the thread function was successfully called
> and started. In this particular case, the ext4 mount code had just
> started the kmmpd thread, and then detected that something else had
> gone wrong, and failed the mount before the kmmpd thread ever had a
> chance to run.
>
> I think if we want to fix this more generally across the whole kernel,
> we would need to have a variant of kthread_run which supplies two
> functions --- one which is the thread function, and the other which is
> a cleanup function. The cleanup function could just be kfree, but
> there will be other cases where the cleanup function will need to do
> other work before freeing the data structure (e.g., brelse((struct
> mmpd_data *)data->bh)).
I skimmed through kernel code and I didn't find any code
examples, except ext4, where kthread is freeing something. Maybe, this
API isn't required, but, as Vegard said, comment over
kthread_stop() should be changed, because it's confusing.
I have already added kthread.c developers (I hope, I chose
the right emails) to CC. Maybe, they will think about this API.
>
> Is it worth it to provide such a cleanup function, which if present
> would be called any time the thread exits or is killed? I dunno.
> It's probably simpler to just strongly recommend that the cleanup work
> should never be done in the thread function, but after kthread_stop()
> is called, whether it returns an error or not. That's probably the
> right fix for ext4, I think.
>
> (Although note that kthread_stop(sbi->s_mmp_task) is called in
> multiple places in fs/ext4/super.c, not just in the single location
> which this patch touches.)
>
Good point, I'll add this and -ENOMEM checks and will send v2.
Thanks!
> - Ted
With regards,
Pavel Skripkin
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] ext4: fix memory leak in ext4_fill_super
2021-04-29 17:05 ` Theodore Ts'o
2021-04-29 19:20 ` Pavel Skripkin
@ 2021-04-29 20:09 ` Pavel Skripkin
2021-04-29 21:41 ` Theodore Ts'o
1 sibling, 1 reply; 16+ messages in thread
From: Pavel Skripkin @ 2021-04-29 20:09 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Vegard Nossum, akpm, peterz, axboe, pmladek, adilger.kernel,
linux-ext4, linux-kernel, syzbot+d9e482e303930fa4f6ff
On Thu, 29 Apr 2021 13:05:01 -0400
"Theodore Ts'o" <tytso@mit.edu> wrote:
> On Thu, Apr 29, 2021 at 02:33:54PM +0300, Pavel Skripkin wrote:
> >
> > There is a chance, that kthread_stop() call will happen before
> > threadfn call. It means, that kthread_stop() return value must be
> > checked everywhere, isn't it? Otherwise, there are a lot of
> > potential memory leaks, because some developers rely on the fact,
> > that data allocated for the thread will be freed _inside_ thread
> > function.
>
> That's not the only potential way that we could leak memory. Earlier
> in kthread(), if this memory allocation fails,
>
> self = kzalloc(sizeof(*self), GFP_KERNEL);
>
> we will exit with -ENOMEM. So at the very least all callers of
> kthread_stop() also need to check for -ENOMEM as well as -EINTR ---
> or, be somehow sure that the thread function was successfully called
> and started. In this particular case, the ext4 mount code had just
> started the kmmpd thread, and then detected that something else had
> gone wrong, and failed the mount before the kmmpd thread ever had a
> chance to run.
There is a small problem about -ENOMEM:
static int kmmpd(void *data)
{
...
retval = read_mmp_block(sb, &bh_check, mmp_block);
if (retval) {
ext4_error_err(sb, -retval,
"error reading MMP data: %d",
retval);
goto exit_thread;
}
...
exit_thread:
EXT4_SB(sb)->s_mmp_tsk = NULL;
kfree(data);
brelse(bh);
return retval;
}
read_mmp_block can return -ENOMEM. In this case double free will happen.
I believe, we can change `return retval;` to `return 0;`, because
kthread return value isn't checked anywhere.
What do You think about it?
With regards,
Pavel Skripkin
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] ext4: fix memory leak in ext4_fill_super
2021-04-29 20:09 ` Pavel Skripkin
@ 2021-04-29 21:41 ` Theodore Ts'o
2021-04-29 22:05 ` Pavel Skripkin
0 siblings, 1 reply; 16+ messages in thread
From: Theodore Ts'o @ 2021-04-29 21:41 UTC (permalink / raw)
To: Pavel Skripkin
Cc: Vegard Nossum, akpm, peterz, axboe, pmladek, adilger.kernel,
linux-ext4, linux-kernel, syzbot+d9e482e303930fa4f6ff
On Thu, Apr 29, 2021 at 11:09:56PM +0300, Pavel Skripkin wrote:
> > we will exit with -ENOMEM. So at the very least all callers of
> > kthread_stop() also need to check for -ENOMEM as well as -EINTR ---
> > or, be somehow sure that the thread function was successfully called
> > and started. In this particular case, the ext4 mount code had just
> > started the kmmpd thread, and then detected that something else had
> > gone wrong, and failed the mount before the kmmpd thread ever had a
> > chance to run.
>
> There is a small problem about -ENOMEM...
What I'd suggest is that we simply move
> exit_thread:
> EXT4_SB(sb)->s_mmp_tsk = NULL;
> kfree(data);
> brelse(bh);
> return retval;
> }
out of the thread function. That means hanging struct mmpd_data off
the struct ext4_sb_info structure, and then adding a function like
this to fs/ext4/mmp.c
static void ext4_stop_mmpd(struct ext4_sb_info *sbi)
{
if (sbi->s_mmp_tsk) {
kthread_stop(sbi->s_mmp_tsk);
brelse(sbi->s_mmp_data->bh);
kfree(sbi->s_mmp_data);
sbi->s_mmp_data = NULL;
sbi->s_mmp_tsk = NULL;
}
}
Basically, just move all of the cleanup so it is done after the
kthread is stopped, so we don't have to do any fancy error checking.
We just do it unconditionally.
- Ted
P.S. Actually, we could drop the struct mmpd_data altogether, and
just hang the buffer head as sbi->s_mmp_bh. Then we can just pass the
struct super * as the private pointer to kmmpd().
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] ext4: fix memory leak in ext4_fill_super
2021-04-29 21:41 ` Theodore Ts'o
@ 2021-04-29 22:05 ` Pavel Skripkin
2021-04-30 3:44 ` Theodore Ts'o
0 siblings, 1 reply; 16+ messages in thread
From: Pavel Skripkin @ 2021-04-29 22:05 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Vegard Nossum, akpm, peterz, axboe, pmladek, adilger.kernel,
linux-ext4, linux-kernel, syzbot+d9e482e303930fa4f6ff
On Thu, 29 Apr 2021 17:41:12 -0400
"Theodore Ts'o" <tytso@mit.edu> wrote:
> On Thu, Apr 29, 2021 at 11:09:56PM +0300, Pavel Skripkin wrote:
> > > we will exit with -ENOMEM. So at the very least all callers of
> > > kthread_stop() also need to check for -ENOMEM as well as -EINTR
> > > --- or, be somehow sure that the thread function was successfully
> > > called and started. In this particular case, the ext4 mount code
> > > had just started the kmmpd thread, and then detected that
> > > something else had gone wrong, and failed the mount before the
> > > kmmpd thread ever had a chance to run.
> >
> > There is a small problem about -ENOMEM...
>
> What I'd suggest is that we simply move
>
> > exit_thread:
> > EXT4_SB(sb)->s_mmp_tsk = NULL;
> > kfree(data);
> > brelse(bh);
> > return retval;
> > }
>
> out of the thread function. That means hanging struct mmpd_data off
> the struct ext4_sb_info structure, and then adding a function like
> this to fs/ext4/mmp.c
>
> static void ext4_stop_mmpd(struct ext4_sb_info *sbi)
> {
> if (sbi->s_mmp_tsk) {
> kthread_stop(sbi->s_mmp_tsk);
> brelse(sbi->s_mmp_data->bh);
> kfree(sbi->s_mmp_data);
> sbi->s_mmp_data = NULL;
> sbi->s_mmp_tsk = NULL;
> }
> }
>
> Basically, just move all of the cleanup so it is done after the
> kthread is stopped, so we don't have to do any fancy error checking.
> We just do it unconditionally.
This sound much better than my idea. Will do it in v2.
Thanks!
>
> - Ted
>
> P.S. Actually, we could drop the struct mmpd_data altogether, and
> just hang the buffer head as sbi->s_mmp_bh. Then we can just pass the
> struct super * as the private pointer to kmmpd().
>
With regards,
Pavel Skripkin
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] ext4: fix memory leak in ext4_fill_super
2021-04-29 22:05 ` Pavel Skripkin
@ 2021-04-30 3:44 ` Theodore Ts'o
0 siblings, 0 replies; 16+ messages in thread
From: Theodore Ts'o @ 2021-04-30 3:44 UTC (permalink / raw)
To: Pavel Skripkin
Cc: Vegard Nossum, akpm, peterz, axboe, pmladek, adilger.kernel,
linux-ext4, linux-kernel, syzbot+d9e482e303930fa4f6ff
On Fri, Apr 30, 2021 at 01:05:47AM +0300, Pavel Skripkin wrote:
> > out of the thread function. That means hanging struct mmpd_data off
> > the struct ext4_sb_info structure, and then adding a function like
> > this to fs/ext4/mmp.c
> >
> > static void ext4_stop_mmpd(struct ext4_sb_info *sbi)
That should "extern void ...", since it will be called from
fs/ext4/super.c. I had originally was thinking to put this function
in fs/ext4/super.c, but from the perspective of keeping the MMP code
in the same source file, it probably makes sense to keep this function
with the other MMP functions.
> > {
> > if (sbi->s_mmp_tsk) {
> > kthread_stop(sbi->s_mmp_tsk);
> > brelse(sbi->s_mmp_data->bh);
> > kfree(sbi->s_mmp_data);
> > sbi->s_mmp_data = NULL;
> > sbi->s_mmp_tsk = NULL;
> > }
> > }
> >
> > Basically, just move all of the cleanup so it is done after the
> > kthread is stopped, so we don't have to do any fancy error checking.
> > We just do it unconditionally.
Cheers,
- Ted
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2021-05-21 16:12 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-04-28 22:19 [PATCH] ext4: fix memory leak in ext4_fill_super Alexey Makhalov
2021-05-21 4:43 ` Theodore Y. Ts'o
2021-05-21 7:43 ` Alexey Makhalov
2021-05-21 7:55 ` [PATCH v2] " Alexey Makhalov
2021-05-21 14:29 ` [PATCH] " Theodore Y. Ts'o
2021-05-21 16:12 ` Alexey Makhalov
-- strict thread matches above, loose matches on Subject: below --
2021-04-28 17:28 Pavel Skripkin
2021-04-29 10:01 ` Vegard Nossum
2021-04-29 11:08 ` Pavel Skripkin
2021-04-29 11:33 ` Pavel Skripkin
2021-04-29 17:05 ` Theodore Ts'o
2021-04-29 19:20 ` Pavel Skripkin
2021-04-29 20:09 ` Pavel Skripkin
2021-04-29 21:41 ` Theodore Ts'o
2021-04-29 22:05 ` Pavel Skripkin
2021-04-30 3:44 ` Theodore Ts'o
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).