memory leak: data=journal and {collapse,insert,zero}

public inbox for linux-ext4@vger.kernel.org
 help / color / mirror / Atom feed

* memory leak: data=journal and {collapse,insert,zero}_range
@ 2015-10-17 16:02 Theodore Ts'o
  2015-10-20 12:06 ` Namjae Jeon
  0 siblings, 1 reply; 13+ messages in thread
From: Theodore Ts'o @ 2015-10-17 16:02 UTC (permalink / raw)
  To: linux-ext4; +Cc: Namjae Jeon

On the last conference call I mentioned that I had recently been
running into memory leaks with data=journal.  I've traced it down to
{collapse,insert,zero}_range calls.

It bisected down to 1ce01c4a199: "ext4: fix COLLAPSE_RANGE test
failure in data journalling mode", and specifically, this bit of code
the collapse/insert/zero range functions:

	/* Call ext4_force_commit to flush all data in case of data=journal. */
	if (ext4_should_journal_data(inode)) {
		ret = ext4_force_commit(inode->i_sb);
		if (ret)
			return ret;
	}

This is needed to prevent data from being lost with the insert and
collapse range calls.  (It is not obvious to me whether this bit of
code, and the filemap_write_and_wait_range() call, is needed for
zero_range, but I haven't had a chance to investigate this more
deeply.)

Further investigation shows that the problem seems to be a buffer_head
leak.  Running generic/091 with data=journal in a loop, and using the
following between each test runs:

#!/bin/sh
# /tmp/hooks/logger
echo 3 > /proc/sys/vm/drop_caches
logger $(grep MemFree /proc/meminfo)
logger $(grep buffer_head /proc/slabinfo)
logger $(grep jbd2_journal_head /proc/slabinfo)

what we find is:

logger: MemFree: 7348684 kB
logger: buffer_head 608 1600 128 32 1 : tunables 32 16 8 : slabdata 50 50 0 : globalstat 108003 36592 3294 0 0 1 0 0 0 : cpustat 102424 6803 101863 6760
logger: jbd2_journal_head 120 120 136 30 1 : tunables 32 16 8 : slabdata 4 4 0 : globalstat 203 120 4 0 0 0 0 0 0 : cpustat 179 14 84 0

logger: MemFree: 7155576 kB
logger: buffer_head 45008 65920 128 32 1 : tunables 32 16 8 : slabdata 2060 2060 0 : globalstat 207971 79408 5753 0 0 1 0 0 0 : cpustat 354158 14482 311984 11664
logger: jbd2_journal_head 206 690 136 30 1 : tunables 32 16 8 : slabdata 23 23 128 : globalstat 38817 11086 1142 0 0 0 0 0 0 : cpustat 68275 3482 68402 3326

kernel: logger (5838): drop_caches: 3
logger: MemFree: 6967656 kB
logger: buffer_head 89552 111616 128 32 1 : tunables 32 16 8 : slabdata 3488 3488 0 : globalstat 308435 124352 7589 0 0 1 0 0 0 : cpustat 606287 22193 522363 16591
logger: jbd2_journal_head 210 660 136 30 1 : tunables 32 16 8 : slabdata 22 22 128 : globalstat 77205 11086 2158 0 0 0 0 0 0 : cpustat 136335 6954 136593 6669

logger: MemFree: 6780176 kB
logger: buffer_head 134080 155328 128 32 1 : tunables 32 16 8 : slabdata 4854 4854 0 : globalstat 408163 168144 9369 0 0 1 0 0 0 : cpustat 857710 29874 732032 21489
logger: jbd2_journal_head 210 570 136 30 1 : tunables 32 16 8 : slabdata 19 19 128 : globalstat 115685 11086 3198 0 0 0 0 0 0 : cpustat 204408 10417 204798 9998

The problem then appears to be with the buffer heads associated with
pages that have been recently written in data=journal mode are not
getting freed.  I haven't had time to take this any further, but I
wanted to document what I've found to date as a checkpoint.  If
someone has time to take this further, or has any additional insight,
I'd certainly appreciate it!

							- Ted

P.S.  I will shortly be releasing an enhancement to kvm-xfstests and
gce-xfstests which allows collapse/insert/zero range to be easily
disabled, like this:

gce-xfstests --hooks /tmp/hooks  -C 20 -c data_journal --no-collapse --no-zero --no-insert generic/091

(The above traces were taken without --no-collapse --no-zero and
--noinsert.  With those options added, the memory leak goes away.)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: memory leak: data=journal and {collapse,insert,zero}_range
  2015-10-17 16:02 memory leak: data=journal and {collapse,insert,zero}_range Theodore Ts'o
@ 2015-10-20 12:06 ` Namjae Jeon
  2015-10-20 15:54   ` Theodore Ts'o
  0 siblings, 1 reply; 13+ messages in thread
From: Namjae Jeon @ 2015-10-20 12:06 UTC (permalink / raw)
  To: 'Theodore Ts'o'; +Cc: linux-ext4

Hi Ted,

I will check this issue.

Thanks!

> -----Original Message-----
> From: Theodore Ts'o [mailto:tytso@mit.edu]
> Sent: Sunday, October 18, 2015 1:03 AM
> To: linux-ext4@vger.kernel.org
> Cc: Namjae Jeon
> Subject: memory leak: data=journal and {collapse,insert,zero}_range
> 
> On the last conference call I mentioned that I had recently been
> running into memory leaks with data=journal.  I've traced it down to
> {collapse,insert,zero}_range calls.
> 
> It bisected down to 1ce01c4a199: "ext4: fix COLLAPSE_RANGE test
> failure in data journalling mode", and specifically, this bit of code
> the collapse/insert/zero range functions:
> 
> 	/* Call ext4_force_commit to flush all data in case of data=journal. */
> 	if (ext4_should_journal_data(inode)) {
> 		ret = ext4_force_commit(inode->i_sb);
> 		if (ret)
> 			return ret;
> 	}
> 
> This is needed to prevent data from being lost with the insert and
> collapse range calls.  (It is not obvious to me whether this bit of
> code, and the filemap_write_and_wait_range() call, is needed for
> zero_range, but I haven't had a chance to investigate this more
> deeply.)
> 
> Further investigation shows that the problem seems to be a buffer_head
> leak.  Running generic/091 with data=journal in a loop, and using the
> following between each test runs:
> 
> #!/bin/sh
> # /tmp/hooks/logger
> echo 3 > /proc/sys/vm/drop_caches
> logger $(grep MemFree /proc/meminfo)
> logger $(grep buffer_head /proc/slabinfo)
> logger $(grep jbd2_journal_head /proc/slabinfo)
> 
> what we find is:
> 
> logger: MemFree: 7348684 kB
> logger: buffer_head 608 1600 128 32 1 : tunables 32 16 8 : slabdata 50 50 0 : globalstat 108003 36592
> 3294 0 0 1 0 0 0 : cpustat 102424 6803 101863 6760
> logger: jbd2_journal_head 120 120 136 30 1 : tunables 32 16 8 : slabdata 4 4 0 : globalstat 203 120 4
> 0 0 0 0 0 0 : cpustat 179 14 84 0
> 
> logger: MemFree: 7155576 kB
> logger: buffer_head 45008 65920 128 32 1 : tunables 32 16 8 : slabdata 2060 2060 0 : globalstat 207971
> 79408 5753 0 0 1 0 0 0 : cpustat 354158 14482 311984 11664
> logger: jbd2_journal_head 206 690 136 30 1 : tunables 32 16 8 : slabdata 23 23 128 : globalstat 38817
> 11086 1142 0 0 0 0 0 0 : cpustat 68275 3482 68402 3326
> 
> kernel: logger (5838): drop_caches: 3
> logger: MemFree: 6967656 kB
> logger: buffer_head 89552 111616 128 32 1 : tunables 32 16 8 : slabdata 3488 3488 0 : globalstat
> 308435 124352 7589 0 0 1 0 0 0 : cpustat 606287 22193 522363 16591
> logger: jbd2_journal_head 210 660 136 30 1 : tunables 32 16 8 : slabdata 22 22 128 : globalstat 77205
> 11086 2158 0 0 0 0 0 0 : cpustat 136335 6954 136593 6669
> 
> logger: MemFree: 6780176 kB
> logger: buffer_head 134080 155328 128 32 1 : tunables 32 16 8 : slabdata 4854 4854 0 : globalstat
> 408163 168144 9369 0 0 1 0 0 0 : cpustat 857710 29874 732032 21489
> logger: jbd2_journal_head 210 570 136 30 1 : tunables 32 16 8 : slabdata 19 19 128 : globalstat 115685
> 11086 3198 0 0 0 0 0 0 : cpustat 204408 10417 204798 9998
> 
> The problem then appears to be with the buffer heads associated with
> pages that have been recently written in data=journal mode are not
> getting freed.  I haven't had time to take this any further, but I
> wanted to document what I've found to date as a checkpoint.  If
> someone has time to take this further, or has any additional insight,
> I'd certainly appreciate it!
> 
> 							- Ted
> 
> P.S.  I will shortly be releasing an enhancement to kvm-xfstests and
> gce-xfstests which allows collapse/insert/zero range to be easily
> disabled, like this:
> 
> gce-xfstests --hooks /tmp/hooks  -C 20 -c data_journal --no-collapse --no-zero --no-insert generic/091
> 
> (The above traces were taken without --no-collapse --no-zero and
> --noinsert.  With those options added, the memory leak goes away.)


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: memory leak: data=journal and {collapse,insert,zero}_range
  2015-10-20 12:06 ` Namjae Jeon
@ 2015-10-20 15:54   ` Theodore Ts'o
  2015-10-21  9:44     ` Namjae Jeon
  0 siblings, 1 reply; 13+ messages in thread
From: Theodore Ts'o @ 2015-10-20 15:54 UTC (permalink / raw)
  To: Namjae Jeon; +Cc: linux-ext4

On Tue, Oct 20, 2015 at 09:06:08PM +0900, Namjae Jeon wrote:
> Hi Ted,
> 
> I will check this issue.

Thanks!

To be clear, the commit that I bisected this down to was fixing a real
issue, so I'm not really blaming your commit.  Furthermore, it wasn't
causing test failures until very recently, and part of this is
probably caused by a change in the test environment, and also probably
a change in the most recent kernels about how memory reclaim and how
we handle low memory situations.  However, once I started seeing test
failures, I started looking harder for memory leaks, and it was clear
that we had memory leaks in the data=journal scenario starting with
1ce01c4a199.

Interestingly we're not seeing these memory leaks on the truncate
path, so I suspect the issue is in how collapse range is clearing
pages from the page cache, especially pages that were freshly written
to the journal by the commit but which hadn't yet been writtten to
disk and then marked as complete so we can allow the relevant
transaction to be checkpointed.  (Although we're not leaking the
journal head structures, but only the buffer heads, so the story most
be a bit more complicated than that.)

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: memory leak: data=journal and {collapse,insert,zero}_range
  2015-10-20 15:54   ` Theodore Ts'o
@ 2015-10-21  9:44     ` Namjae Jeon
  2015-10-21 14:52       ` Theodore Ts'o
  0 siblings, 1 reply; 13+ messages in thread
From: Namjae Jeon @ 2015-10-21  9:44 UTC (permalink / raw)
  To: 'Theodore Ts'o'; +Cc: linux-ext4

> 
> On Tue, Oct 20, 2015 at 09:06:08PM +0900, Namjae Jeon wrote:
> > Hi Ted,
> >
> > I will check this issue.
> 
> Thanks!
> 
> To be clear, the commit that I bisected this down to was fixing a real
> issue, so I'm not really blaming your commit.  Furthermore, it wasn't
> causing test failures until very recently, and part of this is
> probably caused by a change in the test environment, and also probably
> a change in the most recent kernels about how memory reclaim and how
> we handle low memory situations.  However, once I started seeing test
> failures, I started looking harder for memory leaks, and it was clear
> that we had memory leaks in the data=journal scenario starting with
> 1ce01c4a199.
> 
> Interestingly we're not seeing these memory leaks on the truncate
> path, so I suspect the issue is in how collapse range is clearing
> pages from the page cache, especially pages that were freshly written
> to the journal by the commit but which hadn't yet been writtten to
> disk and then marked as complete so we can allow the relevant
> transaction to be checkpointed.  (Although we're not leaking the
> journal head structures, but only the buffer heads, so the story most
> be a bit more complicated than that.)

Okay, Thanks for sharing your view and points !!

Currently I can reproduce memory leak issue without collase/insert/zero range.
conditions like the following.(collase/insert/zero range are disable with -I -C -z option and add -y option instead of -W)
  1. small size parition(1GB)
  2. run fsx with these options "./fsx -N 30000 -o 128000 -l 500000 -r 4096 -t 512 -w 512 -Z -R -y -I -C -z testfile"
And same result with generic/091 is showing (buffer_head leak)

So I am starting to find root-cause base on your points.
I will share the result or the patch.

Thanks~
> 
> Cheers,
> 
> 					- Ted
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: memory leak: data=journal and {collapse,insert,zero}_range
  2015-10-21  9:44     ` Namjae Jeon
@ 2015-10-21 14:52       ` Theodore Ts'o
  2015-11-09  5:21         ` Namjae Jeon
  0 siblings, 1 reply; 13+ messages in thread
From: Theodore Ts'o @ 2015-10-21 14:52 UTC (permalink / raw)
  To: Namjae Jeon; +Cc: linux-ext4

On Wed, Oct 21, 2015 at 06:44:10PM +0900, Namjae Jeon wrote:
> > Interestingly we're not seeing these memory leaks on the truncate
> > path, so I suspect the issue is in how collapse range is clearing
> > pages from the page cache, especially pages that were freshly written
> > to the journal by the commit but which hadn't yet been writtten to
> > disk and then marked as complete so we can allow the relevant
> > transaction to be checkpointed.  (Although we're not leaking the
> > journal head structures, but only the buffer heads, so the story most
> > be a bit more complicated than that.)
> 
> Okay, Thanks for sharing your view and points !!
> 
> Currently I can reproduce memory leak issue without collase/insert/zero range.
> conditions like the following.(collase/insert/zero range are disable with -I -C -z option and add -y option instead of -W)
>   1. small size parition(1GB)
>   2. run fsx with these options "./fsx -N 30000 -o 128000 -l 500000 -r 4096 -t 512 -w 512 -Z -R -y -I -C -z testfile"
> And same result with generic/091 is showing (buffer_head leak)
> 
> So I am starting to find root-cause base on your points.
> I will share the result or the patch.

Thanks, that's very interesting data point.  So this makes it appear
that the problem *is* probably with how we deal with checkpointing
buffers after the pages get discarded using either a truncate or a
collapse_range, since the 'y' option causes a lot fsync's, and hence
commits, some of which are happening after a truncate command.

Thanks for a taking a look at this.  I really appreciate it.

Cheers,

					- Ted










^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: memory leak: data=journal and {collapse,insert,zero}_range
  2015-10-21 14:52       ` Theodore Ts'o
@ 2015-11-09  5:21         ` Namjae Jeon
  2015-11-10 14:49           ` Jan Kara
  0 siblings, 1 reply; 13+ messages in thread
From: Namjae Jeon @ 2015-11-09  5:21 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4

> 
> On Wed, Oct 21, 2015 at 06:44:10PM +0900, Namjae Jeon wrote:
> > > Interestingly we're not seeing these memory leaks on the truncate
> > > path, so I suspect the issue is in how collapse range is clearing
> > > pages from the page cache, especially pages that were freshly written
> > > to the journal by the commit but which hadn't yet been writtten to
> > > disk and then marked as complete so we can allow the relevant
> > > transaction to be checkpointed.  (Although we're not leaking the
> > > journal head structures, but only the buffer heads, so the story most
> > > be a bit more complicated than that.)
> >
> > Okay, Thanks for sharing your view and points !!
> >
> > Currently I can reproduce memory leak issue without collase/insert/zero range.
> > conditions like the following.(collase/insert/zero range are disable with -I -C -z option and add -y
> option instead of -W)
> >   1. small size parition(1GB)
> >   2. run fsx with these options "./fsx -N 30000 -o 128000 -l 500000 -r 4096 -t 512 -w 512 -Z -R -y -
> I -C -z testfile"
> > And same result with generic/091 is showing (buffer_head leak)
> >
> > So I am starting to find root-cause base on your points.
> > I will share the result or the patch.
> 
> Thanks, that's very interesting data point.  So this makes it appear
> that the problem *is* probably with how we deal with checkpointing
> buffers after the pages get discarded using either a truncate or a
> collapse_range, since the 'y' option causes a lot fsync's, and hence
> commits, some of which are happening after a truncate command.
> 
> Thanks for a taking a look at this.  I really appreciate it.
> 
> Cheers,


Hi Ted,

Could you review this patch?

Thanks!
---------------------------------------------------------------------------------
Subject: [PATCH] jbd2: try to free buffers from truncated page after
 checkpoint

when ext4 is mounted in data=journal mode, and truncate operation
such as settatr(size), collopse, insert and zero range are used, there are
are many truncated pages with NULL page->mapping. Such truncated pages
pile up quickly due to truncate_pagecache on data pages associated with journal.
As page->mapping is NULL for such truncated pages, they are not freed
by drop cache(3) or umount. As a result, MemFree in /proc/meminfo decreases
quickly and active buffer_head slab objects grow in /proc/slabinfo.
This patch attempts to free buffers from such pages at the end of jbd2
checkpoint, if pages do not have any busy buffers and NULL mapping. 

---
 fs/jbd2/checkpoint.c | 11 +++++++++++
 fs/jbd2/commit.c     |  2 +-
 include/linux/jbd2.h |  1 +
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 4227dc4..bf68442 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -529,6 +529,7 @@ int __jbd2_journal_remove_checkpoint(struct journal_head *jh)
 	transaction_t *transaction;
 	journal_t *journal;
 	int ret = 0;
+	struct buffer_head *bh;
 
 	JBUFFER_TRACE(jh, "entry");
 
@@ -538,10 +539,20 @@ int __jbd2_journal_remove_checkpoint(struct journal_head *jh)
 	}
 	journal = transaction->t_journal;
 
+	bh = jh2bh(jh);
+	get_bh(bh);
 	JBUFFER_TRACE(jh, "removing from transaction");
 	__buffer_unlink(jh);
 	jh->b_cp_transaction = NULL;
 	jbd2_journal_put_journal_head(jh);
+	/*
+	 * if journal head is freed, try to free buffers from a truncated
+	 * page, if page buffers are not busy and page->mapping is NULL
+	 */
+	if (!buffer_jbd(bh))
+		release_buffer_page(bh);
+	else
+		__brelse(bh);
 
 	if (transaction->t_checkpoint_list != NULL ||
 	    transaction->t_checkpoint_io_list != NULL)
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index b73e021..d94ec3f 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -63,7 +63,7 @@ static void journal_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
  * Called under lock_journal(), and possibly under journal_datalist_lock.  The
  * caller provided us with a ref against the buffer, and we drop that here.
  */
-static void release_buffer_page(struct buffer_head *bh)
+void release_buffer_page(struct buffer_head *bh)
 {
 	struct page *page;
 
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index edb640a..523f345 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1039,6 +1039,7 @@ int __jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
 void jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
 
 /* Commit management */
+extern void release_buffer_page(struct buffer_head *);
 extern void jbd2_journal_commit_transaction(journal_t *);
 
 /* Checkpoint list management */

----------------------------------------------------------
> 
> 					- Ted
> 
> 
> 
> 
> 
> 
> 



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: memory leak: data=journal and {collapse,insert,zero}_range
  2015-11-09  5:21         ` Namjae Jeon
@ 2015-11-10 14:49           ` Jan Kara
  2015-11-17  4:47             ` Namjae Jeon
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Kara @ 2015-11-10 14:49 UTC (permalink / raw)
  To: Namjae Jeon; +Cc: Theodore Ts'o, linux-ext4

On Mon 09-11-15 14:21:11, Namjae Jeon wrote:
> > 
> > On Wed, Oct 21, 2015 at 06:44:10PM +0900, Namjae Jeon wrote:
> > > > Interestingly we're not seeing these memory leaks on the truncate
> > > > path, so I suspect the issue is in how collapse range is clearing
> > > > pages from the page cache, especially pages that were freshly written
> > > > to the journal by the commit but which hadn't yet been writtten to
> > > > disk and then marked as complete so we can allow the relevant
> > > > transaction to be checkpointed.  (Although we're not leaking the
> > > > journal head structures, but only the buffer heads, so the story most
> > > > be a bit more complicated than that.)
> > >
> > > Okay, Thanks for sharing your view and points !!
> > >
> > > Currently I can reproduce memory leak issue without collase/insert/zero range.
> > > conditions like the following.(collase/insert/zero range are disable with -I -C -z option and add -y
> > option instead of -W)
> > >   1. small size parition(1GB)
> > >   2. run fsx with these options "./fsx -N 30000 -o 128000 -l 500000 -r 4096 -t 512 -w 512 -Z -R -y -
> > I -C -z testfile"
> > > And same result with generic/091 is showing (buffer_head leak)
> > >
> > > So I am starting to find root-cause base on your points.
> > > I will share the result or the patch.
> > 
> > Thanks, that's very interesting data point.  So this makes it appear
> > that the problem *is* probably with how we deal with checkpointing
> > buffers after the pages get discarded using either a truncate or a
> > collapse_range, since the 'y' option causes a lot fsync's, and hence
> > commits, some of which are happening after a truncate command.
> > 
> > Thanks for a taking a look at this.  I really appreciate it.
> > 
> > Cheers,
> 
> 
> Hi Ted,
> 
> Could you review this patch?
> 
> Thanks!
> ---------------------------------------------------------------------------------
> Subject: [PATCH] jbd2: try to free buffers from truncated page after
>  checkpoint
> 
> when ext4 is mounted in data=journal mode, and truncate operation
> such as settatr(size), collopse, insert and zero range are used, there are
> are many truncated pages with NULL page->mapping. Such truncated pages
> pile up quickly due to truncate_pagecache on data pages associated with journal.
> As page->mapping is NULL for such truncated pages, they are not freed
> by drop cache(3) or umount. As a result, MemFree in /proc/meminfo decreases
> quickly and active buffer_head slab objects grow in /proc/slabinfo.
> This patch attempts to free buffers from such pages at the end of jbd2
> checkpoint, if pages do not have any busy buffers and NULL mapping. 

Hum, why such pages didn't get freed by release_buffer_page() call
happening when processing transaction's forget list? Because the idea is
that such pages should be discarded at that point...

								Honza
> 
> ---
>  fs/jbd2/checkpoint.c | 11 +++++++++++
>  fs/jbd2/commit.c     |  2 +-
>  include/linux/jbd2.h |  1 +
>  3 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> index 4227dc4..bf68442 100644
> --- a/fs/jbd2/checkpoint.c
> +++ b/fs/jbd2/checkpoint.c
> @@ -529,6 +529,7 @@ int __jbd2_journal_remove_checkpoint(struct journal_head *jh)
>  	transaction_t *transaction;
>  	journal_t *journal;
>  	int ret = 0;
> +	struct buffer_head *bh;
>  
>  	JBUFFER_TRACE(jh, "entry");
>  
> @@ -538,10 +539,20 @@ int __jbd2_journal_remove_checkpoint(struct journal_head *jh)
>  	}
>  	journal = transaction->t_journal;
>  
> +	bh = jh2bh(jh);
> +	get_bh(bh);
>  	JBUFFER_TRACE(jh, "removing from transaction");
>  	__buffer_unlink(jh);
>  	jh->b_cp_transaction = NULL;
>  	jbd2_journal_put_journal_head(jh);
> +	/*
> +	 * if journal head is freed, try to free buffers from a truncated
> +	 * page, if page buffers are not busy and page->mapping is NULL
> +	 */
> +	if (!buffer_jbd(bh))
> +		release_buffer_page(bh);
> +	else
> +		__brelse(bh);
>  
>  	if (transaction->t_checkpoint_list != NULL ||
>  	    transaction->t_checkpoint_io_list != NULL)
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index b73e021..d94ec3f 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -63,7 +63,7 @@ static void journal_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
>   * Called under lock_journal(), and possibly under journal_datalist_lock.  The
>   * caller provided us with a ref against the buffer, and we drop that here.
>   */
> -static void release_buffer_page(struct buffer_head *bh)
> +void release_buffer_page(struct buffer_head *bh)
>  {
>  	struct page *page;
>  
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index edb640a..523f345 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -1039,6 +1039,7 @@ int __jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
>  void jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
>  
>  /* Commit management */
> +extern void release_buffer_page(struct buffer_head *);
>  extern void jbd2_journal_commit_transaction(journal_t *);
>  
>  /* Checkpoint list management */
> 
> ----------------------------------------------------------
> > 
> > 					- Ted
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: memory leak: data=journal and {collapse,insert,zero}_range
  2015-11-10 14:49           ` Jan Kara
@ 2015-11-17  4:47             ` Namjae Jeon
  2015-11-18 21:36               ` Jan Kara
  0 siblings, 1 reply; 13+ messages in thread
From: Namjae Jeon @ 2015-11-17  4:47 UTC (permalink / raw)
  To: 'Jan Kara'; +Cc: 'Theodore Ts'o', linux-ext4

> On Mon 09-11-15 14:21:11, Namjae Jeon wrote:
> > >
> > > On Wed, Oct 21, 2015 at 06:44:10PM +0900, Namjae Jeon wrote:
> > > > > Interestingly we're not seeing these memory leaks on the truncate
> > > > > path, so I suspect the issue is in how collapse range is clearing
> > > > > pages from the page cache, especially pages that were freshly written
> > > > > to the journal by the commit but which hadn't yet been writtten to
> > > > > disk and then marked as complete so we can allow the relevant
> > > > > transaction to be checkpointed.  (Although we're not leaking the
> > > > > journal head structures, but only the buffer heads, so the story most
> > > > > be a bit more complicated than that.)
> > > >
> > > > Okay, Thanks for sharing your view and points !!
> > > >
> > > > Currently I can reproduce memory leak issue without collase/insert/zero range.
> > > > conditions like the following.(collase/insert/zero range are disable with -I -C -z option and
> add -y
> > > option instead of -W)
> > > >   1. small size parition(1GB)
> > > >   2. run fsx with these options "./fsx -N 30000 -o 128000 -l 500000 -r 4096 -t 512 -w 512 -Z -R
> -y -
> > > I -C -z testfile"
> > > > And same result with generic/091 is showing (buffer_head leak)
> > > >
> > > > So I am starting to find root-cause base on your points.
> > > > I will share the result or the patch.
> > >
> > > Thanks, that's very interesting data point.  So this makes it appear
> > > that the problem *is* probably with how we deal with checkpointing
> > > buffers after the pages get discarded using either a truncate or a
> > > collapse_range, since the 'y' option causes a lot fsync's, and hence
> > > commits, some of which are happening after a truncate command.
> > >
> > > Thanks for a taking a look at this.  I really appreciate it.
> > >
> > > Cheers,
> >
> >
> > Hi Ted,
> >
> > Could you review this patch?
> >
> > Thanks!
> > ---------------------------------------------------------------------------------
> > Subject: [PATCH] jbd2: try to free buffers from truncated page after
> >  checkpoint
> >
> > when ext4 is mounted in data=journal mode, and truncate operation
> > such as settatr(size), collopse, insert and zero range are used, there are
> > are many truncated pages with NULL page->mapping. Such truncated pages
> > pile up quickly due to truncate_pagecache on data pages associated with journal.
> > As page->mapping is NULL for such truncated pages, they are not freed
> > by drop cache(3) or umount. As a result, MemFree in /proc/meminfo decreases
> > quickly and active buffer_head slab objects grow in /proc/slabinfo.
> > This patch attempts to free buffers from such pages at the end of jbd2
> > checkpoint, if pages do not have any busy buffers and NULL mapping.
> 
> Hum, why such pages didn't get freed by release_buffer_page() call
> happening when processing transaction's forget list? Because the idea is
> that such pages should be discarded at that point...
Hi Jan,

When I checked this code, release_buffer_page is not called
when buffer_jbddirty is true. Such buffers with JBD_Dirty are added
to new checkpoint.

                if (buffer_jbddirty(bh)) {
                       JBUFFER_TRACE(jh, "add to new checkpointing trans");
                       __jbd2_journal_insert_checkpoint(jh, commit_transaction);
                       if (is_journal_aborted(journal))
                               clear_buffer_jbddirty(bh);
                } else {
                       J_ASSERT_BH(bh, !buffer_dirty(bh));
                       /*
                        * The buffer on BJ_Forget list and not jbddirty means
                        * it has been freed by this transaction and hence it
                        * could not have been reallocated until this
                        * transaction has committed. *BUT* it could be
                        * reallocated once we have written all the data to
                        * disk and before we process the buffer on BJ_Forget
                        * list.
                        */
                       if (!jh->b_next_transaction)
                               try_to_free = 1;
                }
                JBUFFER_TRACE(jh, "refile or unfile buffer");
                __jbd2_journal_refile_buffer(jh);
                jbd_unlock_bh_state(bh);

Next buffer was unfiled by __jbd2_journal_refile_buffer, JBD_Dirty cleared and
BH_Dirty was set in same function. Later it must have been written back as
BH_Dirty was cleared.
And ext4_wait_for_tail_page_commit-> __ext4_journalled_invalidatepage ->
journal_unmap_buffer zaps the buffer:

                if (!buffer_dirty(bh)) {
                        /* bdflush has written it.  We can drop it now */
                        goto zap_buffer;
                }

next, truncate_pagecache is called, which clears the page mapping.
Eventually, remove checkpoint is called, but such page with NULL mapping was
not freed. So, I had added release_buffer_page at the end of remove checkpoint
to attempt to free such free buffer pages. Please let me know your opinion.

Thanks.
> 								Honza
> >
> > ---
> >  fs/jbd2/checkpoint.c | 11 +++++++++++
> >  fs/jbd2/commit.c     |  2 +-
> >  include/linux/jbd2.h |  1 +
> >  3 files changed, 13 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> > index 4227dc4..bf68442 100644
> > --- a/fs/jbd2/checkpoint.c
> > +++ b/fs/jbd2/checkpoint.c
> > @@ -529,6 +529,7 @@ int __jbd2_journal_remove_checkpoint(struct journal_head *jh)
> >  	transaction_t *transaction;
> >  	journal_t *journal;
> >  	int ret = 0;
> > +	struct buffer_head *bh;
> >
> >  	JBUFFER_TRACE(jh, "entry");
> >
> > @@ -538,10 +539,20 @@ int __jbd2_journal_remove_checkpoint(struct journal_head *jh)
> >  	}
> >  	journal = transaction->t_journal;
> >
> > +	bh = jh2bh(jh);
> > +	get_bh(bh);
> >  	JBUFFER_TRACE(jh, "removing from transaction");
> >  	__buffer_unlink(jh);
> >  	jh->b_cp_transaction = NULL;
> >  	jbd2_journal_put_journal_head(jh);
> > +	/*
> > +	 * if journal head is freed, try to free buffers from a truncated
> > +	 * page, if page buffers are not busy and page->mapping is NULL
> > +	 */
> > +	if (!buffer_jbd(bh))
> > +		release_buffer_page(bh);
> > +	else
> > +		__brelse(bh);
> >
> >  	if (transaction->t_checkpoint_list != NULL ||
> >  	    transaction->t_checkpoint_io_list != NULL)
> > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> > index b73e021..d94ec3f 100644
> > --- a/fs/jbd2/commit.c
> > +++ b/fs/jbd2/commit.c
> > @@ -63,7 +63,7 @@ static void journal_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
> >   * Called under lock_journal(), and possibly under journal_datalist_lock.  The
> >   * caller provided us with a ref against the buffer, and we drop that here.
> >   */
> > -static void release_buffer_page(struct buffer_head *bh)
> > +void release_buffer_page(struct buffer_head *bh)
> >  {
> >  	struct page *page;
> >
> > diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> > index edb640a..523f345 100644
> > --- a/include/linux/jbd2.h
> > +++ b/include/linux/jbd2.h
> > @@ -1039,6 +1039,7 @@ int __jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
> >  void jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
> >
> >  /* Commit management */
> > +extern void release_buffer_page(struct buffer_head *);
> >  extern void jbd2_journal_commit_transaction(journal_t *);
> >
> >  /* Checkpoint list management */
> >
> > ----------------------------------------------------------
> > >
> > > 					- Ted
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> >
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: memory leak: data=journal and {collapse,insert,zero}_range
  2015-11-17  4:47             ` Namjae Jeon
@ 2015-11-18 21:36               ` Jan Kara
  2015-11-19  9:42                 ` Jan Kara
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Kara @ 2015-11-18 21:36 UTC (permalink / raw)
  To: Namjae Jeon; +Cc: 'Jan Kara', 'Theodore Ts'o', linux-ext4

On Tue 17-11-15 13:47:50, Namjae Jeon wrote:
> > On Mon 09-11-15 14:21:11, Namjae Jeon wrote:
> > > >
> > > > On Wed, Oct 21, 2015 at 06:44:10PM +0900, Namjae Jeon wrote:
> > > > > > Interestingly we're not seeing these memory leaks on the truncate
> > > > > > path, so I suspect the issue is in how collapse range is clearing
> > > > > > pages from the page cache, especially pages that were freshly written
> > > > > > to the journal by the commit but which hadn't yet been writtten to
> > > > > > disk and then marked as complete so we can allow the relevant
> > > > > > transaction to be checkpointed.  (Although we're not leaking the
> > > > > > journal head structures, but only the buffer heads, so the story most
> > > > > > be a bit more complicated than that.)
> > > > >
> > > > > Okay, Thanks for sharing your view and points !!
> > > > >
> > > > > Currently I can reproduce memory leak issue without collase/insert/zero range.
> > > > > conditions like the following.(collase/insert/zero range are disable with -I -C -z option and
> > add -y
> > > > option instead of -W)
> > > > >   1. small size parition(1GB)
> > > > >   2. run fsx with these options "./fsx -N 30000 -o 128000 -l 500000 -r 4096 -t 512 -w 512 -Z -R
> > -y -
> > > > I -C -z testfile"
> > > > > And same result with generic/091 is showing (buffer_head leak)
> > > > >
> > > > > So I am starting to find root-cause base on your points.
> > > > > I will share the result or the patch.
> > > >
> > > > Thanks, that's very interesting data point.  So this makes it appear
> > > > that the problem *is* probably with how we deal with checkpointing
> > > > buffers after the pages get discarded using either a truncate or a
> > > > collapse_range, since the 'y' option causes a lot fsync's, and hence
> > > > commits, some of which are happening after a truncate command.
> > > >
> > > > Thanks for a taking a look at this.  I really appreciate it.
> > > >
> > > > Cheers,
> > >
> > >
> > > Hi Ted,
> > >
> > > Could you review this patch?
> > >
> > > Thanks!
> > > ---------------------------------------------------------------------------------
> > > Subject: [PATCH] jbd2: try to free buffers from truncated page after
> > >  checkpoint
> > >
> > > when ext4 is mounted in data=journal mode, and truncate operation
> > > such as settatr(size), collopse, insert and zero range are used, there are
> > > are many truncated pages with NULL page->mapping. Such truncated pages
> > > pile up quickly due to truncate_pagecache on data pages associated with journal.
> > > As page->mapping is NULL for such truncated pages, they are not freed
> > > by drop cache(3) or umount. As a result, MemFree in /proc/meminfo decreases
> > > quickly and active buffer_head slab objects grow in /proc/slabinfo.
> > > This patch attempts to free buffers from such pages at the end of jbd2
> > > checkpoint, if pages do not have any busy buffers and NULL mapping.
> > 
> > Hum, why such pages didn't get freed by release_buffer_page() call
> > happening when processing transaction's forget list? Because the idea is
> > that such pages should be discarded at that point...
> Hi Jan,
> 
> When I checked this code, release_buffer_page is not called
> when buffer_jbddirty is true. Such buffers with JBD_Dirty are added
> to new checkpoint.
> 
>                 if (buffer_jbddirty(bh)) {
>                        JBUFFER_TRACE(jh, "add to new checkpointing trans");
>                        __jbd2_journal_insert_checkpoint(jh, commit_transaction);
>                        if (is_journal_aborted(journal))
>                                clear_buffer_jbddirty(bh);
>                 } else {
>                        J_ASSERT_BH(bh, !buffer_dirty(bh));
>                        /*
>                         * The buffer on BJ_Forget list and not jbddirty means
>                         * it has been freed by this transaction and hence it
>                         * could not have been reallocated until this
>                         * transaction has committed. *BUT* it could be
>                         * reallocated once we have written all the data to
>                         * disk and before we process the buffer on BJ_Forget
>                         * list.
>                         */
>                        if (!jh->b_next_transaction)
>                                try_to_free = 1;
>                 }
>                 JBUFFER_TRACE(jh, "refile or unfile buffer");
>                 __jbd2_journal_refile_buffer(jh);
>                 jbd_unlock_bh_state(bh);
> 
> Next buffer was unfiled by __jbd2_journal_refile_buffer, JBD_Dirty cleared and
> BH_Dirty was set in same function. Later it must have been written back as
> BH_Dirty was cleared.
> And ext4_wait_for_tail_page_commit-> __ext4_journalled_invalidatepage ->
> journal_unmap_buffer zaps the buffer:
> 
>                 if (!buffer_dirty(bh)) {
>                         /* bdflush has written it.  We can drop it now */
>                         goto zap_buffer;
>                 }
> 
> next, truncate_pagecache is called, which clears the page mapping.
> Eventually, remove checkpoint is called, but such page with NULL mapping was
> not freed. So, I had added release_buffer_page at the end of remove checkpoint
> to attempt to free such free buffer pages. Please let me know your opinion.

OK, thanks for the detailed analysis. But when the buffer gets truncated,
jbd2_journal_invalidatepage() either removes the buffer from the
transaction (obviously didn't happen here) or it sets buffer_freed and
buffer_jbddirty should get cleared when processing the BJ_Forget list. So
why that didn't happen? Can you have a look into what
jbd2_journal_invalidatepage() did to buffer in that page?

Generally truncated buffers should not enter checkpoint list since writing
them is just a waste of disk bandwidth...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: memory leak: data=journal and {collapse,insert,zero}_range
  2015-11-18 21:36               ` Jan Kara
@ 2015-11-19  9:42                 ` Jan Kara
  2015-11-20  4:34                   ` Namjae Jeon
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Kara @ 2015-11-19  9:42 UTC (permalink / raw)
  To: Namjae Jeon; +Cc: 'Jan Kara', 'Theodore Ts'o', linux-ext4

On Wed 18-11-15 22:36:51, Jan Kara wrote:
> On Tue 17-11-15 13:47:50, Namjae Jeon wrote:
> > > On Mon 09-11-15 14:21:11, Namjae Jeon wrote:
> > > > >
> > > > > On Wed, Oct 21, 2015 at 06:44:10PM +0900, Namjae Jeon wrote:
> > > > > > > Interestingly we're not seeing these memory leaks on the truncate
> > > > > > > path, so I suspect the issue is in how collapse range is clearing
> > > > > > > pages from the page cache, especially pages that were freshly written
> > > > > > > to the journal by the commit but which hadn't yet been writtten to
> > > > > > > disk and then marked as complete so we can allow the relevant
> > > > > > > transaction to be checkpointed.  (Although we're not leaking the
> > > > > > > journal head structures, but only the buffer heads, so the story most
> > > > > > > be a bit more complicated than that.)
> > > > > >
> > > > > > Okay, Thanks for sharing your view and points !!
> > > > > >
> > > > > > Currently I can reproduce memory leak issue without collase/insert/zero range.
> > > > > > conditions like the following.(collase/insert/zero range are disable with -I -C -z option and
> > > add -y
> > > > > option instead of -W)
> > > > > >   1. small size parition(1GB)
> > > > > >   2. run fsx with these options "./fsx -N 30000 -o 128000 -l 500000 -r 4096 -t 512 -w 512 -Z -R
> > > -y -
> > > > > I -C -z testfile"
> > > > > > And same result with generic/091 is showing (buffer_head leak)
> > > > > >
> > > > > > So I am starting to find root-cause base on your points.
> > > > > > I will share the result or the patch.
> > > > >
> > > > > Thanks, that's very interesting data point.  So this makes it appear
> > > > > that the problem *is* probably with how we deal with checkpointing
> > > > > buffers after the pages get discarded using either a truncate or a
> > > > > collapse_range, since the 'y' option causes a lot fsync's, and hence
> > > > > commits, some of which are happening after a truncate command.
> > > > >
> > > > > Thanks for a taking a look at this.  I really appreciate it.
> > > > >
> > > > > Cheers,
> > > >
> > > >
> > > > Hi Ted,
> > > >
> > > > Could you review this patch?
> > > >
> > > > Thanks!
> > > > ---------------------------------------------------------------------------------
> > > > Subject: [PATCH] jbd2: try to free buffers from truncated page after
> > > >  checkpoint
> > > >
> > > > when ext4 is mounted in data=journal mode, and truncate operation
> > > > such as settatr(size), collopse, insert and zero range are used, there are
> > > > are many truncated pages with NULL page->mapping. Such truncated pages
> > > > pile up quickly due to truncate_pagecache on data pages associated with journal.
> > > > As page->mapping is NULL for such truncated pages, they are not freed
> > > > by drop cache(3) or umount. As a result, MemFree in /proc/meminfo decreases
> > > > quickly and active buffer_head slab objects grow in /proc/slabinfo.
> > > > This patch attempts to free buffers from such pages at the end of jbd2
> > > > checkpoint, if pages do not have any busy buffers and NULL mapping.
> > > 
> > > Hum, why such pages didn't get freed by release_buffer_page() call
> > > happening when processing transaction's forget list? Because the idea is
> > > that such pages should be discarded at that point...
> > Hi Jan,
> > 
> > When I checked this code, release_buffer_page is not called
> > when buffer_jbddirty is true. Such buffers with JBD_Dirty are added
> > to new checkpoint.
> > 
> >                 if (buffer_jbddirty(bh)) {
> >                        JBUFFER_TRACE(jh, "add to new checkpointing trans");
> >                        __jbd2_journal_insert_checkpoint(jh, commit_transaction);
> >                        if (is_journal_aborted(journal))
> >                                clear_buffer_jbddirty(bh);
> >                 } else {
> >                        J_ASSERT_BH(bh, !buffer_dirty(bh));
> >                        /*
> >                         * The buffer on BJ_Forget list and not jbddirty means
> >                         * it has been freed by this transaction and hence it
> >                         * could not have been reallocated until this
> >                         * transaction has committed. *BUT* it could be
> >                         * reallocated once we have written all the data to
> >                         * disk and before we process the buffer on BJ_Forget
> >                         * list.
> >                         */
> >                        if (!jh->b_next_transaction)
> >                                try_to_free = 1;
> >                 }
> >                 JBUFFER_TRACE(jh, "refile or unfile buffer");
> >                 __jbd2_journal_refile_buffer(jh);
> >                 jbd_unlock_bh_state(bh);
> > 
> > Next buffer was unfiled by __jbd2_journal_refile_buffer, JBD_Dirty cleared and
> > BH_Dirty was set in same function. Later it must have been written back as
> > BH_Dirty was cleared.
> > And ext4_wait_for_tail_page_commit-> __ext4_journalled_invalidatepage ->
> > journal_unmap_buffer zaps the buffer:
> > 
> >                 if (!buffer_dirty(bh)) {
> >                         /* bdflush has written it.  We can drop it now */
> >                         goto zap_buffer;
> >                 }
> > 
> > next, truncate_pagecache is called, which clears the page mapping.
> > Eventually, remove checkpoint is called, but such page with NULL mapping was
> > not freed. So, I had added release_buffer_page at the end of remove checkpoint
> > to attempt to free such free buffer pages. Please let me know your opinion.
> 
> OK, thanks for the detailed analysis. But when the buffer gets truncated,
> jbd2_journal_invalidatepage() either removes the buffer from the
> transaction (obviously didn't happen here) or it sets buffer_freed and
> buffer_jbddirty should get cleared when processing the BJ_Forget list. So
> why that didn't happen? Can you have a look into what
> jbd2_journal_invalidatepage() did to buffer in that page?
> 
> Generally truncated buffers should not enter checkpoint list since writing
> them is just a waste of disk bandwidth...

I thought a bit more about this and my guess is that
jbd2_journal_invalidatepage() fails to invalidate the partial page and
returns EBUSY. However we should see warnings about that from
ext4_journalled_invalidatepage(). Can you see them? This would perfectly
match Ted's observation that the problem happens only for fallocate
operations but not for truncate which does the dance with
ext4_wait_for_tail_page_commit() but fallocate operations don't do it...

Now if I'm right and this is indeed the path we are hitting, we need to put
more thought into how we handle truncation of partial pages in data=journal
mode.

BTW: The ext4_force_commit() call in fallocate operations is definitely
racy as pages can get dirtied and journalled in the running transaction
anytime while we wait for transaction commit (which I suspect is happening
in your testcase)... So that needs some more thought as well.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: memory leak: data=journal and {collapse,insert,zero}_range
  2015-11-19  9:42                 ` Jan Kara
@ 2015-11-20  4:34                   ` Namjae Jeon
  2015-11-23 13:53                     ` Jan Kara
  0 siblings, 1 reply; 13+ messages in thread
From: Namjae Jeon @ 2015-11-20  4:34 UTC (permalink / raw)
  To: 'Jan Kara'; +Cc: 'Theodore Ts'o', linux-ext4


> 
> On Wed 18-11-15 22:36:51, Jan Kara wrote:
> > On Tue 17-11-15 13:47:50, Namjae Jeon wrote:
> > > > On Mon 09-11-15 14:21:11, Namjae Jeon wrote:
> > > > > >
> > > > > > On Wed, Oct 21, 2015 at 06:44:10PM +0900, Namjae Jeon wrote:
> > > > > > > > Interestingly we're not seeing these memory leaks on the truncate
> > > > > > > > path, so I suspect the issue is in how collapse range is clearing
> > > > > > > > pages from the page cache, especially pages that were freshly written
> > > > > > > > to the journal by the commit but which hadn't yet been writtten to
> > > > > > > > disk and then marked as complete so we can allow the relevant
> > > > > > > > transaction to be checkpointed.  (Although we're not leaking the
> > > > > > > > journal head structures, but only the buffer heads, so the story most
> > > > > > > > be a bit more complicated than that.)
> > > > > > >
> > > > > > > Okay, Thanks for sharing your view and points !!
> > > > > > >
> > > > > > > Currently I can reproduce memory leak issue without collase/insert/zero range.
> > > > > > > conditions like the following.(collase/insert/zero range are disable with -I -C -z option
> and
> > > > add -y
> > > > > > option instead of -W)
> > > > > > >   1. small size parition(1GB)
> > > > > > >   2. run fsx with these options "./fsx -N 30000 -o 128000 -l 500000 -r 4096 -t 512 -w 512
> -Z -R
> > > > -y -
> > > > > > I -C -z testfile"
> > > > > > > And same result with generic/091 is showing (buffer_head leak)
> > > > > > >
> > > > > > > So I am starting to find root-cause base on your points.
> > > > > > > I will share the result or the patch.
> > > > > >
> > > > > > Thanks, that's very interesting data point.  So this makes it appear
> > > > > > that the problem *is* probably with how we deal with checkpointing
> > > > > > buffers after the pages get discarded using either a truncate or a
> > > > > > collapse_range, since the 'y' option causes a lot fsync's, and hence
> > > > > > commits, some of which are happening after a truncate command.
> > > > > >
> > > > > > Thanks for a taking a look at this.  I really appreciate it.
> > > > > >
> > > > > > Cheers,
> > > > >
> > > > >
> > > > > Hi Ted,
> > > > >
> > > > > Could you review this patch?
> > > > >
> > > > > Thanks!
> > > > > ---------------------------------------------------------------------------------
> > > > > Subject: [PATCH] jbd2: try to free buffers from truncated page after
> > > > >  checkpoint
> > > > >
> > > > > when ext4 is mounted in data=journal mode, and truncate operation
> > > > > such as settatr(size), collopse, insert and zero range are used, there are
> > > > > are many truncated pages with NULL page->mapping. Such truncated pages
> > > > > pile up quickly due to truncate_pagecache on data pages associated with journal.
> > > > > As page->mapping is NULL for such truncated pages, they are not freed
> > > > > by drop cache(3) or umount. As a result, MemFree in /proc/meminfo decreases
> > > > > quickly and active buffer_head slab objects grow in /proc/slabinfo.
> > > > > This patch attempts to free buffers from such pages at the end of jbd2
> > > > > checkpoint, if pages do not have any busy buffers and NULL mapping.
> > > >
> > > > Hum, why such pages didn't get freed by release_buffer_page() call
> > > > happening when processing transaction's forget list? Because the idea is
> > > > that such pages should be discarded at that point...
> > > Hi Jan,
> > >
> > > When I checked this code, release_buffer_page is not called
> > > when buffer_jbddirty is true. Such buffers with JBD_Dirty are added
> > > to new checkpoint.
> > >
> > >                 if (buffer_jbddirty(bh)) {
> > >                        JBUFFER_TRACE(jh, "add to new checkpointing trans");
> > >                        __jbd2_journal_insert_checkpoint(jh, commit_transaction);
> > >                        if (is_journal_aborted(journal))
> > >                                clear_buffer_jbddirty(bh);
> > >                 } else {
> > >                        J_ASSERT_BH(bh, !buffer_dirty(bh));
> > >                        /*
> > >                         * The buffer on BJ_Forget list and not jbddirty means
> > >                         * it has been freed by this transaction and hence it
> > >                         * could not have been reallocated until this
> > >                         * transaction has committed. *BUT* it could be
> > >                         * reallocated once we have written all the data to
> > >                         * disk and before we process the buffer on BJ_Forget
> > >                         * list.
> > >                         */
> > >                        if (!jh->b_next_transaction)
> > >                                try_to_free = 1;
> > >                 }
> > >                 JBUFFER_TRACE(jh, "refile or unfile buffer");
> > >                 __jbd2_journal_refile_buffer(jh);
> > >                 jbd_unlock_bh_state(bh);
> > >
> > > Next buffer was unfiled by __jbd2_journal_refile_buffer, JBD_Dirty cleared and
> > > BH_Dirty was set in same function. Later it must have been written back as
> > > BH_Dirty was cleared.
> > > And ext4_wait_for_tail_page_commit-> __ext4_journalled_invalidatepage ->
> > > journal_unmap_buffer zaps the buffer:
> > >
> > >                 if (!buffer_dirty(bh)) {
> > >                         /* bdflush has written it.  We can drop it now */
> > >                         goto zap_buffer;
> > >                 }
> > >
> > > next, truncate_pagecache is called, which clears the page mapping.
> > > Eventually, remove checkpoint is called, but such page with NULL mapping was
> > > not freed. So, I had added release_buffer_page at the end of remove checkpoint
> > > to attempt to free such free buffer pages. Please let me know your opinion.
> >
> > OK, thanks for the detailed analysis. But when the buffer gets truncated,
> > jbd2_journal_invalidatepage() either removes the buffer from the
> > transaction (obviously didn't happen here) or it sets buffer_freed and
> > buffer_jbddirty should get cleared when processing the BJ_Forget list. So
> > why that didn't happen? Can you have a look into what
> > jbd2_journal_invalidatepage() did to buffer in that page?
> >
> > Generally truncated buffers should not enter checkpoint list since writing
> > them is just a waste of disk bandwidth...
> 
> I thought a bit more about this and my guess is that
> jbd2_journal_invalidatepage() fails to invalidate the partial page and
> returns EBUSY. However we should see warnings about that from
> ext4_journalled_invalidatepage(). Can you see them? This would perfectly
> match Ted's observation that the problem happens only for fallocate
> operations but not for truncate which does the dance with
> ext4_wait_for_tail_page_commit() but fallocate operations don't do it...
> 
> Now if I'm right and this is indeed the path we are hitting, we need to put
> more thought into how we handle truncation of partial pages in data=journal
> mode.
Actually, I can quickly create memleak with simple testcase like:
while(1000)
{
Random seek(max 500M) -> write 1M -> fsync ->truncate(random size)
}

As there is significant memleak in this situation also, I started focusing
on this simple testcase only. When use this testcase, I never saw EBUSY path
in jbd2_journal_invalidatepage() -> journal_unmap_buffer().

On the other hand(), for memleak pages jbd2_journal_invalidatepage() ->
journal_unmap_buffer() was almost bailing out every time from
                if (!buffer_dirty(bh)) {
                        /* bdflush has written it.  We can drop it now */
                        goto zap_buffer;
                }

As per debugging, NULL mapping page or buffer is created in below scenario:
Write(or Kjournald) -> jbd2_journal_commit_transaction
                    -> BH_JBDDirty buffer is added to forget list.
                    -> Buffer is added to new checkpoint,
                       additional reference count is taken on b_jcount which
                       keeps buffer busy until remove checkpoint.
                    -> Buffer is unfiled, removed from forget list, BH_JBDDirty
                       is cleared and BH_Dirty is set, it is exposed to VM. 


As BH_Dirty is set, buffer is written by ext4_writepage by flusher.
ext4_writepage finally clears BH_Dirty.

Now, ftruncate is called, almost every time code is returned from below path:
Ftruncate -> jbd2_journal_invalidatepage -> journal_unmap_buffer
                if (!buffer_dirty(bh)) {
                        /* bdflush has written it.  We can drop it now */
                        goto zap_buffer;
                }

page was not released by jbd2_journal_invalidatepage, mainly because b_jcount
was 2 on entry of jbd2_journal_invalidatepage (addional reference by insert
checkpoint above), buffer state was sane.
At exit time jbd2_journal_invalidatepage, b_jcount is still 1.

Next, truncate_pagecache is called which makes page mapping NULL.
Later, __jbd2_journal_remove_checkpoint is called, it makes b_jcount zero and
remove BH_JBD from buffer. Such pages with NULL mapping pages are not freed.
Previous patch fixed this leak if try release_page_buffer at the end of
checkpoint.

> 
> BTW: The ext4_force_commit() call in fallocate operations is definitely
> racy as pages can get dirtied and journalled in the running transaction
> anytime while we wait for transaction commit (which I suspect is happening
> in your testcase)... So that needs some more thought as well.
Need to check more on this. I can reproduce memleak without fsync
(i.e. ext4_force_commit), though it is less.

Thanks!
> 
> 								Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: memory leak: data=journal and {collapse,insert,zero}_range
  2015-11-20  4:34                   ` Namjae Jeon
@ 2015-11-23 13:53                     ` Jan Kara
  2015-11-24  4:21                       ` Namjae Jeon
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Kara @ 2015-11-23 13:53 UTC (permalink / raw)
  To: Namjae Jeon; +Cc: 'Jan Kara', 'Theodore Ts'o', linux-ext4

[-- Attachment #1: Type: text/plain, Size: 1599 bytes --]

On Fri 20-11-15 13:34:36, Namjae Jeon wrote:
> Actually, I can quickly create memleak with simple testcase like:
> while(1000)
> {
> Random seek(max 500M) -> write 1M -> fsync ->truncate(random size)
> }
> 
> As there is significant memleak in this situation also, I started focusing
> on this simple testcase only. When use this testcase, I never saw EBUSY path
> in jbd2_journal_invalidatepage() -> journal_unmap_buffer().
> 
> On the other hand(), for memleak pages jbd2_journal_invalidatepage() ->
> journal_unmap_buffer() was almost bailing out every time from
>                 if (!buffer_dirty(bh)) {
>                         /* bdflush has written it.  We can drop it now */
>                         goto zap_buffer;
>                 }
> 
> As per debugging, NULL mapping page or buffer is created in below scenario:
> Write(or Kjournald) -> jbd2_journal_commit_transaction
>                     -> BH_JBDDirty buffer is added to forget list.
>                     -> Buffer is added to new checkpoint,
>                        additional reference count is taken on b_jcount which
>                        keeps buffer busy until remove checkpoint.
>                     -> Buffer is unfiled, removed from forget list, BH_JBDDirty
>                        is cleared and BH_Dirty is set, it is exposed to VM. 

Ah, I see! Thanks for the easy reproduction testcase and for the debugging.
Attached patch fixes the issue (at least the simple truncate case) for me.
Can you check whether it fixes the issue for you as well? Thanks!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-jbd2-Fix-unreclaimed-pages-after-truncate-in-data-jo.patch --]
[-- Type: text/x-patch, Size: 2206 bytes --]

>From 29e43b5691866e8eb363c1ef64e1fc89b2792e86 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Mon, 23 Nov 2015 13:34:30 +0100
Subject: [PATCH] jbd2: Fix unreclaimed pages after truncate in data=journal
 mode

Ted and Namjae have reported that truncated pages don't get timely
reclaimed after being truncated in data=journal mode. The following test
triggers the issue easily:

for (i = 0; i < 1000; i++) {
	pwrite(fd, buf, 1024*1024, 0);
	fsync(fd);
	fsync(fd);
	ftruncate(fd, 0);
}

The reason is that journal_unmap_buffer() finds that truncated buffers
are not journalled (jh->b_transaction == NULL), they are part of
checkpoint list of a transaction (jh->b_cp_transaction != NULL) and have
been already written out (!buffer_dirty(bh)). We clean such buffers but
we leave them in the checkpoint list. Since checkpoint transaction holds
a reference to the journal head, these buffers cannot be released until
the checkpoint transaction is cleaned up. And at that point we don't
call release_buffer_page() anymore so pages detached from mapping are
lingering in the system waiting for reclaim to find them and free them.

Fix the problem by removing buffers from transaction checkpoint lists
when journal_unmap_buffer() finds out they don't have to be there
anymore.

Reported-by: Namjae Jeon <namjae.jeon@samsung.com>
Fixes: de1b794130b130e77ffa975bb58cb843744f9ae5
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 89463eee6791..daa1514259e0 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2152,6 +2152,7 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh,
 
 		if (!buffer_dirty(bh)) {
 			/* bdflush has written it.  We can drop it now */
+			__jbd2_journal_remove_checkpoint(jh);
 			goto zap_buffer;
 		}
 
@@ -2181,6 +2182,7 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh,
 				/* The orphan record's transaction has
 				 * committed.  We can cleanse this buffer */
 				clear_buffer_jbddirty(bh);
+				__jbd2_journal_remove_checkpoint(jh);
 				goto zap_buffer;
 			}
 		}
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* RE: memory leak: data=journal and {collapse,insert,zero}_range
  2015-11-23 13:53                     ` Jan Kara
@ 2015-11-24  4:21                       ` Namjae Jeon
  0 siblings, 0 replies; 13+ messages in thread
From: Namjae Jeon @ 2015-11-24  4:21 UTC (permalink / raw)
  To: 'Jan Kara'; +Cc: 'Theodore Ts'o', linux-ext4

> 
> On Fri 20-11-15 13:34:36, Namjae Jeon wrote:
> > Actually, I can quickly create memleak with simple testcase like:
> > while(1000)
> > {
> > Random seek(max 500M) -> write 1M -> fsync ->truncate(random size)
> > }
> >
> > As there is significant memleak in this situation also, I started focusing
> > on this simple testcase only. When use this testcase, I never saw EBUSY path
> > in jbd2_journal_invalidatepage() -> journal_unmap_buffer().
> >
> > On the other hand(), for memleak pages jbd2_journal_invalidatepage() ->
> > journal_unmap_buffer() was almost bailing out every time from
> >                 if (!buffer_dirty(bh)) {
> >                         /* bdflush has written it.  We can drop it now */
> >                         goto zap_buffer;
> >                 }
> >
> > As per debugging, NULL mapping page or buffer is created in below scenario:
> > Write(or Kjournald) -> jbd2_journal_commit_transaction
> >                     -> BH_JBDDirty buffer is added to forget list.
> >                     -> Buffer is added to new checkpoint,
> >                        additional reference count is taken on b_jcount which
> >                        keeps buffer busy until remove checkpoint.
> >                     -> Buffer is unfiled, removed from forget list, BH_JBDDirty
> >                        is cleared and BH_Dirty is set, it is exposed to VM.
> 
> Ah, I see! Thanks for the easy reproduction testcase and for the debugging.
> Attached patch fixes the issue (at least the simple truncate case) for me.
> Can you check whether it fixes the issue for you as well? Thanks!
This patch looks good to me! I checked that all leak issues are fixed
with your patch.
Thanks for your help!
> 
> 								Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2015-11-24  4:21 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-10-17 16:02 memory leak: data=journal and {collapse,insert,zero}_range Theodore Ts'o
2015-10-20 12:06 ` Namjae Jeon
2015-10-20 15:54   ` Theodore Ts'o
2015-10-21  9:44     ` Namjae Jeon
2015-10-21 14:52       ` Theodore Ts'o
2015-11-09  5:21         ` Namjae Jeon
2015-11-10 14:49           ` Jan Kara
2015-11-17  4:47             ` Namjae Jeon
2015-11-18 21:36               ` Jan Kara
2015-11-19  9:42                 ` Jan Kara
2015-11-20  4:34                   ` Namjae Jeon
2015-11-23 13:53                     ` Jan Kara
2015-11-24  4:21                       ` Namjae Jeon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox