[PATCH] ext4: fix free clusters calculation in bigalloc filesystem

linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] ext4: fix free clusters calculation in bigalloc filesystem
@ 2013-02-21  8:01 Lukas Czerner
  2013-02-21 12:15 ` [PATCH] ext4: fix overhead calculation in bigalloc filesystem (Re: ... ) Zheng Liu
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Lukas Czerner @ 2013-02-21  8:01 UTC (permalink / raw)
  To: linux-ext4; +Cc: Lukas Czerner

ext4_has_free_clusters() should tell us whether there is enough free
clusters to allocate, however number of free clusters in the file system
is converted to blocks using EXT4_C2B() which is not only wrong use of
the macro (we should have used EXT4_NUM_B2C) but it's also completely
wrong concept since everything else is in cluster units.

Moreover when calculating number of root clusters we should be using
macro EXT4_NUM_B2C() instead of EXT4_C2B() otherwise the result will
usually be off by one.

As a result number of free clusters is much bigger than it should have
been and ext4_has_free_clusters() would return 1 even if there is really
not enough free clusters available.

Fix this by removing the EXT4_C2B() conversion of free clusters and
using EXT4_NUM_B2C() when calculating number of root clusters. This bug
affects number of xfstests tests covering file system ENOSPC situation
handling. With this patch most of the ENOSPC problems with bigalloc file
system disappear, especially the errors caused by delayed allocation not
having enough space when the actual allocation is finally requested.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 fs/ext4/balloc.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index cf18217..9f4302a 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -482,11 +482,11 @@ static int ext4_has_free_clusters(struct ext4_sb_info *sbi,

 	free_clusters  = percpu_counter_read_positive(fcc);
 	dirty_clusters = percpu_counter_read_positive(dcc);
-	root_clusters = EXT4_B2C(sbi, ext4_r_blocks_count(sbi->s_es));
+	root_clusters = EXT4_NUM_B2C(sbi, ext4_r_blocks_count(sbi->s_es));

 	if (free_clusters - (nclusters + root_clusters + dirty_clusters) <
 					EXT4_FREECLUSTERS_WATERMARK) {
-		free_clusters  = EXT4_C2B(sbi, percpu_counter_sum_positive(fcc));
+		free_clusters  = percpu_counter_sum_positive(fcc);
 		dirty_clusters = percpu_counter_sum_positive(dcc);
 	}
 	/* Check whether we have space after accounting for current
-- 
1.7.7.6

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH] ext4: fix overhead calculation in bigalloc filesystem (Re: ... )
  2013-02-21  8:01 [PATCH] ext4: fix free clusters calculation in bigalloc filesystem Lukas Czerner
@ 2013-02-21 12:15 ` Zheng Liu
  2013-02-21 12:40   ` Lukáš Czerner
  2013-02-22  5:10 ` [PATCH] ext4: fix free clusters calculation in bigalloc filesystem Theodore Ts'o
  2013-02-22  8:39 ` [PATCH v2] " Lukas Czerner
  2 siblings, 1 reply; 19+ messages in thread
From: Zheng Liu @ 2013-02-21 12:15 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-ext4, Theodore Ts'o

On Thu, Feb 21, 2013 at 09:01:05AM +0100, Lukas Czerner wrote:
> ext4_has_free_clusters() should tell us whether there is enough free
> clusters to allocate, however number of free clusters in the file system
> is converted to blocks using EXT4_C2B() which is not only wrong use of
> the macro (we should have used EXT4_NUM_B2C) but it's also completely
> wrong concept since everything else is in cluster units.
> 
> Moreover when calculating number of root clusters we should be using
> macro EXT4_NUM_B2C() instead of EXT4_C2B() otherwise the result will
> usually be off by one.
> 
> As a result number of free clusters is much bigger than it should have
> been and ext4_has_free_clusters() would return 1 even if there is really
> not enough free clusters available.
> 
> Fix this by removing the EXT4_C2B() conversion of free clusters and
> using EXT4_NUM_B2C() when calculating number of root clusters. This bug
> affects number of xfstests tests covering file system ENOSPC situation
> handling. With this patch most of the ENOSPC problems with bigalloc file
> system disappear, especially the errors caused by delayed allocation not
> having enough space when the actual allocation is finally requested.
> 
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>

Great!  Thanks for fixing it.  After applied this patch, xfstests #15
with bigalloc and delalloc won't cause a failure.  You can add
Reviewed-and-tested-by: Zheng Liu <wenqing.lz@taobao.com>

BTW, xfstests (204, 219, 233, 235, 273, and 274) still cause failures in
my test environment, and I still get a warning message which looks like:

kernel: EXT4-fs (sda2): ext4_da_update_reserve_space: ino 3658, allocated 1
with only 0 reserved metadata blocks
kernel:
kernel: ------------[ cut here ]------------
kernel: WARNING: at fs/ext4/inode.c:362 ext4_da_update_reserve_space+0x10f/0x21b
[ext4]()
kernel: Hardware name: OptiPlex 780                 
kernel: Modules linked in: ext4 jbd2 crc16 cpufreq_ondemand ipv6 dm_mirror
dm_region_hash dm_log dm_mod parport_pc parport cspkr i2c_i801 i2c_core
serio_raw sg ehci_pci ehci_hcd button e1000e ext3 jbd sd_mod ahci libahci libata
scsi_mod uhci_hcd
kernel: Pid: 2628, comm: 2372.fsstress.b Tainted: G W    3.8.0+ #7
kernel: Call Trace:
kernel: [<ffffffff82031d68>] warn_slowpath_common+0x85/0x9d
kernel: [<ffffffff82031d9a>] warn_slowpath_null+0x1a/0x1c
kernel: [<ffffffffa0200240>] ext4_da_update_reserve_space+0x10f/0x21b [ext4]
kernel: [<ffffffffa02277cd>] ext4_ext_map_blocks+0xd83/0xf66 [ext4]
kernel: [<ffffffff820ba4a8>] ? release_pages+0x169/0x178
kernel: [<ffffffff820ba011>] ? pagevec_lookup_tag+0x25/0x2e
kernel: [<ffffffffa02018d3>] ? write_cache_pages_da+0x107/0x3c4 [ext4]
kernel: [<ffffffffa0200c36>] ext4_map_blocks+0x135/0x1ef [ext4]
kernel: [<ffffffffa0201451>] mpage_da_map_and_submit+0x111/0x3d8 [ext4]
kernel: [<ffffffffa0201f0e>] ext4_da_writepages+0x37e/0x526 [ext4]
kernel: [<ffffffff820b86d9>] do_writepages+0x20/0x29
kernel: [<ffffffff820b13da>] __filemap_fdatawrite_range+0x50/0x52
kernel: [<ffffffff820b19a5>] filemap_fdatawrite+0x1f/0x21
kernel: [<ffffffff820b19c4>] filemap_write_and_wait+0x1d/0x38
kernel: [<ffffffff820fc4a9>] do_vfs_ioctl+0x2db/0x47f
kernel: [<ffffffff820fc6ab>] sys_ioctl+0x5e/0x82
kernel: [<ffffffff82386942>] system_call_fastpath+0x16/0x1b
kernel: ---[ end trace d96610456f905628 ]---

It is easy to trigger this warning when running xfstests #127 or #225.

Moreover, it seems that there still has an improvement in
ext4_calculate_overhead().  I paste the patch here.

Regards,
                                                - Zheng

Subject: [PATCH] ext4: fix overhead calculation in bigalloc filesystem

From: Zheng Liu <wenqing.lz@taobao.com>

ext4_calculate_overhead() should compute the overhead and stash it in
sbi->s_overhead.  But we miss use EXT4_B2C() to calculate the number of
clusters before first_data_block and the number of journal blocks.  This
commit use EXT4_NUM_B2C() instead of EXT4_B2C() to calculate the
overhead.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/ext4/super.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3d4fb81..6165558 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3219,7 +3219,7 @@ int ext4_calculate_overhead(struct super_block *sb)
 	/*
 	 * All of the blocks before first_data_block are overhead
 	 */
-	overhead = EXT4_B2C(sbi, le32_to_cpu(es->s_first_data_block));
+	overhead = EXT4_NUM_B2C(sbi, le32_to_cpu(es->s_first_data_block));

 	/*
 	 * Add the overhead found in each block group
@@ -3235,7 +3235,7 @@ int ext4_calculate_overhead(struct super_block *sb)
 	}
 	/* Add the journal blocks as well */
 	if (sbi->s_journal)
-		overhead += EXT4_B2C(sbi, sbi->s_journal->j_maxlen);
+		overhead += EXT4_NUM_B2C(sbi, sbi->s_journal->j_maxlen);

 	sbi->s_overhead = overhead;
 	smp_wmb();
-- 
1.7.12.rc2.18.g61b472e

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] ext4: fix overhead calculation in bigalloc filesystem (Re: ... )
  2013-02-21 12:15 ` [PATCH] ext4: fix overhead calculation in bigalloc filesystem (Re: ... ) Zheng Liu
@ 2013-02-21 12:40   ` Lukáš Czerner
  2013-02-21 12:50     ` Lukáš Czerner
  2013-02-21 13:12     ` [PATCH] ext4: fix overhead calculation in bigalloc filesystem (Re: ... ) Zheng Liu
  0 siblings, 2 replies; 19+ messages in thread
From: Lukáš Czerner @ 2013-02-21 12:40 UTC (permalink / raw)
  To: Zheng Liu; +Cc: Lukas Czerner, linux-ext4, Theodore Ts'o

On Thu, 21 Feb 2013, Zheng Liu wrote:

> Date: Thu, 21 Feb 2013 20:15:45 +0800
> From: Zheng Liu <gnehzuil.liu@gmail.com>
> To: Lukas Czerner <lczerner@redhat.com>
> Cc: linux-ext4@vger.kernel.org, Theodore Ts'o <tytso@mit.edu>
> Subject: [PATCH] ext4: fix overhead calculation in bigalloc filesystem (Re:
>     ... )
> 
> On Thu, Feb 21, 2013 at 09:01:05AM +0100, Lukas Czerner wrote:
> > ext4_has_free_clusters() should tell us whether there is enough free
> > clusters to allocate, however number of free clusters in the file system
> > is converted to blocks using EXT4_C2B() which is not only wrong use of
> > the macro (we should have used EXT4_NUM_B2C) but it's also completely
> > wrong concept since everything else is in cluster units.
> > 
> > Moreover when calculating number of root clusters we should be using
> > macro EXT4_NUM_B2C() instead of EXT4_C2B() otherwise the result will
> > usually be off by one.
> > 
> > As a result number of free clusters is much bigger than it should have
> > been and ext4_has_free_clusters() would return 1 even if there is really
> > not enough free clusters available.
> > 
> > Fix this by removing the EXT4_C2B() conversion of free clusters and
> > using EXT4_NUM_B2C() when calculating number of root clusters. This bug
> > affects number of xfstests tests covering file system ENOSPC situation
> > handling. With this patch most of the ENOSPC problems with bigalloc file
> > system disappear, especially the errors caused by delayed allocation not
> > having enough space when the actual allocation is finally requested.
> > 
> > Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> 
> Great!  Thanks for fixing it.  After applied this patch, xfstests #15
> with bigalloc and delalloc won't cause a failure.  You can add
> Reviewed-and-tested-by: Zheng Liu <wenqing.lz@taobao.com>
> 
> BTW, xfstests (204, 219, 233, 235, 273, and 274) still cause failures in
> my test environment, and I still get a warning message which looks like:
> 
> kernel: EXT4-fs (sda2): ext4_da_update_reserve_space: ino 3658, allocated 1
> with only 0 reserved metadata blocks
> kernel:
> kernel: ------------[ cut here ]------------
> kernel: WARNING: at fs/ext4/inode.c:362 ext4_da_update_reserve_space+0x10f/0x21b
> [ext4]()
> kernel: Hardware name: OptiPlex 780                 
> kernel: Modules linked in: ext4 jbd2 crc16 cpufreq_ondemand ipv6 dm_mirror
> dm_region_hash dm_log dm_mod parport_pc parport cspkr i2c_i801 i2c_core
> serio_raw sg ehci_pci ehci_hcd button e1000e ext3 jbd sd_mod ahci libahci libata
> scsi_mod uhci_hcd
> kernel: Pid: 2628, comm: 2372.fsstress.b Tainted: G W    3.8.0+ #7
> kernel: Call Trace:
> kernel: [<ffffffff82031d68>] warn_slowpath_common+0x85/0x9d
> kernel: [<ffffffff82031d9a>] warn_slowpath_null+0x1a/0x1c
> kernel: [<ffffffffa0200240>] ext4_da_update_reserve_space+0x10f/0x21b [ext4]
> kernel: [<ffffffffa02277cd>] ext4_ext_map_blocks+0xd83/0xf66 [ext4]
> kernel: [<ffffffff820ba4a8>] ? release_pages+0x169/0x178
> kernel: [<ffffffff820ba011>] ? pagevec_lookup_tag+0x25/0x2e
> kernel: [<ffffffffa02018d3>] ? write_cache_pages_da+0x107/0x3c4 [ext4]
> kernel: [<ffffffffa0200c36>] ext4_map_blocks+0x135/0x1ef [ext4]
> kernel: [<ffffffffa0201451>] mpage_da_map_and_submit+0x111/0x3d8 [ext4]
> kernel: [<ffffffffa0201f0e>] ext4_da_writepages+0x37e/0x526 [ext4]
> kernel: [<ffffffff820b86d9>] do_writepages+0x20/0x29
> kernel: [<ffffffff820b13da>] __filemap_fdatawrite_range+0x50/0x52
> kernel: [<ffffffff820b19a5>] filemap_fdatawrite+0x1f/0x21
> kernel: [<ffffffff820b19c4>] filemap_write_and_wait+0x1d/0x38
> kernel: [<ffffffff820fc4a9>] do_vfs_ioctl+0x2db/0x47f
> kernel: [<ffffffff820fc6ab>] sys_ioctl+0x5e/0x82
> kernel: [<ffffffff82386942>] system_call_fastpath+0x16/0x1b
> kernel: ---[ end trace d96610456f905628 ]---
> 
> It is easy to trigger this warning when running xfstests #127 or #225.
> 
> Moreover, it seems that there still has an improvement in
> ext4_calculate_overhead().  I paste the patch here.
> 
> Regards,
>                                                 - Zheng

Hi Zheng,

thanks for the review. I know about the other issues and I'm trying
to resolve those as well. Right now I have a patch which includes
the changes ext4_calculate_overhead() you've described below and more,
but even with this I still see some problems remaining.

Hopefully will send another patch soon.

Thanks!
-Lukas

> 
> Subject: [PATCH] ext4: fix overhead calculation in bigalloc filesystem
> 
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> ext4_calculate_overhead() should compute the overhead and stash it in
> sbi->s_overhead.  But we miss use EXT4_B2C() to calculate the number of
> clusters before first_data_block and the number of journal blocks.  This
> commit use EXT4_NUM_B2C() instead of EXT4_B2C() to calculate the
> overhead.
> 
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> Cc: "Theodore Ts'o" <tytso@mit.edu>
> ---
>  fs/ext4/super.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 3d4fb81..6165558 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -3219,7 +3219,7 @@ int ext4_calculate_overhead(struct super_block *sb)
>  	/*
>  	 * All of the blocks before first_data_block are overhead
>  	 */
> -	overhead = EXT4_B2C(sbi, le32_to_cpu(es->s_first_data_block));
> +	overhead = EXT4_NUM_B2C(sbi, le32_to_cpu(es->s_first_data_block));
>  
>  	/*
>  	 * Add the overhead found in each block group
> @@ -3235,7 +3235,7 @@ int ext4_calculate_overhead(struct super_block *sb)
>  	}
>  	/* Add the journal blocks as well */
>  	if (sbi->s_journal)
> -		overhead += EXT4_B2C(sbi, sbi->s_journal->j_maxlen);
> +		overhead += EXT4_NUM_B2C(sbi, sbi->s_journal->j_maxlen);
>  
>  	sbi->s_overhead = overhead;
>  	smp_wmb();
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] ext4: fix overhead calculation in bigalloc filesystem (Re: ... )
  2013-02-21 12:40   ` Lukáš Czerner
@ 2013-02-21 12:50     ` Lukáš Czerner
  2013-02-21 12:52       ` Lukáš Czerner
  2013-02-21 13:12     ` [PATCH] ext4: fix overhead calculation in bigalloc filesystem (Re: ... ) Zheng Liu
  1 sibling, 1 reply; 19+ messages in thread
From: Lukáš Czerner @ 2013-02-21 12:50 UTC (permalink / raw)
  To: Lukáš Czerner; +Cc: Zheng Liu, linux-ext4, Theodore Ts'o

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2244 bytes --]

On Thu, 21 Feb 2013, Lukáš Czerner wrote:

..snip...

> 
> Hi Zheng,
> 
> thanks for the review. I know about the other issues and I'm trying
> to resolve those as well. Right now I have a patch which includes
> the changes ext4_calculate_overhead() you've described below and more,
> but even with this I still see some problems remaining.
> 
> Hopefully will send another patch soon.
> 
> Thanks!
> -Lukas
> 
> > 
> > Subject: [PATCH] ext4: fix overhead calculation in bigalloc filesystem
> > 
> > From: Zheng Liu <wenqing.lz@taobao.com>
> > 
> > ext4_calculate_overhead() should compute the overhead and stash it in
> > sbi->s_overhead.  But we miss use EXT4_B2C() to calculate the number of
> > clusters before first_data_block and the number of journal blocks.  This
> > commit use EXT4_NUM_B2C() instead of EXT4_B2C() to calculate the
> > overhead.
> > 
> > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> > Cc: "Theodore Ts'o" <tytso@mit.edu>
> > ---
> >  fs/ext4/super.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index 3d4fb81..6165558 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -3219,7 +3219,7 @@ int ext4_calculate_overhead(struct super_block *sb)
> >  	/*
> >  	 * All of the blocks before first_data_block are overhead
> >  	 */
> > -	overhead = EXT4_B2C(sbi, le32_to_cpu(es->s_first_data_block));
> > +	overhead = EXT4_NUM_B2C(sbi, le32_to_cpu(es->s_first_data_block));

...except this. I do not think this is right because we do not skip
the first cluster right ? We're still using it, but we can never use
the block before es->s_first_data_block. Please correct me if I am
wrong.


> >  
> >  	/*
> >  	 * Add the overhead found in each block group
> > @@ -3235,7 +3235,7 @@ int ext4_calculate_overhead(struct super_block *sb)
> >  	}
> >  	/* Add the journal blocks as well */
> >  	if (sbi->s_journal)
> > -		overhead += EXT4_B2C(sbi, sbi->s_journal->j_maxlen);
> > +		overhead += EXT4_NUM_B2C(sbi, sbi->s_journal->j_maxlen);

This I already have in my patch I'm testing right now. And as I said
there are other places where we misuse EXT4_B2C().

-Lukas

> >  
> >  	sbi->s_overhead = overhead;
> >  	smp_wmb();
> > 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] ext4: fix overhead calculation in bigalloc filesystem (Re: ... )
  2013-02-21 12:50     ` Lukáš Czerner
@ 2013-02-21 12:52       ` Lukáš Czerner
  2013-02-21 13:49         ` Zheng Liu
  0 siblings, 1 reply; 19+ messages in thread
From: Lukáš Czerner @ 2013-02-21 12:52 UTC (permalink / raw)
  To: Lukáš Czerner; +Cc: Zheng Liu, linux-ext4, Theodore Ts'o

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2836 bytes --]

On Thu, 21 Feb 2013, Lukáš Czerner wrote:

> Date: Thu, 21 Feb 2013 13:50:03 +0100 (CET)
> From: Lukáš Czerner <lczerner@redhat.com>
> To: Lukáš Czerner <lczerner@redhat.com>
> Cc: Zheng Liu <gnehzuil.liu@gmail.com>, linux-ext4@vger.kernel.org,
>     Theodore Ts'o <tytso@mit.edu>
> Subject: Re: [PATCH] ext4: fix overhead calculation in bigalloc filesystem
>     (Re: ... )
> 
> On Thu, 21 Feb 2013, Lukáš Czerner wrote:
> 
> ..snip...
> 
> > 
> > Hi Zheng,
> > 
> > thanks for the review. I know about the other issues and I'm trying
> > to resolve those as well. Right now I have a patch which includes
> > the changes ext4_calculate_overhead() you've described below and more,
> > but even with this I still see some problems remaining.
> > 
> > Hopefully will send another patch soon.
> > 
> > Thanks!
> > -Lukas
> > 
> > > 
> > > Subject: [PATCH] ext4: fix overhead calculation in bigalloc filesystem
> > > 
> > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > 
> > > ext4_calculate_overhead() should compute the overhead and stash it in
> > > sbi->s_overhead.  But we miss use EXT4_B2C() to calculate the number of
> > > clusters before first_data_block and the number of journal blocks.  This
> > > commit use EXT4_NUM_B2C() instead of EXT4_B2C() to calculate the
> > > overhead.
> > > 
> > > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> > > Cc: "Theodore Ts'o" <tytso@mit.edu>
> > > ---
> > >  fs/ext4/super.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > > index 3d4fb81..6165558 100644
> > > --- a/fs/ext4/super.c
> > > +++ b/fs/ext4/super.c
> > > @@ -3219,7 +3219,7 @@ int ext4_calculate_overhead(struct super_block *sb)
> > >  	/*
> > >  	 * All of the blocks before first_data_block are overhead
> > >  	 */
> > > -	overhead = EXT4_B2C(sbi, le32_to_cpu(es->s_first_data_block));
> > > +	overhead = EXT4_NUM_B2C(sbi, le32_to_cpu(es->s_first_data_block));
> 
> ...except this. I do not think this is right because we do not skip
> the first cluster right ? We're still using it, but we can never use
> the block before es->s_first_data_block. Please correct me if I am
> wrong.

moreover we do not allow bigalloc file system with block size < 4k.

> 
> 
> > >  
> > >  	/*
> > >  	 * Add the overhead found in each block group
> > > @@ -3235,7 +3235,7 @@ int ext4_calculate_overhead(struct super_block *sb)
> > >  	}
> > >  	/* Add the journal blocks as well */
> > >  	if (sbi->s_journal)
> > > -		overhead += EXT4_B2C(sbi, sbi->s_journal->j_maxlen);
> > > +		overhead += EXT4_NUM_B2C(sbi, sbi->s_journal->j_maxlen);
> 
> This I already have in my patch I'm testing right now. And as I said
> there are other places where we misuse EXT4_B2C().
> 
> -Lukas
> 
> > >  
> > >  	sbi->s_overhead = overhead;
> > >  	smp_wmb();
> > > 
> > 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] ext4: fix overhead calculation in bigalloc filesystem (Re: ... )
  2013-02-21 12:52       ` Lukáš Czerner
@ 2013-02-21 13:49         ` Zheng Liu
  2013-02-21 14:56           ` Lukáš Czerner
  0 siblings, 1 reply; 19+ messages in thread
From: Zheng Liu @ 2013-02-21 13:49 UTC (permalink / raw)
  To: Lukáš Czerner; +Cc: linux-ext4, Theodore Ts'o

On Thu, Feb 21, 2013 at 01:52:58PM +0100, Lukáš Czerner wrote:
> On Thu, 21 Feb 2013, Lukáš Czerner wrote:
> 
> > Date: Thu, 21 Feb 2013 13:50:03 +0100 (CET)
> > From: Lukáš Czerner <lczerner@redhat.com>
> > To: Lukáš Czerner <lczerner@redhat.com>
> > Cc: Zheng Liu <gnehzuil.liu@gmail.com>, linux-ext4@vger.kernel.org,
> >     Theodore Ts'o <tytso@mit.edu>
> > Subject: Re: [PATCH] ext4: fix overhead calculation in bigalloc filesystem
> >     (Re: ... )
> > 
> > On Thu, 21 Feb 2013, Lukáš Czerner wrote:
> > 
> > ..snip...
> > 
> > > 
> > > Hi Zheng,
> > > 
> > > thanks for the review. I know about the other issues and I'm trying
> > > to resolve those as well. Right now I have a patch which includes
> > > the changes ext4_calculate_overhead() you've described below and more,
> > > but even with this I still see some problems remaining.
> > > 
> > > Hopefully will send another patch soon.
> > > 
> > > Thanks!
> > > -Lukas
> > > 
> > > > 
> > > > Subject: [PATCH] ext4: fix overhead calculation in bigalloc filesystem
> > > > 
> > > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > > 
> > > > ext4_calculate_overhead() should compute the overhead and stash it in
> > > > sbi->s_overhead.  But we miss use EXT4_B2C() to calculate the number of
> > > > clusters before first_data_block and the number of journal blocks.  This
> > > > commit use EXT4_NUM_B2C() instead of EXT4_B2C() to calculate the
> > > > overhead.
> > > > 
> > > > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> > > > Cc: "Theodore Ts'o" <tytso@mit.edu>
> > > > ---
> > > >  fs/ext4/super.c | 4 ++--
> > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > > > index 3d4fb81..6165558 100644
> > > > --- a/fs/ext4/super.c
> > > > +++ b/fs/ext4/super.c
> > > > @@ -3219,7 +3219,7 @@ int ext4_calculate_overhead(struct super_block *sb)
> > > >  	/*
> > > >  	 * All of the blocks before first_data_block are overhead
> > > >  	 */
> > > > -	overhead = EXT4_B2C(sbi, le32_to_cpu(es->s_first_data_block));
> > > > +	overhead = EXT4_NUM_B2C(sbi, le32_to_cpu(es->s_first_data_block));
> > 
> > ...except this. I do not think this is right because we do not skip
> > the first cluster right ? We're still using it, but we can never use
> > the block before es->s_first_data_block. Please correct me if I am
> > wrong.

Yes, I think you are right.

> 
> moreover we do not allow bigalloc file system with block size < 4k.

No, we allow user to use bigalloc with block size < 4k, such as:

  mkfs.ext4 -b 1024 -C 4096 -O bigalloc ${dev}

This command formats a bigalloc filesystem with blocksize = 1k and
clustersize = 4k, at least in e2fsprogs 1.42.7 it works well.

> 
> > 
> > 
> > > >  
> > > >  	/*
> > > >  	 * Add the overhead found in each block group
> > > > @@ -3235,7 +3235,7 @@ int ext4_calculate_overhead(struct super_block *sb)
> > > >  	}
> > > >  	/* Add the journal blocks as well */
> > > >  	if (sbi->s_journal)
> > > > -		overhead += EXT4_B2C(sbi, sbi->s_journal->j_maxlen);
> > > > +		overhead += EXT4_NUM_B2C(sbi, sbi->s_journal->j_maxlen);
> > 
> > This I already have in my patch I'm testing right now. And as I said
> > there are other places where we misuse EXT4_B2C().
> > 
> > -Lukas
> > 
> > > >  
> > > >  	sbi->s_overhead = overhead;
> > > >  	smp_wmb();
> > > > 
> > > 
> > 

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] ext4: fix overhead calculation in bigalloc filesystem (Re: ... )
  2013-02-21 13:49         ` Zheng Liu
@ 2013-02-21 14:56           ` Lukáš Czerner
  2013-02-22  3:03             ` Zheng Liu
  0 siblings, 1 reply; 19+ messages in thread
From: Lukáš Czerner @ 2013-02-21 14:56 UTC (permalink / raw)
  To: Zheng Liu; +Cc: Lukáš Czerner, linux-ext4, Theodore Ts'o

On Thu, 21 Feb 2013, Zheng Liu wrote:

..snip..

> > > > >  	/*
> > > > >  	 * All of the blocks before first_data_block are overhead
> > > > >  	 */
> > > > > -	overhead = EXT4_B2C(sbi, le32_to_cpu(es->s_first_data_block));
> > > > > +	overhead = EXT4_NUM_B2C(sbi, le32_to_cpu(es->s_first_data_block));
> > > 
> > > ...except this. I do not think this is right because we do not skip
> > > the first cluster right ? We're still using it, but we can never use
> > > the block before es->s_first_data_block. Please correct me if I am
> > > wrong.
> 
> Yes, I think you are right.
> 
> > 
> > moreover we do not allow bigalloc file system with block size < 4k.
> 
> No, we allow user to use bigalloc with block size < 4k, such as:
> 
>   mkfs.ext4 -b 1024 -C 4096 -O bigalloc ${dev}
> 
> This command formats a bigalloc filesystem with blocksize = 1k and
> clustersize = 4k, at least in e2fsprogs 1.42.7 it works well.
> 

Ok, i was pretty sure that we do not allow that, it's good to know.
Also, does it make any sense ? I do not think so, and I would really
consider the fact that we allow that as a bug. We should not allow
that otherwise it unnecessarily extending the test matrix.

What people think about restricting bigalloc _only_ for 4k block
size file systems ?

Thanks!
-Lukas

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] ext4: fix overhead calculation in bigalloc filesystem (Re: ... )
  2013-02-21 14:56           ` Lukáš Czerner
@ 2013-02-22  3:03             ` Zheng Liu
  2013-02-22  4:05               ` Theodore Ts'o
  0 siblings, 1 reply; 19+ messages in thread
From: Zheng Liu @ 2013-02-22  3:03 UTC (permalink / raw)
  To: Lukáš Czerner; +Cc: linux-ext4, Theodore Ts'o

On Thu, Feb 21, 2013 at 03:56:51PM +0100, Lukáš Czerner wrote:
> On Thu, 21 Feb 2013, Zheng Liu wrote:
> 
> ..snip..
> 
> > > > > >  	/*
> > > > > >  	 * All of the blocks before first_data_block are overhead
> > > > > >  	 */
> > > > > > -	overhead = EXT4_B2C(sbi, le32_to_cpu(es->s_first_data_block));
> > > > > > +	overhead = EXT4_NUM_B2C(sbi, le32_to_cpu(es->s_first_data_block));
> > > > 
> > > > ...except this. I do not think this is right because we do not skip
> > > > the first cluster right ? We're still using it, but we can never use
> > > > the block before es->s_first_data_block. Please correct me if I am
> > > > wrong.
> > 
> > Yes, I think you are right.
> > 
> > > 
> > > moreover we do not allow bigalloc file system with block size < 4k.
> > 
> > No, we allow user to use bigalloc with block size < 4k, such as:
> > 
> >   mkfs.ext4 -b 1024 -C 4096 -O bigalloc ${dev}
> > 
> > This command formats a bigalloc filesystem with blocksize = 1k and
> > clustersize = 4k, at least in e2fsprogs 1.42.7 it works well.
> > 
> 
> Ok, i was pretty sure that we do not allow that, it's good to know.
> Also, does it make any sense ? I do not think so, and I would really
> consider the fact that we allow that as a bug. We should not allow
> that otherwise it unnecessarily extending the test matrix.
> 
> What people think about restricting bigalloc _only_ for 4k block
> size file systems ?

I agree with you that we should forbid user to use bigalloc feature with
block size = 1k or 2k because I guess no one really use it, at least in
Taobao we always use bigalloc feature with block size = 4k.

FWIW, recently I am considering whether we could remove 'data=journal'
and 'data=writeback' mode.  'data=journal' mode hurts performance
dramatically.  Further, 'data=writeback' seems also useless, especially
after we have 'no journal' feature.  TBH, these modes are lack of tests.
Maybe this is a crazy idea in my mind.

Regards,
                                                - Zheng
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] ext4: fix overhead calculation in bigalloc filesystem (Re: ... )
  2013-02-22  3:03             ` Zheng Liu
@ 2013-02-22  4:05               ` Theodore Ts'o
  2013-02-22  8:04                 ` Lukáš Czerner
  2013-02-22 13:18                 ` Zheng Liu
  0 siblings, 2 replies; 19+ messages in thread
From: Theodore Ts'o @ 2013-02-22  4:05 UTC (permalink / raw)
  To: Lukáš Czerner, linux-ext4

On Fri, Feb 22, 2013 at 11:03:27AM +0800, Zheng Liu wrote:
> 
> I agree with you that we should forbid user to use bigalloc feature with
> block size = 1k or 2k because I guess no one really use it, at least in
> Taobao we always use bigalloc feature with block size = 4k.

The main reason why file systems with 1k or 2k (with or without
bigalloc) is that it's useful is that it's a good way of testing what
happens on an architecture with a 16k page size and a 4k blocksize.  I
am regularly testing with a 1k blocksize because it catches problems
that would otherwise only show up on PowerPC or Intanium platforms.
I'm not testing with bigalloc plus 1k block size, though, I admit.

> FWIW, recently I am considering whether we could remove 'data=journal'
> and 'data=writeback' mode.  'data=journal' mode hurts performance
> dramatically.  Further, 'data=writeback' seems also useless, especially
> after we have 'no journal' feature.  TBH, these modes are lack of tests.
> Maybe this is a crazy idea in my mind.

As far as data=writeback and data=ordered are concerned, the only
difference is a single call to ext4_jbd2_file_end() in
ext4_ordered_write_end().  In fact, it looks like we could do
something like the following attached patch to simplify the code and
improve code coverage from a testing perspective.  (Note: patch not
yet tested!)

As far as "data=journalled", it's a bit more complicated but I do have
a sentimental attachment to it, since it's something which no other
file system has.  I have been regularly testing it, and it's something
which has been pretty stable for quite a while now.

If there was a mode that I'd be tempted to get rid of, it would be the
combined data=ordered/data=writeback modes.  The main reason why we
keep it is because of a concern of buggy applications that depend on
the implied fsync() of data=ordered mode at each commit.  However,
ext4 has been around for long enough that I think most of the buggy
applications have been fixed by now.  And of course, those buggy
applications will lose in the same way when they are using btrfs and
xfs.  Something to consider....

						- Ted

commit d075b5c3031752a4c41642ff505c3302e5321940
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Thu Feb 21 23:04:59 2013 -0500

    ext4: collapse handling of data=ordered and data=writeback codepaths
    
    The only difference between how we handle data=ordered and
    data=writeback is a single call to ext4_jbd2_file_inode().  Eliminate
    code duplication by factoring out redundant the code paths.
    
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6e16c18..85dfd49 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1373,6 +1373,7 @@ enum {
 	EXT4_STATE_DIOREAD_LOCK,	/* Disable support for dio read
 					   nolocking */
 	EXT4_STATE_MAY_INLINE_DATA,	/* may have in-inode data */
+	EXT4_STATE_ORDERED_MODE,	/* data=ordered mode */
 };
 
 #define EXT4_INODE_BIT_FNS(name, field, offset)				\
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 95a0c62..2e338a6 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1055,77 +1055,36 @@ static int ext4_generic_write_end(struct file *file,
  * ext4 never places buffers on inode->i_mapping->private_list.  metadata
  * buffers are managed internally.
  */
-static int ext4_ordered_write_end(struct file *file,
-				  struct address_space *mapping,
-				  loff_t pos, unsigned len, unsigned copied,
-				  struct page *page, void *fsdata)
+static int ext4_write_end(struct file *file,
+			  struct address_space *mapping,
+			  loff_t pos, unsigned len, unsigned copied,
+			  struct page *page, void *fsdata)
 {
 	handle_t *handle = ext4_journal_current_handle();
 	struct inode *inode = mapping->host;
 	int ret = 0, ret2;
 
 	trace_ext4_ordered_write_end(inode, pos, len, copied);
-	ret = ext4_jbd2_file_inode(handle, inode);
-
-	if (ret == 0) {
-		ret2 = ext4_generic_write_end(file, mapping, pos, len, copied,
-							page, fsdata);
-		copied = ret2;
-		if (pos + len > inode->i_size && ext4_can_truncate(inode))
-			/* if we have allocated more blocks and copied
-			 * less. We will have blocks allocated outside
-			 * inode->i_size. So truncate them
-			 */
-			ext4_orphan_add(handle, inode);
-		if (ret2 < 0)
-			ret = ret2;
-	} else {
-		unlock_page(page);
-		page_cache_release(page);
-	}
-
-	ret2 = ext4_journal_stop(handle);
-	if (!ret)
-		ret = ret2;
-
-	if (pos + len > inode->i_size) {
-		ext4_truncate_failed_write(inode);
-		/*
-		 * If truncate failed early the inode might still be
-		 * on the orphan list; we need to make sure the inode
-		 * is removed from the orphan list in that case.
-		 */
-		if (inode->i_nlink)
-			ext4_orphan_del(NULL, inode);
+	if (ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE)) {
+		ret = ext4_jbd2_file_inode(handle, inode);
+		if (ret) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto errout;
+		}
 	}
 
-
-	return ret ? ret : copied;
-}
-
-static int ext4_writeback_write_end(struct file *file,
-				    struct address_space *mapping,
-				    loff_t pos, unsigned len, unsigned copied,
-				    struct page *page, void *fsdata)
-{
-	handle_t *handle = ext4_journal_current_handle();
-	struct inode *inode = mapping->host;
-	int ret = 0, ret2;
-
-	trace_ext4_writeback_write_end(inode, pos, len, copied);
-	ret2 = ext4_generic_write_end(file, mapping, pos, len, copied,
-							page, fsdata);
-	copied = ret2;
+	copied = ext4_generic_write_end(file, mapping, pos, len, copied,
+					page, fsdata);
+	if (copied < 0)
+		ret = copied;
 	if (pos + len > inode->i_size && ext4_can_truncate(inode))
 		/* if we have allocated more blocks and copied
 		 * less. We will have blocks allocated outside
 		 * inode->i_size. So truncate them
 		 */
 		ext4_orphan_add(handle, inode);
-
-	if (ret2 < 0)
-		ret = ret2;
-
+errout:
 	ret2 = ext4_journal_stop(handle);
 	if (!ret)
 		ret = ret2;
@@ -2656,18 +2615,9 @@ static int ext4_da_write_end(struct file *file,
 	unsigned long start, end;
 	int write_mode = (int)(unsigned long)fsdata;
 
-	if (write_mode == FALL_BACK_TO_NONDELALLOC) {
-		switch (ext4_inode_journal_mode(inode)) {
-		case EXT4_INODE_ORDERED_DATA_MODE:
-			return ext4_ordered_write_end(file, mapping, pos,
-					len, copied, page, fsdata);
-		case EXT4_INODE_WRITEBACK_DATA_MODE:
-			return ext4_writeback_write_end(file, mapping, pos,
-					len, copied, page, fsdata);
-		default:
-			BUG();
-		}
-	}
+	if (write_mode == FALL_BACK_TO_NONDELALLOC)
+		return ext4_write_end(file, mapping, pos,
+				      len, copied, page, fsdata);
 
 	trace_ext4_da_write_end(inode, pos, len, copied);
 	start = pos & (PAGE_CACHE_SIZE - 1);
@@ -3172,27 +3122,12 @@ static int ext4_journalled_set_page_dirty(struct page *page)
 	return __set_page_dirty_nobuffers(page);
 }
 
-static const struct address_space_operations ext4_ordered_aops = {
+static const struct address_space_operations ext4_aops = {
 	.readpage		= ext4_readpage,
 	.readpages		= ext4_readpages,
 	.writepage		= ext4_writepage,
 	.write_begin		= ext4_write_begin,
-	.write_end		= ext4_ordered_write_end,
-	.bmap			= ext4_bmap,
-	.invalidatepage		= ext4_invalidatepage,
-	.releasepage		= ext4_releasepage,
-	.direct_IO		= ext4_direct_IO,
-	.migratepage		= buffer_migrate_page,
-	.is_partially_uptodate  = block_is_partially_uptodate,
-	.error_remove_page	= generic_error_remove_page,
-};
-
-static const struct address_space_operations ext4_writeback_aops = {
-	.readpage		= ext4_readpage,
-	.readpages		= ext4_readpages,
-	.writepage		= ext4_writepage,
-	.write_begin		= ext4_write_begin,
-	.write_end		= ext4_writeback_write_end,
+	.write_end		= ext4_write_end,
 	.bmap			= ext4_bmap,
 	.invalidatepage		= ext4_invalidatepage,
 	.releasepage		= ext4_releasepage,
@@ -3237,23 +3172,21 @@ void ext4_set_aops(struct inode *inode)
 {
 	switch (ext4_inode_journal_mode(inode)) {
 	case EXT4_INODE_ORDERED_DATA_MODE:
-		if (test_opt(inode->i_sb, DELALLOC))
-			inode->i_mapping->a_ops = &ext4_da_aops;
-		else
-			inode->i_mapping->a_ops = &ext4_ordered_aops;
+		ext4_set_inode_state(inode, EXT4_STATE_ORDERED_MODE);
 		break;
 	case EXT4_INODE_WRITEBACK_DATA_MODE:
-		if (test_opt(inode->i_sb, DELALLOC))
-			inode->i_mapping->a_ops = &ext4_da_aops;
-		else
-			inode->i_mapping->a_ops = &ext4_writeback_aops;
+		ext4_clear_inode_state(inode, EXT4_STATE_ORDERED_MODE);
 		break;
 	case EXT4_INODE_JOURNAL_DATA_MODE:
 		inode->i_mapping->a_ops = &ext4_journalled_aops;
-		break;
+		return;
 	default:
 		BUG();
 	}
+	if (test_opt(inode->i_sb, DELALLOC))
+		inode->i_mapping->a_ops = &ext4_da_aops;
+	else
+		inode->i_mapping->a_ops = &ext4_aops;
 }
 
 

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] ext4: fix overhead calculation in bigalloc filesystem (Re: ... )
  2013-02-22  4:05               ` Theodore Ts'o
@ 2013-02-22  8:04                 ` Lukáš Czerner
  2013-02-22 13:18                 ` Zheng Liu
  1 sibling, 0 replies; 19+ messages in thread
From: Lukáš Czerner @ 2013-02-22  8:04 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Lukáš Czerner, linux-ext4

[-- Attachment #1: Type: TEXT/PLAIN, Size: 9600 bytes --]

On Thu, 21 Feb 2013, Theodore Ts'o wrote:

> Date: Thu, 21 Feb 2013 23:05:44 -0500
> From: Theodore Ts'o <tytso@mit.edu>
> To: Lukáš Czerner <lczerner@redhat.com>, linux-ext4@vger.kernel.org
> Subject: Re: [PATCH] ext4: fix overhead calculation in bigalloc filesystem
>     (Re: ... )
> 
> On Fri, Feb 22, 2013 at 11:03:27AM +0800, Zheng Liu wrote:
> > 
> > I agree with you that we should forbid user to use bigalloc feature with
> > block size = 1k or 2k because I guess no one really use it, at least in
> > Taobao we always use bigalloc feature with block size = 4k.
> 
> The main reason why file systems with 1k or 2k (with or without
> bigalloc) is that it's useful is that it's a good way of testing what
> happens on an architecture with a 16k page size and a 4k blocksize.  I
> am regularly testing with a 1k blocksize because it catches problems
> that would otherwise only show up on PowerPC or Intanium platforms.
> I'm not testing with bigalloc plus 1k block size, though, I admit.

I agree having block size smaller than 4k is useful not only for
testing purposes. However in my opinion with bigalloc it is entirely
unnecessary to allow those sizes.

-Lukas

> 
> > FWIW, recently I am considering whether we could remove 'data=journal'
> > and 'data=writeback' mode.  'data=journal' mode hurts performance
> > dramatically.  Further, 'data=writeback' seems also useless, especially
> > after we have 'no journal' feature.  TBH, these modes are lack of tests.
> > Maybe this is a crazy idea in my mind.
> 
> As far as data=writeback and data=ordered are concerned, the only
> difference is a single call to ext4_jbd2_file_end() in
> ext4_ordered_write_end().  In fact, it looks like we could do
> something like the following attached patch to simplify the code and
> improve code coverage from a testing perspective.  (Note: patch not
> yet tested!)
> 
> As far as "data=journalled", it's a bit more complicated but I do have
> a sentimental attachment to it, since it's something which no other
> file system has.  I have been regularly testing it, and it's something
> which has been pretty stable for quite a while now.
> 
> If there was a mode that I'd be tempted to get rid of, it would be the
> combined data=ordered/data=writeback modes.  The main reason why we
> keep it is because of a concern of buggy applications that depend on
> the implied fsync() of data=ordered mode at each commit.  However,
> ext4 has been around for long enough that I think most of the buggy
> applications have been fixed by now.  And of course, those buggy
> applications will lose in the same way when they are using btrfs and
> xfs.  Something to consider....
> 
> 						- Ted
> 
> commit d075b5c3031752a4c41642ff505c3302e5321940
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Thu Feb 21 23:04:59 2013 -0500
> 
>     ext4: collapse handling of data=ordered and data=writeback codepaths
>     
>     The only difference between how we handle data=ordered and
>     data=writeback is a single call to ext4_jbd2_file_inode().  Eliminate
>     code duplication by factoring out redundant the code paths.
>     
>     Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 6e16c18..85dfd49 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1373,6 +1373,7 @@ enum {
>  	EXT4_STATE_DIOREAD_LOCK,	/* Disable support for dio read
>  					   nolocking */
>  	EXT4_STATE_MAY_INLINE_DATA,	/* may have in-inode data */
> +	EXT4_STATE_ORDERED_MODE,	/* data=ordered mode */
>  };
>  
>  #define EXT4_INODE_BIT_FNS(name, field, offset)				\
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 95a0c62..2e338a6 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1055,77 +1055,36 @@ static int ext4_generic_write_end(struct file *file,
>   * ext4 never places buffers on inode->i_mapping->private_list.  metadata
>   * buffers are managed internally.
>   */
> -static int ext4_ordered_write_end(struct file *file,
> -				  struct address_space *mapping,
> -				  loff_t pos, unsigned len, unsigned copied,
> -				  struct page *page, void *fsdata)
> +static int ext4_write_end(struct file *file,
> +			  struct address_space *mapping,
> +			  loff_t pos, unsigned len, unsigned copied,
> +			  struct page *page, void *fsdata)
>  {
>  	handle_t *handle = ext4_journal_current_handle();
>  	struct inode *inode = mapping->host;
>  	int ret = 0, ret2;
>  
>  	trace_ext4_ordered_write_end(inode, pos, len, copied);
> -	ret = ext4_jbd2_file_inode(handle, inode);
> -
> -	if (ret == 0) {
> -		ret2 = ext4_generic_write_end(file, mapping, pos, len, copied,
> -							page, fsdata);
> -		copied = ret2;
> -		if (pos + len > inode->i_size && ext4_can_truncate(inode))
> -			/* if we have allocated more blocks and copied
> -			 * less. We will have blocks allocated outside
> -			 * inode->i_size. So truncate them
> -			 */
> -			ext4_orphan_add(handle, inode);
> -		if (ret2 < 0)
> -			ret = ret2;
> -	} else {
> -		unlock_page(page);
> -		page_cache_release(page);
> -	}
> -
> -	ret2 = ext4_journal_stop(handle);
> -	if (!ret)
> -		ret = ret2;
> -
> -	if (pos + len > inode->i_size) {
> -		ext4_truncate_failed_write(inode);
> -		/*
> -		 * If truncate failed early the inode might still be
> -		 * on the orphan list; we need to make sure the inode
> -		 * is removed from the orphan list in that case.
> -		 */
> -		if (inode->i_nlink)
> -			ext4_orphan_del(NULL, inode);
> +	if (ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE)) {
> +		ret = ext4_jbd2_file_inode(handle, inode);
> +		if (ret) {
> +			unlock_page(page);
> +			page_cache_release(page);
> +			goto errout;
> +		}
>  	}
>  
> -
> -	return ret ? ret : copied;
> -}
> -
> -static int ext4_writeback_write_end(struct file *file,
> -				    struct address_space *mapping,
> -				    loff_t pos, unsigned len, unsigned copied,
> -				    struct page *page, void *fsdata)
> -{
> -	handle_t *handle = ext4_journal_current_handle();
> -	struct inode *inode = mapping->host;
> -	int ret = 0, ret2;
> -
> -	trace_ext4_writeback_write_end(inode, pos, len, copied);
> -	ret2 = ext4_generic_write_end(file, mapping, pos, len, copied,
> -							page, fsdata);
> -	copied = ret2;
> +	copied = ext4_generic_write_end(file, mapping, pos, len, copied,
> +					page, fsdata);
> +	if (copied < 0)
> +		ret = copied;
>  	if (pos + len > inode->i_size && ext4_can_truncate(inode))
>  		/* if we have allocated more blocks and copied
>  		 * less. We will have blocks allocated outside
>  		 * inode->i_size. So truncate them
>  		 */
>  		ext4_orphan_add(handle, inode);
> -
> -	if (ret2 < 0)
> -		ret = ret2;
> -
> +errout:
>  	ret2 = ext4_journal_stop(handle);
>  	if (!ret)
>  		ret = ret2;
> @@ -2656,18 +2615,9 @@ static int ext4_da_write_end(struct file *file,
>  	unsigned long start, end;
>  	int write_mode = (int)(unsigned long)fsdata;
>  
> -	if (write_mode == FALL_BACK_TO_NONDELALLOC) {
> -		switch (ext4_inode_journal_mode(inode)) {
> -		case EXT4_INODE_ORDERED_DATA_MODE:
> -			return ext4_ordered_write_end(file, mapping, pos,
> -					len, copied, page, fsdata);
> -		case EXT4_INODE_WRITEBACK_DATA_MODE:
> -			return ext4_writeback_write_end(file, mapping, pos,
> -					len, copied, page, fsdata);
> -		default:
> -			BUG();
> -		}
> -	}
> +	if (write_mode == FALL_BACK_TO_NONDELALLOC)
> +		return ext4_write_end(file, mapping, pos,
> +				      len, copied, page, fsdata);
>  
>  	trace_ext4_da_write_end(inode, pos, len, copied);
>  	start = pos & (PAGE_CACHE_SIZE - 1);
> @@ -3172,27 +3122,12 @@ static int ext4_journalled_set_page_dirty(struct page *page)
>  	return __set_page_dirty_nobuffers(page);
>  }
>  
> -static const struct address_space_operations ext4_ordered_aops = {
> +static const struct address_space_operations ext4_aops = {
>  	.readpage		= ext4_readpage,
>  	.readpages		= ext4_readpages,
>  	.writepage		= ext4_writepage,
>  	.write_begin		= ext4_write_begin,
> -	.write_end		= ext4_ordered_write_end,
> -	.bmap			= ext4_bmap,
> -	.invalidatepage		= ext4_invalidatepage,
> -	.releasepage		= ext4_releasepage,
> -	.direct_IO		= ext4_direct_IO,
> -	.migratepage		= buffer_migrate_page,
> -	.is_partially_uptodate  = block_is_partially_uptodate,
> -	.error_remove_page	= generic_error_remove_page,
> -};
> -
> -static const struct address_space_operations ext4_writeback_aops = {
> -	.readpage		= ext4_readpage,
> -	.readpages		= ext4_readpages,
> -	.writepage		= ext4_writepage,
> -	.write_begin		= ext4_write_begin,
> -	.write_end		= ext4_writeback_write_end,
> +	.write_end		= ext4_write_end,
>  	.bmap			= ext4_bmap,
>  	.invalidatepage		= ext4_invalidatepage,
>  	.releasepage		= ext4_releasepage,
> @@ -3237,23 +3172,21 @@ void ext4_set_aops(struct inode *inode)
>  {
>  	switch (ext4_inode_journal_mode(inode)) {
>  	case EXT4_INODE_ORDERED_DATA_MODE:
> -		if (test_opt(inode->i_sb, DELALLOC))
> -			inode->i_mapping->a_ops = &ext4_da_aops;
> -		else
> -			inode->i_mapping->a_ops = &ext4_ordered_aops;
> +		ext4_set_inode_state(inode, EXT4_STATE_ORDERED_MODE);
>  		break;
>  	case EXT4_INODE_WRITEBACK_DATA_MODE:
> -		if (test_opt(inode->i_sb, DELALLOC))
> -			inode->i_mapping->a_ops = &ext4_da_aops;
> -		else
> -			inode->i_mapping->a_ops = &ext4_writeback_aops;
> +		ext4_clear_inode_state(inode, EXT4_STATE_ORDERED_MODE);
>  		break;
>  	case EXT4_INODE_JOURNAL_DATA_MODE:
>  		inode->i_mapping->a_ops = &ext4_journalled_aops;
> -		break;
> +		return;
>  	default:
>  		BUG();
>  	}
> +	if (test_opt(inode->i_sb, DELALLOC))
> +		inode->i_mapping->a_ops = &ext4_da_aops;
> +	else
> +		inode->i_mapping->a_ops = &ext4_aops;
>  }
>  
>  
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] ext4: fix overhead calculation in bigalloc filesystem (Re: ... )
  2013-02-22  4:05               ` Theodore Ts'o
  2013-02-22  8:04                 ` Lukáš Czerner
@ 2013-02-22 13:18                 ` Zheng Liu
  2013-02-22 15:20                   ` Theodore Ts'o
  1 sibling, 1 reply; 19+ messages in thread
From: Zheng Liu @ 2013-02-22 13:18 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Lukáš Czerner, linux-ext4

On Thu, Feb 21, 2013 at 11:05:44PM -0500, Theodore Ts'o wrote:
> On Fri, Feb 22, 2013 at 11:03:27AM +0800, Zheng Liu wrote:
> > 
> > I agree with you that we should forbid user to use bigalloc feature with
> > block size = 1k or 2k because I guess no one really use it, at least in
> > Taobao we always use bigalloc feature with block size = 4k.
> 
> The main reason why file systems with 1k or 2k (with or without
> bigalloc) is that it's useful is that it's a good way of testing what
> happens on an architecture with a 16k page size and a 4k blocksize.  I
> am regularly testing with a 1k blocksize because it catches problems
> that would otherwise only show up on PowerPC or Intanium platforms.
> I'm not testing with bigalloc plus 1k block size, though, I admit.
> 
> > FWIW, recently I am considering whether we could remove 'data=journal'
> > and 'data=writeback' mode.  'data=journal' mode hurts performance
> > dramatically.  Further, 'data=writeback' seems also useless, especially
> > after we have 'no journal' feature.  TBH, these modes are lack of tests.
> > Maybe this is a crazy idea in my mind.
> 
> As far as data=writeback and data=ordered are concerned, the only
> difference is a single call to ext4_jbd2_file_end() in
> ext4_ordered_write_end().  In fact, it looks like we could do
> something like the following attached patch to simplify the code and
> improve code coverage from a testing perspective.  (Note: patch not
> yet tested!)
> 
> As far as "data=journalled", it's a bit more complicated but I do have
> a sentimental attachment to it, since it's something which no other
> file system has.  I have been regularly testing it, and it's something
> which has been pretty stable for quite a while now.

Indeed, it seems that other file systems don't have this feature AFAIK.

> 
> If there was a mode that I'd be tempted to get rid of, it would be the
> combined data=ordered/data=writeback modes.  The main reason why we
> keep it is because of a concern of buggy applications that depend on
> the implied fsync() of data=ordered mode at each commit.  However,
> ext4 has been around for long enough that I think most of the buggy
> applications have been fixed by now.  And of course, those buggy
> applications will lose in the same way when they are using btrfs and
> xfs.  Something to consider....

IMHO, the application shouldn't depend on a kernel feature.  So maybe it
is time to highlight this buggy usage.

> 
> 						- Ted
> 
> commit d075b5c3031752a4c41642ff505c3302e5321940
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Thu Feb 21 23:04:59 2013 -0500
> 
>     ext4: collapse handling of data=ordered and data=writeback codepaths
>     
>     The only difference between how we handle data=ordered and
>     data=writeback is a single call to ext4_jbd2_file_inode().  Eliminate
>     code duplication by factoring out redundant the code paths.
>     
>     Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

Just one minor comment below.  Otherwise the patch looks good to me, and
it can pass in xfstests with 'data=ordered' and 'data=writeback'.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>

> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 6e16c18..85dfd49 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1373,6 +1373,7 @@ enum {
>  	EXT4_STATE_DIOREAD_LOCK,	/* Disable support for dio read
>  					   nolocking */
>  	EXT4_STATE_MAY_INLINE_DATA,	/* may have in-inode data */
> +	EXT4_STATE_ORDERED_MODE,	/* data=ordered mode */
>  };
>  
>  #define EXT4_INODE_BIT_FNS(name, field, offset)				\
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 95a0c62..2e338a6 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1055,77 +1055,36 @@ static int ext4_generic_write_end(struct file *file,
>   * ext4 never places buffers on inode->i_mapping->private_list.  metadata
>   * buffers are managed internally.
>   */
> -static int ext4_ordered_write_end(struct file *file,
> -				  struct address_space *mapping,
> -				  loff_t pos, unsigned len, unsigned copied,
> -				  struct page *page, void *fsdata)
> +static int ext4_write_end(struct file *file,
> +			  struct address_space *mapping,
> +			  loff_t pos, unsigned len, unsigned copied,
> +			  struct page *page, void *fsdata)
>  {
>  	handle_t *handle = ext4_journal_current_handle();
>  	struct inode *inode = mapping->host;
>  	int ret = 0, ret2;
>  
>  	trace_ext4_ordered_write_end(inode, pos, len, copied);

Here this function needs to be renamed with trace_ext4_write_end().

Regards,
                                                - Zheng

> -	ret = ext4_jbd2_file_inode(handle, inode);
> -
> -	if (ret == 0) {
> -		ret2 = ext4_generic_write_end(file, mapping, pos, len, copied,
> -							page, fsdata);
> -		copied = ret2;
> -		if (pos + len > inode->i_size && ext4_can_truncate(inode))
> -			/* if we have allocated more blocks and copied
> -			 * less. We will have blocks allocated outside
> -			 * inode->i_size. So truncate them
> -			 */
> -			ext4_orphan_add(handle, inode);
> -		if (ret2 < 0)
> -			ret = ret2;
> -	} else {
> -		unlock_page(page);
> -		page_cache_release(page);
> -	}
> -
> -	ret2 = ext4_journal_stop(handle);
> -	if (!ret)
> -		ret = ret2;
> -
> -	if (pos + len > inode->i_size) {
> -		ext4_truncate_failed_write(inode);
> -		/*
> -		 * If truncate failed early the inode might still be
> -		 * on the orphan list; we need to make sure the inode
> -		 * is removed from the orphan list in that case.
> -		 */
> -		if (inode->i_nlink)
> -			ext4_orphan_del(NULL, inode);
> +	if (ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE)) {
> +		ret = ext4_jbd2_file_inode(handle, inode);
> +		if (ret) {
> +			unlock_page(page);
> +			page_cache_release(page);
> +			goto errout;
> +		}
>  	}
>  
> -
> -	return ret ? ret : copied;
> -}
> -
> -static int ext4_writeback_write_end(struct file *file,
> -				    struct address_space *mapping,
> -				    loff_t pos, unsigned len, unsigned copied,
> -				    struct page *page, void *fsdata)
> -{
> -	handle_t *handle = ext4_journal_current_handle();
> -	struct inode *inode = mapping->host;
> -	int ret = 0, ret2;
> -
> -	trace_ext4_writeback_write_end(inode, pos, len, copied);
> -	ret2 = ext4_generic_write_end(file, mapping, pos, len, copied,
> -							page, fsdata);
> -	copied = ret2;
> +	copied = ext4_generic_write_end(file, mapping, pos, len, copied,
> +					page, fsdata);
> +	if (copied < 0)
> +		ret = copied;
>  	if (pos + len > inode->i_size && ext4_can_truncate(inode))
>  		/* if we have allocated more blocks and copied
>  		 * less. We will have blocks allocated outside
>  		 * inode->i_size. So truncate them
>  		 */
>  		ext4_orphan_add(handle, inode);
> -
> -	if (ret2 < 0)
> -		ret = ret2;
> -
> +errout:
>  	ret2 = ext4_journal_stop(handle);
>  	if (!ret)
>  		ret = ret2;
> @@ -2656,18 +2615,9 @@ static int ext4_da_write_end(struct file *file,
>  	unsigned long start, end;
>  	int write_mode = (int)(unsigned long)fsdata;
>  
> -	if (write_mode == FALL_BACK_TO_NONDELALLOC) {
> -		switch (ext4_inode_journal_mode(inode)) {
> -		case EXT4_INODE_ORDERED_DATA_MODE:
> -			return ext4_ordered_write_end(file, mapping, pos,
> -					len, copied, page, fsdata);
> -		case EXT4_INODE_WRITEBACK_DATA_MODE:
> -			return ext4_writeback_write_end(file, mapping, pos,
> -					len, copied, page, fsdata);
> -		default:
> -			BUG();
> -		}
> -	}
> +	if (write_mode == FALL_BACK_TO_NONDELALLOC)
> +		return ext4_write_end(file, mapping, pos,
> +				      len, copied, page, fsdata);
>  
>  	trace_ext4_da_write_end(inode, pos, len, copied);
>  	start = pos & (PAGE_CACHE_SIZE - 1);
> @@ -3172,27 +3122,12 @@ static int ext4_journalled_set_page_dirty(struct page *page)
>  	return __set_page_dirty_nobuffers(page);
>  }
>  
> -static const struct address_space_operations ext4_ordered_aops = {
> +static const struct address_space_operations ext4_aops = {
>  	.readpage		= ext4_readpage,
>  	.readpages		= ext4_readpages,
>  	.writepage		= ext4_writepage,
>  	.write_begin		= ext4_write_begin,
> -	.write_end		= ext4_ordered_write_end,
> -	.bmap			= ext4_bmap,
> -	.invalidatepage		= ext4_invalidatepage,
> -	.releasepage		= ext4_releasepage,
> -	.direct_IO		= ext4_direct_IO,
> -	.migratepage		= buffer_migrate_page,
> -	.is_partially_uptodate  = block_is_partially_uptodate,
> -	.error_remove_page	= generic_error_remove_page,
> -};
> -
> -static const struct address_space_operations ext4_writeback_aops = {
> -	.readpage		= ext4_readpage,
> -	.readpages		= ext4_readpages,
> -	.writepage		= ext4_writepage,
> -	.write_begin		= ext4_write_begin,
> -	.write_end		= ext4_writeback_write_end,
> +	.write_end		= ext4_write_end,
>  	.bmap			= ext4_bmap,
>  	.invalidatepage		= ext4_invalidatepage,
>  	.releasepage		= ext4_releasepage,
> @@ -3237,23 +3172,21 @@ void ext4_set_aops(struct inode *inode)
>  {
>  	switch (ext4_inode_journal_mode(inode)) {
>  	case EXT4_INODE_ORDERED_DATA_MODE:
> -		if (test_opt(inode->i_sb, DELALLOC))
> -			inode->i_mapping->a_ops = &ext4_da_aops;
> -		else
> -			inode->i_mapping->a_ops = &ext4_ordered_aops;
> +		ext4_set_inode_state(inode, EXT4_STATE_ORDERED_MODE);
>  		break;
>  	case EXT4_INODE_WRITEBACK_DATA_MODE:
> -		if (test_opt(inode->i_sb, DELALLOC))
> -			inode->i_mapping->a_ops = &ext4_da_aops;
> -		else
> -			inode->i_mapping->a_ops = &ext4_writeback_aops;
> +		ext4_clear_inode_state(inode, EXT4_STATE_ORDERED_MODE);
>  		break;
>  	case EXT4_INODE_JOURNAL_DATA_MODE:
>  		inode->i_mapping->a_ops = &ext4_journalled_aops;
> -		break;
> +		return;
>  	default:
>  		BUG();
>  	}
> +	if (test_opt(inode->i_sb, DELALLOC))
> +		inode->i_mapping->a_ops = &ext4_da_aops;
> +	else
> +		inode->i_mapping->a_ops = &ext4_aops;
>  }
>  
>  
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] ext4: fix overhead calculation in bigalloc filesystem (Re: ... )
  2013-02-22 13:18                 ` Zheng Liu
@ 2013-02-22 15:20                   ` Theodore Ts'o
  2013-02-22 16:26                     ` Zheng Liu
  2013-03-24 12:29                     ` [PATCH] ext4: fold ext4_generic_write_end into ext4_write_end Zheng Liu
  0 siblings, 2 replies; 19+ messages in thread
From: Theodore Ts'o @ 2013-02-22 15:20 UTC (permalink / raw)
  To: Lukáš Czerner, linux-ext4

On Fri, Feb 22, 2013 at 09:18:12PM +0800, Zheng Liu wrote:
> > If there was a mode that I'd be tempted to get rid of, it would be the
> > combined data=ordered/data=writeback modes.  The main reason why we
> > keep it is because of a concern of buggy applications that depend on
> > the implied fsync() of data=ordered mode at each commit.  However,
> > ext4 has been around for long enough that I think most of the buggy
> > applications have been fixed by now.  And of course, those buggy
> > applications will lose in the same way when they are using btrfs and
> > xfs.  Something to consider....
> 
> IMHO, the application shouldn't depend on a kernel feature.  So maybe it
> is time to highlight this buggy usage.

Oh, agreed.  The question is how many people want us to keep the ext3
semantics to support those buggy applications.  To the extent that
distros are considering using ext4 to support ext3 file systems, there
might be a desire to maintain (as closely as possible) ext3 semantics,
even those that support buggy applications.  The primary problem is
that the when comes down to application developers versus file system
developers, the application developers vastly outnumber us.   :-)

> Just one minor comment below.  Otherwise the patch looks good to me, and
> it can pass in xfstests with 'data=ordered' and 'data=writeback'.

I hadn't bothered testing it yet because I'm focused on testing
and cleaning up the set of patches for the merge window --- and this
change is clearly for the next merge window.  Thanks for testing it!

> >  	trace_ext4_ordered_write_end(inode, pos, len, copied);
> 
> Here this function needs to be renamed with trace_ext4_write_end().

Yes, agreed.

I also need to get rid of trace_ext4_writeback_write_end() in
include/trace/events/ext4.h.

The other thing that needs to be done --- probably in a separate
commit, just to make things easier to review for correctness, is now
that we've folded ext4_writeback_write_end() and ext4_ordered_write_end()
into a single function, we now have a single user of
ext4_generic_write_end(), so we can now fold ext4_generic_write_end()
into ext4_write_end().

						- Ted

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] ext4: fix overhead calculation in bigalloc filesystem (Re: ... )
  2013-02-22 15:20                   ` Theodore Ts'o
@ 2013-02-22 16:26                     ` Zheng Liu
  2013-03-24 12:29                     ` [PATCH] ext4: fold ext4_generic_write_end into ext4_write_end Zheng Liu
  1 sibling, 0 replies; 19+ messages in thread
From: Zheng Liu @ 2013-02-22 16:26 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Lukáš Czerner, linux-ext4

On Fri, Feb 22, 2013 at 10:20:42AM -0500, Theodore Ts'o wrote:
> On Fri, Feb 22, 2013 at 09:18:12PM +0800, Zheng Liu wrote:
> > > If there was a mode that I'd be tempted to get rid of, it would be the
> > > combined data=ordered/data=writeback modes.  The main reason why we
> > > keep it is because of a concern of buggy applications that depend on
> > > the implied fsync() of data=ordered mode at each commit.  However,
> > > ext4 has been around for long enough that I think most of the buggy
> > > applications have been fixed by now.  And of course, those buggy
> > > applications will lose in the same way when they are using btrfs and
> > > xfs.  Something to consider....
> > 
> > IMHO, the application shouldn't depend on a kernel feature.  So maybe it
> > is time to highlight this buggy usage.
> 
> Oh, agreed.  The question is how many people want us to keep the ext3
> semantics to support those buggy applications.  To the extent that
> distros are considering using ext4 to support ext3 file systems, there
> might be a desire to maintain (as closely as possible) ext3 semantics,
> even those that support buggy applications.  The primary problem is
> that the when comes down to application developers versus file system
> developers, the application developers vastly outnumber us.   :-)

Yes, as file system developers we always need to meet the application
developers' requirement.  So it seems that we still need to keep it in
ext4. :-)

> 
> > Just one minor comment below.  Otherwise the patch looks good to me, and
> > it can pass in xfstests with 'data=ordered' and 'data=writeback'.
> 
> I hadn't bothered testing it yet because I'm focused on testing
> and cleaning up the set of patches for the merge window --- and this
> change is clearly for the next merge window.  Thanks for testing it!

I guess you are busy testing patches for the merge window.  One thing I
need to let you know is this patch [1].  I really think it should be
applied for this merge window because it fixes a security hole due to my
fault.  As commit log describes, a non-privilege user could cause system
crash using a truncate(1) command.  So please check it.

1. http://www.spinics.net/lists/linux-ext4/msg36784.html

> 
> > >  	trace_ext4_ordered_write_end(inode, pos, len, copied);
> > 
> > Here this function needs to be renamed with trace_ext4_write_end().
> 
> Yes, agreed.
> 
> I also need to get rid of trace_ext4_writeback_write_end() in
> include/trace/events/ext4.h.
> 
> The other thing that needs to be done --- probably in a separate
> commit, just to make things easier to review for correctness, is now
> that we've folded ext4_writeback_write_end() and ext4_ordered_write_end()
> into a single function, we now have a single user of
> ext4_generic_write_end(), so we can now fold ext4_generic_write_end()
> into ext4_write_end().

Yes, we can take a close look at in next merge window.

Thanks,
                                                - Zheng

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH] ext4: fold ext4_generic_write_end into ext4_write_end
  2013-02-22 15:20                   ` Theodore Ts'o
  2013-02-22 16:26                     ` Zheng Liu
@ 2013-03-24 12:29                     ` Zheng Liu
  2013-03-25  0:07                       ` Theodore Ts'o
  1 sibling, 1 reply; 19+ messages in thread
From: Zheng Liu @ 2013-03-24 12:29 UTC (permalink / raw)
  To: linux-ext4; +Cc: Zheng Liu, Theodore Ts'o

From: Zheng Liu <wenqing.lz@taobao.com>

After collpasing the handling of data ordered and data writeback
codepath, ext4_generic_write_end has only one caller that is
ext4_write_end.  So we fold it into ext4_write_end.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
---
This commit is against dev-test branch of ext4.

Regards,
						- Zheng

 fs/ext4/inode.c | 67 +++++++++++++++++++++++----------------------------------
 1 file changed, 27 insertions(+), 40 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2934906..1464fcd 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -997,14 +997,32 @@ static int write_end_fn(handle_t *handle, struct buffer_head *bh)
 	return ext4_handle_dirty_metadata(handle, NULL, bh);
 }
 
-static int ext4_generic_write_end(struct file *file,
-				  struct address_space *mapping,
-				  loff_t pos, unsigned len, unsigned copied,
-				  struct page *page, void *fsdata)
+/*
+ * We need to pick up the new inode size which generic_commit_write gave us
+ * `file' can be NULL - eg, when called from page_symlink().
+ *
+ * ext4 never places buffers on inode->i_mapping->private_list.  metadata
+ * buffers are managed internally.
+ */
+static int ext4_write_end(struct file *file,
+			  struct address_space *mapping,
+			  loff_t pos, unsigned len, unsigned copied,
+			  struct page *page, void *fsdata)
 {
-	int i_size_changed = 0;
-	struct inode *inode = mapping->host;
 	handle_t *handle = ext4_journal_current_handle();
+	struct inode *inode = mapping->host;
+	int ret = 0, ret2;
+	int i_size_changed = 0;
+
+	trace_ext4_write_end(inode, pos, len, copied);
+	if (ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE)) {
+		ret = ext4_jbd2_file_inode(handle, inode);
+		if (ret) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto errout;
+		}
+	}
 
 	if (ext4_has_inline_data(inode))
 		copied = ext4_write_inline_data_end(inode, pos, len,
@@ -1015,7 +1033,7 @@ static int ext4_generic_write_end(struct file *file,
 
 	/*
 	 * No need to use i_size_read() here, the i_size
-	 * cannot change under us because we hold i_mutex.
+	 * cannot change under us because we hole i_mutex.
 	 *
 	 * But it's important to update i_size while still holding page lock:
 	 * page writeout could otherwise come in and zero beyond i_size.
@@ -1025,10 +1043,10 @@ static int ext4_generic_write_end(struct file *file,
 		i_size_changed = 1;
 	}
 
-	if (pos + copied >  EXT4_I(inode)->i_disksize) {
+	if (pos + copied > EXT4_I(inode)->i_disksize) {
 		/* We need to mark inode dirty even if
 		 * new_i_size is less that inode->i_size
-		 * bu greater than i_disksize.(hint delalloc)
+		 * but greater than i_disksize. (hint delalloc)
 		 */
 		ext4_update_i_disksize(inode, (pos + copied));
 		i_size_changed = 1;
@@ -1045,37 +1063,6 @@ static int ext4_generic_write_end(struct file *file,
 	if (i_size_changed)
 		ext4_mark_inode_dirty(handle, inode);
 
-	return copied;
-}
-
-/*
- * We need to pick up the new inode size which generic_commit_write gave us
- * `file' can be NULL - eg, when called from page_symlink().
- *
- * ext4 never places buffers on inode->i_mapping->private_list.  metadata
- * buffers are managed internally.
- */
-static int ext4_write_end(struct file *file,
-			  struct address_space *mapping,
-			  loff_t pos, unsigned len, unsigned copied,
-			  struct page *page, void *fsdata)
-{
-	handle_t *handle = ext4_journal_current_handle();
-	struct inode *inode = mapping->host;
-	int ret = 0, ret2;
-
-	trace_ext4_write_end(inode, pos, len, copied);
-	if (ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE)) {
-		ret = ext4_jbd2_file_inode(handle, inode);
-		if (ret) {
-			unlock_page(page);
-			page_cache_release(page);
-			goto errout;
-		}
-	}

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] ext4: fold ext4_generic_write_end into ext4_write_end
  2013-03-24 12:29                     ` [PATCH] ext4: fold ext4_generic_write_end into ext4_write_end Zheng Liu
@ 2013-03-25  0:07                       ` Theodore Ts'o
  0 siblings, 0 replies; 19+ messages in thread
From: Theodore Ts'o @ 2013-03-25  0:07 UTC (permalink / raw)
  To: Zheng Liu; +Cc: linux-ext4, Zheng Liu

On Sun, Mar 24, 2013 at 08:29:16PM +0800, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> After collpasing the handling of data ordered and data writeback
> codepath, ext4_generic_write_end has only one caller that is
> ext4_write_end.  So we fold it into ext4_write_end.
> 
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> Cc: "Theodore Ts'o" <tytso@mit.edu>

Applied, thanks.

						- Ted

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] ext4: fix overhead calculation in bigalloc filesystem (Re: ... )
  2013-02-21 12:40   ` Lukáš Czerner
  2013-02-21 12:50     ` Lukáš Czerner
@ 2013-02-21 13:12     ` Zheng Liu
  1 sibling, 0 replies; 19+ messages in thread
From: Zheng Liu @ 2013-02-21 13:12 UTC (permalink / raw)
  To: Lukáš Czerner; +Cc: linux-ext4, Theodore Ts'o

On Thu, Feb 21, 2013 at 01:40:34PM +0100, Lukáš Czerner wrote:
[snip]
> 
> Hi Zheng,
> 
> thanks for the review. I know about the other issues and I'm trying
> to resolve those as well. Right now I have a patch which includes
> the changes ext4_calculate_overhead() you've described below and more,
> but even with this I still see some problems remaining.
> 
> Hopefully will send another patch soon.

Yeah, we will get a lot of xfstests failures about bigalloc feature when
bigalloc and delalloc are enabled simultaneouly.  So that would be great
if we could fix these problems.  Looking forward your patch. :-)

Regards,
                                                - Zheng

> > 
> > Subject: [PATCH] ext4: fix overhead calculation in bigalloc filesystem
> > 
> > From: Zheng Liu <wenqing.lz@taobao.com>
> > 
> > ext4_calculate_overhead() should compute the overhead and stash it in
> > sbi->s_overhead.  But we miss use EXT4_B2C() to calculate the number of
> > clusters before first_data_block and the number of journal blocks.  This
> > commit use EXT4_NUM_B2C() instead of EXT4_B2C() to calculate the
> > overhead.
> > 
> > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> > Cc: "Theodore Ts'o" <tytso@mit.edu>
> > ---
> >  fs/ext4/super.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index 3d4fb81..6165558 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -3219,7 +3219,7 @@ int ext4_calculate_overhead(struct super_block *sb)
> >  	/*
> >  	 * All of the blocks before first_data_block are overhead
> >  	 */
> > -	overhead = EXT4_B2C(sbi, le32_to_cpu(es->s_first_data_block));
> > +	overhead = EXT4_NUM_B2C(sbi, le32_to_cpu(es->s_first_data_block));
> >  
> >  	/*
> >  	 * Add the overhead found in each block group
> > @@ -3235,7 +3235,7 @@ int ext4_calculate_overhead(struct super_block *sb)
> >  	}
> >  	/* Add the journal blocks as well */
> >  	if (sbi->s_journal)
> > -		overhead += EXT4_B2C(sbi, sbi->s_journal->j_maxlen);
> > +		overhead += EXT4_NUM_B2C(sbi, sbi->s_journal->j_maxlen);
> >  
> >  	sbi->s_overhead = overhead;
> >  	smp_wmb();
> > 
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] ext4: fix free clusters calculation in bigalloc filesystem
  2013-02-21  8:01 [PATCH] ext4: fix free clusters calculation in bigalloc filesystem Lukas Czerner
  2013-02-21 12:15 ` [PATCH] ext4: fix overhead calculation in bigalloc filesystem (Re: ... ) Zheng Liu
@ 2013-02-22  5:10 ` Theodore Ts'o
  2013-02-22  7:57   ` Lukáš Czerner
  2013-02-22  8:39 ` [PATCH v2] " Lukas Czerner
  2 siblings, 1 reply; 19+ messages in thread
From: Theodore Ts'o @ 2013-02-22  5:10 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: linux-ext4

On Thu, Feb 21, 2013 at 09:01:05AM +0100, Lukas Czerner wrote:
> 
> Moreover when calculating number of root clusters we should be using
> macro EXT4_NUM_B2C() instead of EXT4_C2B() otherwise the result will
> usually be off by one.

That should be "instead of EXT4_B2C()", and I don't think this is
true, since the number of blocks should always be a multiple of the
cluster ratio.  So the use of EXT4_B2C() is a bit of an optimization
(it avoids an add and mask operation), but it's probably not a
measurable optimizatoin in practice.

							- Ted

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] ext4: fix free clusters calculation in bigalloc filesystem
  2013-02-22  5:10 ` [PATCH] ext4: fix free clusters calculation in bigalloc filesystem Theodore Ts'o
@ 2013-02-22  7:57   ` Lukáš Czerner
  0 siblings, 0 replies; 19+ messages in thread
From: Lukáš Czerner @ 2013-02-22  7:57 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Lukas Czerner, linux-ext4

On Fri, 22 Feb 2013, Theodore Ts'o wrote:

> Date: Fri, 22 Feb 2013 00:10:48 -0500
> From: Theodore Ts'o <tytso@mit.edu>
> To: Lukas Czerner <lczerner@redhat.com>
> Cc: linux-ext4@vger.kernel.org
> Subject: Re: [PATCH] ext4: fix free clusters calculation in bigalloc
>     filesystem
> 
> On Thu, Feb 21, 2013 at 09:01:05AM +0100, Lukas Czerner wrote:
> > 
> > Moreover when calculating number of root clusters we should be using
> > macro EXT4_NUM_B2C() instead of EXT4_C2B() otherwise the result will
> > usually be off by one.
> 
> That should be "instead of EXT4_B2C()", and I don't think this is
> true, since the number of blocks should always be a multiple of the
> cluster ratio.  So the use of EXT4_B2C() is a bit of an optimization
> (it avoids an add and mask operation), but it's probably not a
> measurable optimizatoin in practice.
> 
> 							- Ted

Oh right it was EXT4_B2C(), I'll update the description. Even though
it should always be a multiple of cluster ratio it's really bad
precedence for the use of EXT4_B2C() since it is confusing enough
as it is with blocks/clusters.

I'll resend.

-Lukas

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2] ext4: fix free clusters calculation in bigalloc filesystem
  2013-02-21  8:01 [PATCH] ext4: fix free clusters calculation in bigalloc filesystem Lukas Czerner
  2013-02-21 12:15 ` [PATCH] ext4: fix overhead calculation in bigalloc filesystem (Re: ... ) Zheng Liu
  2013-02-22  5:10 ` [PATCH] ext4: fix free clusters calculation in bigalloc filesystem Theodore Ts'o
@ 2013-02-22  8:39 ` Lukas Czerner
  2 siblings, 0 replies; 19+ messages in thread
From: Lukas Czerner @ 2013-02-22  8:39 UTC (permalink / raw)
  To: linux-ext4; +Cc: Lukas Czerner

ext4_has_free_clusters() should tell us whether there is enough free
clusters to allocate, however number of free clusters in the file system
is converted to blocks using EXT4_C2B() which is not only wrong use of
the macro (we should have used EXT4_NUM_B2C) but it's also completely
wrong concept since everything else is in cluster units.

Moreover when calculating number of root clusters we should be using
macro EXT4_NUM_B2C() instead of EXT4_B2C() otherwise the result might be
off by one. However r_blocks_count should always be a multiple of the
cluster ratio so doing a plain bit shift should be enough here. We
avoid using EXT4_B2C() because it's confusing.

As a result of the first problem number of free clusters is much bigger
than it should have been and ext4_has_free_clusters() would return 1 even
if there is really not enough free clusters available.

Fix this by removing the EXT4_C2B() conversion of free clusters and
using bit shift when calculating number of root clusters. This bug
affects number of xfstests tests covering file system ENOSPC situation
handling. With this patch most of the ENOSPC problems with bigalloc file
system disappear, especially the errors caused by delayed allocation not
having enough space when the actual allocation is finally requested.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
v2: update description, use just bit shift to calculate root_clusters

 fs/ext4/balloc.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index cf18217..30444b0 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -482,11 +482,16 @@ static int ext4_has_free_clusters(struct ext4_sb_info *sbi,

 	free_clusters  = percpu_counter_read_positive(fcc);
 	dirty_clusters = percpu_counter_read_positive(dcc);
-	root_clusters = EXT4_B2C(sbi, ext4_r_blocks_count(sbi->s_es));
+
+	/*
+	 * r_blocks_count should always be multiple of the cluster ratio so
+	 * we are safe to do a plane bit shift only.
+	 */
+	root_clusters = ext4_r_blocks_count(sbi->s_es) >> sbi->s_cluster_bits;

 	if (free_clusters - (nclusters + root_clusters + dirty_clusters) <
 					EXT4_FREECLUSTERS_WATERMARK) {
-		free_clusters  = EXT4_C2B(sbi, percpu_counter_sum_positive(fcc));
+		free_clusters  = percpu_counter_sum_positive(fcc);
 		dirty_clusters = percpu_counter_sum_positive(dcc);
 	}
 	/* Check whether we have space after accounting for current
-- 
1.7.7.6

^ permalink raw reply related	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2013-03-25  0:07 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-02-21  8:01 [PATCH] ext4: fix free clusters calculation in bigalloc filesystem Lukas Czerner
2013-02-21 12:15 ` [PATCH] ext4: fix overhead calculation in bigalloc filesystem (Re: ... ) Zheng Liu
2013-02-21 12:40   ` Lukáš Czerner
2013-02-21 12:50     ` Lukáš Czerner
2013-02-21 12:52       ` Lukáš Czerner
2013-02-21 13:49         ` Zheng Liu
2013-02-21 14:56           ` Lukáš Czerner
2013-02-22  3:03             ` Zheng Liu
2013-02-22  4:05               ` Theodore Ts'o
2013-02-22  8:04                 ` Lukáš Czerner
2013-02-22 13:18                 ` Zheng Liu
2013-02-22 15:20                   ` Theodore Ts'o
2013-02-22 16:26                     ` Zheng Liu
2013-03-24 12:29                     ` [PATCH] ext4: fold ext4_generic_write_end into ext4_write_end Zheng Liu
2013-03-25  0:07                       ` Theodore Ts'o
2013-02-21 13:12     ` [PATCH] ext4: fix overhead calculation in bigalloc filesystem (Re: ... ) Zheng Liu
2013-02-22  5:10 ` [PATCH] ext4: fix free clusters calculation in bigalloc filesystem Theodore Ts'o
2013-02-22  7:57   ` Lukáš Czerner
2013-02-22  8:39 ` [PATCH v2] " Lukas Czerner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).