* [RFC PATCH 0/1] ext4: Try to better reuse recently freed space
  2013-07-04  9:11 UTC, Lukas Czerner
  To: linux-ext4; Cc: enwlinux, Jose_Mario_Gallegos, jordan_hargrave, rwheeler

Hi all,

as I recently discovered at http://www.ogris.de/blkalloc/, ext4 has some
unexpected allocation strategies which can cause problems in certain
scenarios (see the 1/1 patch).

This is my attempt to fix it. However, I think this will need some
discussion, because it changes the block allocator heuristic quite
significantly, and I would like to make sure that it will not hurt our
performance in some widely used workloads. Eric, could you run your
performance test suite to see how it performs?

Also, I think that the old behaviour might be quite helpful for SSDs:
since we are not overwriting existing blocks but rather trying to use
new ones, the firmware should have an easier job. However, I do not know
how significant that impact is, if there is any. If we find out that it
is helpful for SSDs, we might make this conditional on the type of the
device.

Which brings me to the other problem. Since the original behaviour is
really bad for thinly provisioned devices, we would like to avoid using
it even if the underlying storage is an SSD; however, IIRC there is no
way to distinguish the two cases from within the kernel.

See the comparison between the old and new allocator behaviour:

	http://people.redhat.com/lczerner/allocator/

Comments and testing are welcome. Thanks!

-Lukas
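The failure mode described above is easiest to see with a small
reproducer. The following is a minimal sketch, not part of the original
posting: the file name, pass count, and sizes are arbitrary, and it
assumes it is run on the ext4 mount under test.

	/* Rewrite-in-a-loop workload: each pass frees all of the file's
	 * blocks, yet with a global stream goal each pass is allocated
	 * further into the device instead of reusing the freed space. */
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		static char buf[1 << 20];		/* 1 MiB of payload */
		int pass, i, fd;

		memset(buf, 0xab, sizeof(buf));
		for (pass = 0; pass < 100; pass++) {
			fd = open("testfile", O_CREAT | O_TRUNC | O_WRONLY, 0644);
			if (fd < 0) {
				perror("open");
				return 1;
			}
			for (i = 0; i < 64; i++)	/* 64 MiB per pass */
				if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
					perror("write");
					return 1;
				}
			fsync(fd);	/* force the blocks to be allocated */
			close(fd);
			/* inspect the physical block numbers between
			 * passes, e.g. with filefrag(8) */
		}
		return 0;
	}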
* [RFC PATCH 1/1] ext4: Try to better reuse recently freed space
  2013-07-04  9:11 UTC, Lukas Czerner
  To: linux-ext4
  Cc: enwlinux, Jose_Mario_Gallegos, jordan_hargrave, rwheeler, Lukas Czerner

Currently, if the block allocator cannot find the goal to allocate, we
use a global goal for stream allocation. However, the global goal
(s_mb_last_group and s_mb_last_start) moves further every time such an
allocation happens and never moves backwards.

This causes several problems in certain scenarios:

- The goal will move further and further, preventing us from reusing
  space which might have been freed since then. This is fine from the
  file system point of view, because we will reuse that space
  eventually; however, we are allocating blocks from slower parts of
  the spinning disk even though it might not be necessary.
- The above also causes a more serious problem, for example for thinly
  provisioned storage (sparse-image-backed storage as well), because
  instead of reusing blocks which are already provisioned we try to use
  new blocks. This unnecessarily drains the storage's free block pool.
- This also causes blocks to be allocated further from the given goal
  than necessary. Consider, for example, truncating, or removing and
  rewriting, a file in a loop. This workload will never reuse freed
  blocks until we have claimed and freed every block in the file
  system.

Note that file systems like xfs, ext3, and btrfs do not have this
problem. It is simply caused by the notion of a global goal.

Fix this by changing the global goal into a per-inode goal. This allows
us to invalidate the goal every time the inode is truncated or newly
created, so in those cases we will try to use the proper, more specific
goal, which is based on the inode's position.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 fs/ext4/ext4.h    |  7 ++++---
 fs/ext4/inode.c   |  8 ++++++++
 fs/ext4/mballoc.c | 20 ++++++++------------
 3 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6ed348d..4dffa92 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -917,6 +917,10 @@ struct ext4_inode_info {
 
 	/* Precomputed uuid+inum+igen checksum for seeding inode checksums */
 	__u32 i_csum_seed;
+
+	/* where last allocation was done - for stream allocation */
+	unsigned long i_last_group;
+	unsigned long i_last_start;
 };
 
 /*
@@ -1242,9 +1246,6 @@ struct ext4_sb_info {
 	unsigned int s_mb_order2_reqs;
 	unsigned int s_mb_group_prealloc;
 	unsigned int s_max_dir_size_kb;
-	/* where last allocation was done - for stream allocation */
-	unsigned long s_mb_last_group;
-	unsigned long s_mb_last_start;
 
 	/* stats for buddy allocator */
 	atomic_t s_bal_reqs;	/* number of reqs with len > 1 */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0188e65..07d0434 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3702,6 +3702,10 @@ void ext4_truncate(struct inode *inode)
 	else
 		ext4_ind_truncate(handle, inode);
 
+	/* Invalidate last allocation counters */
+	ei->i_last_group = UINT_MAX;
+	ei->i_last_start = UINT_MAX;
+
 	up_write(&ei->i_data_sem);
 
 	if (IS_SYNC(inode))
@@ -4060,6 +4064,10 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
 	inode->i_generation = le32_to_cpu(raw_inode->i_generation);
 	ei->i_block_group = iloc.block_group;
 	ei->i_last_alloc_group = ~0;
+
+	/* Invalidate last allocation counters */
+	ei->i_last_group = UINT_MAX;
+	ei->i_last_start = UINT_MAX;
 	/*
 	 * NOTE! The in-memory inode i_data array is in little-endian order
 	 * even on big-endian machines: we do NOT byteswap the block numbers!
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index a9ff5e5..6c23666 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1591,7 +1591,6 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
 static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 					struct ext4_buddy *e4b)
 {
-	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	int ret;
 
 	BUG_ON(ac->ac_b_ex.fe_group != e4b->bd_group);
@@ -1622,10 +1621,8 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 	get_page(ac->ac_buddy_page);
 	/* store last allocated for subsequent stream allocation */
 	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
-		spin_lock(&sbi->s_md_lock);
-		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
-		sbi->s_mb_last_start = ac->ac_f_ex.fe_start;
-		spin_unlock(&sbi->s_md_lock);
+		EXT4_I(ac->ac_inode)->i_last_group = ac->ac_f_ex.fe_group;
+		EXT4_I(ac->ac_inode)->i_last_start = ac->ac_f_ex.fe_start;
 	}
 }
 
@@ -2080,13 +2077,12 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 		ac->ac_2order = i - 1;
 	}
 
-	/* if stream allocation is enabled, use global goal */
-	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
-		/* TBD: may be hot point */
-		spin_lock(&sbi->s_md_lock);
-		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
-		ac->ac_g_ex.fe_start = sbi->s_mb_last_start;
-		spin_unlock(&sbi->s_md_lock);
+	/* if stream allocation is enabled and per-inode goal is
+	 * set, use it */
+	if ((ac->ac_flags & EXT4_MB_STREAM_ALLOC) &&
+	    (EXT4_I(ac->ac_inode)->i_last_start != UINT_MAX)) {
+		ac->ac_g_ex.fe_group = EXT4_I(ac->ac_inode)->i_last_group;
+		ac->ac_g_ex.fe_start = EXT4_I(ac->ac_inode)->i_last_start;
 	}
 
 	/* Let's just scan groups to find more-less suitable blocks */
-- 
1.8.3.1
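To see where each rewrite pass actually lands on disk, one can dump the
file's physical extents. Here is a small sketch, not from the original
thread, using the standard FIEMAP ioctl (it takes the file name on the
command line and truncates the listing at 32 extents):

	/* Print the physical extents of a file via FIEMAP; with the old
	 * allocator the physical offsets keep growing across rewrite
	 * passes, with the patch they should stabilize. */
	#include <stdio.h>
	#include <stdlib.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>
	#include <linux/fiemap.h>

	int main(int argc, char **argv)
	{
		struct fiemap *fm;
		unsigned int i;
		int fd;

		if (argc != 2) {
			fprintf(stderr, "usage: %s <file>\n", argv[0]);
			return 1;
		}
		fd = open(argv[1], O_RDONLY);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		fm = calloc(1, sizeof(*fm) + 32 * sizeof(struct fiemap_extent));
		if (!fm)
			return 1;
		fm->fm_length = ~0ULL;		/* map the whole file */
		fm->fm_extent_count = 32;
		fm->fm_flags = FIEMAP_FLAG_SYNC;

		if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
			perror("FS_IOC_FIEMAP");
			return 1;
		}
		for (i = 0; i < fm->fm_mapped_extents; i++)
			printf("extent %u: physical %llu, length %llu\n", i,
			       (unsigned long long)fm->fm_extents[i].fe_physical,
			       (unsigned long long)fm->fm_extents[i].fe_length);
		return 0;
	}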
* Re: [RFC PATCH 1/1] ext4: Try to better reuse recently freed space
  2013-07-04 15:09 UTC, Jan Kara
  To: Lukas Czerner
  Cc: linux-ext4, enwlinux, Jose_Mario_Gallegos, jordan_hargrave, rwheeler

On Thu 04-07-13 11:11:54, Lukas Czerner wrote:
> Currently, if the block allocator cannot find the goal to allocate, we
> use a global goal for stream allocation. However, the global goal
> (s_mb_last_group and s_mb_last_start) moves further every time such an
> allocation happens and never moves backwards.
> [...]
> Fix this by changing the global goal into a per-inode goal. This allows
> us to invalidate the goal every time the inode is truncated or newly
> created, so in those cases we will try to use the proper, more specific
> goal, which is based on the inode's position.

  In principle the patch looks fine to me. Just a minor style nit below:

> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 6ed348d..4dffa92 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -917,6 +917,10 @@ struct ext4_inode_info {
>  
>  	/* Precomputed uuid+inum+igen checksum for seeding inode checksums */
>  	__u32 i_csum_seed;
> +
> +	/* where last allocation was done - for stream allocation */
> +	unsigned long i_last_group;
> +	unsigned long i_last_start;
  Can we use proper types here, please? I think they should be
ext4_grpblk_t and ext4_group_t. And maybe we should define something
like EXT4_INVAL_GRPNO and use that instead of UINT_MAX.

								Honza

> @@ -3702,6 +3702,10 @@ void ext4_truncate(struct inode *inode)
>  	else
>  		ext4_ind_truncate(handle, inode);
>  
> +	/* Invalidate last allocation counters */
> +	ei->i_last_group = UINT_MAX;
> +	ei->i_last_start = UINT_MAX;
> +
  And here as well.

> [rest of the quoted patch trimmed]
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [RFC PATCH 1/1] ext4: Try to better reuse recently freed space
  2013-07-04 15:32 UTC, Lukáš Czerner
  To: Jan Kara
  Cc: linux-ext4, enwlinux, Jose_Mario_Gallegos, jordan_hargrave, rwheeler

On Thu, 4 Jul 2013, Jan Kara wrote:
> On Thu 04-07-13 11:11:54, Lukas Czerner wrote:
> > [...]
> > +	/* where last allocation was done - for stream allocation */
> > +	unsigned long i_last_group;
> > +	unsigned long i_last_start;
>   Can we use proper types here, please? I think they should be
> ext4_grpblk_t and ext4_group_t. And maybe we should define something
> like EXT4_INVAL_GRPNO and use that instead of UINT_MAX.

Sure, it'll look better with named types and a defined invalid value.
I'll resend the patch soon.

Thanks!
-Lukas

> [rest of the quoted message trimmed]
* RE: [RFC PATCH 1/1] ext4: Try to better reuse recently freed space
  2013-07-04 16:06 UTC, Jose_Mario_Gallegos
  To: lczerner, jack
  Cc: linux-ext4, enwlinux, Jordan_Hargrave, rwheeler

Thank you very much for looking into this issue, Lukas!

Ric,

What are the odds of getting a patch as a hot fix or beta for a proof of
concept? The deadline is 7/15/2013, and this may eliminate the biggest
remaining issue.

Thanks again and best regards,
Mario

> -----Original Message-----
> From: Lukáš Czerner [mailto:lczerner@redhat.com]
> Sent: Thursday, July 04, 2013 10:33 AM
> To: Jan Kara
> Cc: linux-ext4@vger.kernel.org; enwlinux@gmail.com; Gallegos, Mario;
> Hargrave, Jordan; rwheeler@redhat.com
> Subject: Re: [RFC PATCH 1/1] ext4: Try to better reuse recently freed space
> [rest of the quoted message trimmed]
* [PATCH v2] ext4: Try to better reuse recently freed space
  2013-07-08  7:38 UTC, Lukas Czerner
  To: linux-ext4; Cc: jack, Lukas Czerner

Currently, if the block allocator cannot find the goal to allocate, we
use a global goal for stream allocation. However, the global goal
(s_mb_last_group and s_mb_last_start) moves further every time such an
allocation happens and never moves backwards.

This causes several problems in certain scenarios:

- The goal will move further and further, preventing us from reusing
  space which might have been freed since then. This is fine from the
  file system point of view, because we will reuse that space
  eventually; however, we are allocating blocks from slower parts of
  the spinning disk even though it might not be necessary.
- The above also causes a more serious problem, for example for thinly
  provisioned storage (sparse-image-backed storage as well), because
  instead of reusing blocks which are already provisioned we try to use
  new blocks. This unnecessarily drains the storage's free block pool.
- This also causes blocks to be allocated further from the given goal
  than necessary. Consider, for example, truncating, or removing and
  rewriting, a file in a loop. This workload will never reuse freed
  blocks until we have claimed and freed every block in the file
  system.

Note that file systems like xfs, ext3, and btrfs do not have this
problem. It is simply caused by the notion of a global goal.

Fix this by changing the global goal into a per-inode goal. This allows
us to invalidate the goal every time the inode is truncated or newly
created, so in those cases we will try to use the proper, more specific
goal, which is based on the inode's position.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
v2: Use proper types. Define an invalid value.
 fs/ext4/ext4.h    | 12 +++++++++---
 fs/ext4/inode.c   |  8 ++++++++
 fs/ext4/mballoc.c | 20 ++++++++------------
 3 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6ed348d..d3e9d10 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -917,9 +917,18 @@ struct ext4_inode_info {
 
 	/* Precomputed uuid+inum+igen checksum for seeding inode checksums */
 	__u32 i_csum_seed;
+
+	/* Where last allocation was done - for stream allocation */
+	ext4_group_t i_last_group;
+	ext4_grpblk_t i_last_start;
 };
 
 /*
+ * Invalid value for last allocation pointer
+ */
+#define EXT4_INVAL_GRPNO	UINT_MAX
+
+/*
  * File system states
  */
 #define	EXT4_VALID_FS			0x0001	/* Unmounted cleanly */
@@ -1242,9 +1251,6 @@ struct ext4_sb_info {
 	unsigned int s_mb_order2_reqs;
 	unsigned int s_mb_group_prealloc;
 	unsigned int s_max_dir_size_kb;
-	/* where last allocation was done - for stream allocation */
-	unsigned long s_mb_last_group;
-	unsigned long s_mb_last_start;
 
 	/* stats for buddy allocator */
 	atomic_t s_bal_reqs;	/* number of reqs with len > 1 */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0188e65..8d3e67c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3702,6 +3702,10 @@ void ext4_truncate(struct inode *inode)
 	else
 		ext4_ind_truncate(handle, inode);
 
+	/* Invalidate last allocation pointers */
+	ei->i_last_group = EXT4_INVAL_GRPNO;
+	ei->i_last_start = 0;
+
 	up_write(&ei->i_data_sem);
 
 	if (IS_SYNC(inode))
@@ -4060,6 +4064,10 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
 	inode->i_generation = le32_to_cpu(raw_inode->i_generation);
 	ei->i_block_group = iloc.block_group;
 	ei->i_last_alloc_group = ~0;
+
+	/* Invalidate last allocation counters */
+	ei->i_last_group = EXT4_INVAL_GRPNO;
+	ei->i_last_start = 0;
 	/*
 	 * NOTE! The in-memory inode i_data array is in little-endian order
 	 * even on big-endian machines: we do NOT byteswap the block numbers!
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index a9ff5e5..7609f77 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1591,7 +1591,6 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
 static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 					struct ext4_buddy *e4b)
 {
-	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	int ret;
 
 	BUG_ON(ac->ac_b_ex.fe_group != e4b->bd_group);
@@ -1622,10 +1621,8 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 	get_page(ac->ac_buddy_page);
 	/* store last allocated for subsequent stream allocation */
 	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
-		spin_lock(&sbi->s_md_lock);
-		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
-		sbi->s_mb_last_start = ac->ac_f_ex.fe_start;
-		spin_unlock(&sbi->s_md_lock);
+		EXT4_I(ac->ac_inode)->i_last_group = ac->ac_f_ex.fe_group;
+		EXT4_I(ac->ac_inode)->i_last_start = ac->ac_f_ex.fe_start;
 	}
 }
 
@@ -2080,13 +2077,12 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 		ac->ac_2order = i - 1;
 	}
 
-	/* if stream allocation is enabled, use global goal */
-	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
-		/* TBD: may be hot point */
-		spin_lock(&sbi->s_md_lock);
-		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
-		ac->ac_g_ex.fe_start = sbi->s_mb_last_start;
-		spin_unlock(&sbi->s_md_lock);
+	/* if stream allocation is enabled and per-inode goal is
+	 * set, use it */
+	if ((ac->ac_flags & EXT4_MB_STREAM_ALLOC) &&
+	    (EXT4_I(ac->ac_inode)->i_last_group != EXT4_INVAL_GRPNO)) {
+		ac->ac_g_ex.fe_group = EXT4_I(ac->ac_inode)->i_last_group;
+		ac->ac_g_ex.fe_start = EXT4_I(ac->ac_inode)->i_last_start;
 	}
 
 	/* Let's just scan groups to find more-less suitable blocks */
-- 
1.8.3.1
* Re: [PATCH v2] ext4: Try to better reuse recently freed space
  2013-07-08  8:56 UTC, Jan Kara
  To: Lukas Czerner
  Cc: linux-ext4, jack, Andreas Dilger

On Mon 08-07-13 09:38:27, Lukas Czerner wrote:
> Currently, if the block allocator cannot find the goal to allocate, we
> use a global goal for stream allocation. However, the global goal
> (s_mb_last_group and s_mb_last_start) moves further every time such an
> allocation happens and never moves backwards.
> [...]
> Fix this by changing the global goal into a per-inode goal. This allows
> us to invalidate the goal every time the inode is truncated or newly
> created, so in those cases we will try to use the proper, more specific
> goal, which is based on the inode's position.

  When looking at your patch for the second time, I started wondering
whether we need a per-inode stream goal at all. We already set a goal in
the allocation request for mballoc (ar->goal), e.g. in
ext4_ext_find_goal(). It seems strange to then reset it inside mballoc,
and I don't even think mballoc will change it to something else now that
the goal is per-inode and not global.

  Which makes me think again about why someone introduced a global goal
for stream allocations in the first place. Maybe the motivation was
that, at each moment, we will have only one area from which we allocate
streaming data (large chunks of blocks), which may presumably reduce
fragmentation. I've added Andreas to CC; maybe he'll remember the
purpose.

								Honza

> [rest of the quoted patch trimmed]
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [PATCH v2] ext4: Try to better reuse recently freed space
  2013-07-08  9:24 UTC, Lukáš Czerner
  To: Jan Kara
  Cc: linux-ext4, Andreas Dilger

On Mon, 8 Jul 2013, Jan Kara wrote:
> [...]
>   When looking at your patch for the second time, I started wondering
> whether we need a per-inode stream goal at all. We already set a goal in
> the allocation request for mballoc (ar->goal), e.g. in
> ext4_ext_find_goal(). It seems strange to then reset it inside mballoc,
> and I don't even think mballoc will change it to something else now that
> the goal is per-inode and not global.

Yes, we do set the goal in the allocation request, and it is supposed to
be the "best" goal. However, sometimes it cannot be fulfilled because we
do not have any free blocks at the "goal".

That is when the global (or per-inode) goal comes into play. I suppose
there were several reasons for it. First of all, it makes things easier
for the allocator, because it can jump directly to the point where we
allocated last time, and it is likely that there is some free space to
allocate from - so the benefit is that we do not have to walk all the
space in between, which is likely to be allocated. The second reason
might be to keep new allocations together. This is also what creates the
problem this patch is trying to fix.

We assume that when the allocation request goal cannot be fulfilled, all
the space between that goal and the global goal is allocated, so we jump
directly to the global goal. However, that might not be the case at all.

With a per-inode goal, we can jump directly to the point where we
allocated last time for the given inode and start searching for free
space from there in the case that the allocation request goal cannot be
fulfilled for some reason. We are also able to invalidate that goal when
we truncate the file, so we can start searching from the allocation
request goal again.

Obviously the per-inode goal does not have all the benefits of the
global goal, because we are less likely to have free space right after
the per-inode goal than after the global goal. But it does not have the
nasty side effects of touching all the blocks in the file system
unnecessarily, getting further and further from the allocation request
goal even when that is not needed, and unnecessarily pushing allocations
to the slower parts of the disk in some cases.

-Lukas

> [rest of the quoted message trimmed]
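The fallback order Lukas describes can be summarized in a short,
self-contained sketch. This is not kernel code: the types and the
pick_goal() helper are invented for illustration, with names chosen to
mirror the patch (INVAL_GRPNO stands in for EXT4_INVAL_GRPNO).

	#include <limits.h>

	#define INVAL_GRPNO	UINT_MAX	/* mirrors EXT4_INVAL_GRPNO */

	struct alloc_goal {
		unsigned int group;
		unsigned int start;
	};

	struct toy_inode {
		int stream_alloc;		/* EXT4_MB_STREAM_ALLOC set? */
		unsigned int last_group;	/* mirrors i_last_group */
		unsigned int last_start;	/* mirrors i_last_start */
	};

	/* 'best' is the layout-based goal from the allocation request
	 * (what would arrive in ar->goal, e.g. via ext4_ext_find_goal()). */
	static struct alloc_goal pick_goal(const struct toy_inode *ino,
					   struct alloc_goal best)
	{
		/* With the patch, the stream goal overrides the layout
		 * goal only when the inode has allocated before and has
		 * not been truncated since; a truncate resets last_group
		 * to INVAL_GRPNO, so the search restarts at the layout
		 * goal. */
		if (ino->stream_alloc && ino->last_group != INVAL_GRPNO) {
			best.group = ino->last_group;
			best.start = ino->last_start;
		}
		return best;
	}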
* Re: [PATCH v2] ext4: Try to better reuse recently freed space
  2013-07-08 11:59 UTC, Jan Kara
  To: Lukáš Czerner
  Cc: Jan Kara, linux-ext4, Andreas Dilger

On Mon 08-07-13 11:24:01, Lukáš Czerner wrote:
> [...]
> That is when the global (or per-inode) goal comes into play. I suppose
> there were several reasons for it. First of all, it makes things easier
> for the allocator, because it can jump directly to the point where we
> allocated last time, and it is likely that there is some free space to
> allocate from - so the benefit is that we do not have to walk all the
> space in between, which is likely to be allocated.

  Yep, but my question is: if we have a per-inode streaming goal, can
you show an example where the "best" goal will be different from the
"streaming" goal? Because from a (I admit rather quick) look at how each
of these is computed, it seems that both will point after the last
allocated block in the case of streaming IO.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
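One way to make the divergence concrete is a toy timeline of the
truncate-and-rewrite case. The block numbers below are invented purely
for illustration, and the divergence shown is against the old *global*
stream goal, which is what motivates the patch:

	#include <stdio.h>

	int main(void)
	{
		unsigned int best, stream;

		/* pass 1: a file is written at blocks 1000..2023 */
		best = 1000;	/* layout-based goal for this file */
		stream = 2024;	/* stream goal: just past the last
				 * allocation */

		/* the file is truncated: 1000..2023 are free again;
		 * the layout goal points back at 1000, but a global
		 * stream goal still points at 2024, so the rewrite
		 * marches forward instead of reusing the freed range;
		 * a per-inode goal is invalidated by the truncate and
		 * the two coincide again */
		printf("best goal = %u, stream goal = %u\n", best, stream);
		return 0;
	}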
* Re: [PATCH v2] ext4: Try to better reuse recently freed space
  2013-07-08 21:27 UTC, Andreas Dilger
  To: Jan Kara
  Cc: Lukáš Czerner, Alex Zhuravlev, linux-ext4@vger.kernel.org List

On 2013-07-08, at 5:59 AM, Jan Kara wrote:
> [...]
>   Yep, but my question is: if we have a per-inode streaming goal, can
> you show an example where the "best" goal will be different from the
> "streaming" goal? Because from a (I admit rather quick) look at how each
> of these is computed, it seems that both will point after the last
> allocated block in the case of streaming IO.

There were a few goals in the design of the mballoc allocator:

- keep large allocations sized and aligned on RAID chunks
- pack small allocations together, so they can fit into a single RAID
  chunk and avoid many read-modify-write cycles
- keep allocations relatively close together, so that seeking is
  minimized and the allocator doesn't search through fragmented free
  space

It was designed for high-bandwidth streaming IO, not for the small
random IO seen with VM images and thin-provisioned storage. That said,
with TRIM and sparse storage there is no guarantee that the offset
within the virtual device has any association with the real disk, so
I'm not sure it makes sense to optimize for this case.

The ext4 code already biases new allocations towards the beginning of
the disk, because this is how inodes are allocated and this influences
each inode's block allocation. The global target allows large
allocations to get out of the mess of fragmented free space and into
other parts of the filesystem that are not as heavily used.

Rather than tweaking the global target (which is currently just a
simple round-robin), it probably makes sense to look at a global extent
map of free space, so that it is easier to find large chunks of free
space. It is still desirable to keep these allocations closer together;
otherwise the "optimum" locations for several large allocations may be
on different ends of the disk, and the situation would be far worse
than making periodic writes over the whole disk.

Cheers, Andreas
* Re: [PATCH v2] ext4: Try to better reuse recently freed space 2013-07-08 21:27 ` Andreas Dilger @ 2013-07-10 11:30 ` Lukáš Czerner 0 siblings, 0 replies; 21+ messages in thread From: Lukáš Czerner @ 2013-07-10 11:30 UTC (permalink / raw) To: Andreas Dilger; +Cc: Jan Kara, Alex Zhuravlev, linux-ext4@vger.kernel.org List [-- Attachment #1: Type: TEXT/PLAIN, Size: 6399 bytes --] On Mon, 8 Jul 2013, Andreas Dilger wrote: > Date: Mon, 8 Jul 2013 15:27:21 -0600 > From: Andreas Dilger <adilger@dilger.ca> > To: Jan Kara <jack@suse.cz> > Cc: Lukáš Czerner <lczerner@redhat.com>, > Alex Zhuravlev <alexey.zhuravlev@intel.com>, > "linux-ext4@vger.kernel.org List" <linux-ext4@vger.kernel.org> > Subject: Re: [PATCH v2] ext4: Try to better reuse recently freed space > > On 2013-07-08, at 5:59 AM, Jan Kara wrote: > > On Mon 08-07-13 11:24:01, Lukáš Czerner wrote: > >> On Mon, 8 Jul 2013, Jan Kara wrote: > >>> On Mon, 8 Jul 2013, Jan Kara <jack@suse.cz> wrote: > >>> On Mon 08-07-13 09:38:27, Lukas Czerner wrote: > >>>> Currently if the block allocator can not find the goal to > >>>> allocate we would use global goal for stream allocation. > >>>> However the global goal (s_mb_last_group and s_mb_last_start) will move further every time such allocation appears and > >>>> never move backwards. > >>>> > >>>> This causes several problems in certain scenarios: > >>>> - the goal will move further and further preventing us from > >>>> reusing space which might have been freed since then. This > >>>> is ok from the file system point of view because we will > >>>> reuse that space eventually, however we're allocating block > >>>> from slower parts of the spinning disk even though it might > >>>> not be necessary. > >>>> - The above also causes more serious problem for example for > >>>> thinly provisioned storage (sparse images backed storage > >>>> as well), because instead of reusing blocks which are > >>>> already provisioned we would try to use new blocks. This > >>>> would unnecessarily drain storage free blocks pool. > >>>> - This will also cause blocks to be allocated further from > >>>> the given goal than it's necessary. Consider for example > >>>> truncating, or removing and rewriting the file in the loop. > >>>> This workload will never reuse freed blocks until we continually > >>>> claim and free all the block in the file system. > >>>> > >>>> Note that file systems like xfs, ext3, or btrfs does not have this problem. This is simply caused by the notion of global pool. > >>>> > >>>> Fix this by changing the global goal to be goal per inode. This will allow us to invalidate the goal every time the inode has > >>>> been truncated, or newly created, so in those cases we would try > >>>> to use the proper more specific goal which is based on inode > >>>> position. > >>> When looking at your patch for second time, I started wondering, > >>> whether we need per-inode stream goal at all. We already do set > >>> goal in the allocation request for mballoc (ar->goal) e.g. in > >>> ext4_ext_find_goal(). > >>> It seems strange to then reset it inside mballoc and I don't > >>> even think mballoc will change it to something else now when > >>> the goal is per-inode and not global. > >> > >> Yes, we do set the goal in the allocation request and it is supposed > >> to be the "best" goal. However sometimes it can not be fulfilled > >> because we do not have any free block at "goal". > >> > >> That's when the global (or per-inode) goal comes into play. I suppose > >> that there was several reasons for that. 
First of all it makes it > >> easier for allocator, because it can directly jump at the point > >> where we allocated last time and it is likely that there is some > >> free space to allocate from - so the benefit is that we do not have > >> to walk all the space in between which is likely to be allocated. > > > > Yep, but my question is: If we have per-inode streaming goal, can > > you show an example when the "best" goal will be different from the > > "streaming" goal? Because from a (I admit rather quick) look at how > > each of these is computed, it seems that both will point after the > > next allocated block in case of streaming IO. > > There were a few goals with the design of the mballoc allocator: > - keep large allocations sized and aligned on RAID chunks > - pack small allocations together, so they can fit into a > single RAID chunk and avoid many read-modify-write cycles > - keep allocations relatively close together, so that seeking is > minimized and it doesn't search through fragmented free space > > It was designed for high-bandwidth streaming IO, and not small random IO as is seen with VM images and thin-provisioned storage. That said, with TRIM and sparse storage, there is no guarantee that offset within the virtual device has any association with the real disk, so I'm not sure it makes sense to optimize for this case. However, it is used for small random IO as well. We're going to use EXT4_MB_STREAM_ALLOC every time the file, or the allocation, is bigger than sbi->s_mb_stream_request. Are you saying that it was not designed that way? Moreover, we do not really care about the association of the offset within the virtual device and the real disk. That's not the problem. The offset might only be a problem on a real disk, where access might get slower as we move further into the file system, but even that is questionable in some cases. However, with thin provisioning this might really be a problem, because we would "provision" many more blocks than the file system really needs: instead of reusing blocks which have been recently freed, we keep touching new ones. This will unnecessarily drain the thin-provisioning free space pool. > > The ext4 code will already bias new allocations towards the beginning of the disk because this is how inodes are allocated and this influences each inode's block allocation. The global target allows the large allocations to get out of the mess of fragmented free space and into other parts of the filesystem that are not as heavily used. > > Rather than tweaking the global target (which is currently just a simple round-robin), it probably makes sense to look at a global extent map for free space, so that it is easier to find large chunks of free space. It is still desirable to keep these allocations closer together, otherwise the "optimum" location for several large allocations may be on different ends of the disk and the situation would be far worse than making periodic writes over the whole disk. I agree that having a global extent map for free space would be useful, but that is a different, even though somewhat related, problem. Thanks! -Lukas > > Cheers, Andreas > > > > > > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2] ext4: Try to better reuse recently freed space 2013-07-08 11:59 ` Jan Kara 2013-07-08 21:27 ` Andreas Dilger @ 2013-07-10 11:18 ` Lukáš Czerner 1 sibling, 0 replies; 21+ messages in thread From: Lukáš Czerner @ 2013-07-10 11:18 UTC (permalink / raw) To: Jan Kara; +Cc: linux-ext4, Andreas Dilger [-- Attachment #1: Type: TEXT/PLAIN, Size: 4760 bytes --] On Mon, 8 Jul 2013, Jan Kara wrote: > Date: Mon, 8 Jul 2013 13:59:51 +0200 > From: Jan Kara <jack@suse.cz> > To: Lukáš Czerner <lczerner@redhat.com> > Cc: Jan Kara <jack@suse.cz>, linux-ext4@vger.kernel.org, > Andreas Dilger <adilger.kernel@dilger.ca> > Subject: Re: [PATCH v2] ext4: Try to better reuse recently freed space > > On Mon 08-07-13 11:24:01, Lukáš Czerner wrote: > > On Mon, 8 Jul 2013, Jan Kara wrote: > > > > > Date: Mon, 8 Jul 2013 10:56:03 +0200 > > > From: Jan Kara <jack@suse.cz> > > > To: Lukas Czerner <lczerner@redhat.com> > > > Cc: linux-ext4@vger.kernel.org, jack@suse.cz, > > > Andreas Dilger <adilger.kernel@dilger.ca> > > > Subject: Re: [PATCH v2] ext4: Try to better reuse recently freed space > > > > > > On Mon 08-07-13 09:38:27, Lukas Czerner wrote: > > > > Currently if the block allocator can not find the goal to allocate we > > > > would use global goal for stream allocation. However the global goal > > > > (s_mb_last_group and s_mb_last_start) will move further every time such > > > > allocation appears and never move backwards. > > > > > > > > This causes several problems in certain scenarios: > > > > > > > > - the goal will move further and further preventing us from reusing > > > > space which might have been freed since then. This is ok from the file > > > > system point of view because we will reuse that space eventually, > > > > however we're allocating block from slower parts of the spinning disk > > > > even though it might not be necessary. > > > > - The above also causes more serious problem for example for thinly > > > > provisioned storage (sparse images backed storage as well), because > > > > instead of reusing blocks which are already provisioned we would try > > > > to use new blocks. This would unnecessarily drain storage free blocks > > > > pool. > > > > - This will also cause blocks to be allocated further from the given > > > > goal than it's necessary. Consider for example truncating, or removing > > > > and rewriting the file in the loop. This workload will never reuse > > > > freed blocks until we continually claim and free all the block in the > > > > file system. > > > > > > > > Note that file systems like xfs, ext3, or btrfs does not have this > > > > problem. This is simply caused by the notion of global pool. > > > > > > > > Fix this by changing the global goal to be goal per inode. This will > > > > allow us to invalidate the goal every time the inode has been truncated, > > > > or newly created, so in those cases we would try to use the proper more > > > > specific goal which is based on inode position. > > > When looking at your patch for second time, I started wondering, whether > > > we need per-inode stream goal at all. We already do set goal in the > > > allocation request for mballoc (ar->goal) e.g. in ext4_ext_find_goal(). > > > It seems strange to then reset it inside mballoc and I don't even think > > > mballoc will change it to something else now when the goal is per-inode and > > > not global. > > > > Yes, we do set the goal in the allocation request and it is supposed > > to be the "best" goal. 
However sometimes it can not be fulfilled > > because we do not have any free block at "goal". > > > > That's when the global (or per-inode) goal comes into play. I suppose > > that there was several reasons for that. First of all it makes it > > easier for allocator, because it can directly jump at the point > > where we allocated last time and it is likely that there is some > > free space to allocate from - so the benefit is that we do not have > > to walk all the space in between which is likely to be allocated. > Yep, but my question is: If we have per-inode streaming goal, can you > show an example when the "best" goal will be different from the "streaming" > goal? Because from a (I admit rather quick) look at how each of these is > computed, it seems that both will point after the next allocated block in > case of streaming IO. EXT4_MB_STREAM_ALLOC or "streaming IO" is quite a misleading name for what we have in ext4. It simply means that the file (or the allocation) is bigger than a certain threshold. So I think one example would be writing in the middle of a sparse file when other processes might have already allocated the requested blocks. This might be the case for file system images, for example. Also, for some reason I am seeing this when writing into a file system image even though there are no other processes allocating from that file system. Simply hacking ext4 to print out the numbers when the goals differ and running xfstests shows that there are cases where they differ and where it helps to allocate from the per-inode goal. -Lukas > > Honza > ^ permalink raw reply [flat|nested] 21+ messages in thread
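For context, the "stream allocation" decision Lukáš describes is nothing more than a size check against a tunable threshold. Roughly, the logic looks like the following simplified sketch, modelled on ext4_mb_group_or_file() in fs/ext4/mballoc.c (the helper name and the exact size arithmetic here are illustrative, not the verbatim kernel code):

static void mb_choose_allocation_style(struct ext4_allocation_context *ac)
{
	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
	loff_t size;

	/* Consider both the request and the current file size, in blocks. */
	size = ac->ac_o_ex.fe_logical + ac->ac_o_ex.fe_len;
	size = max(size, i_size_read(ac->ac_inode) >> ac->ac_sb->s_blocksize_bits);

	/*
	 * "Stream" only means "bigger than s_mb_stream_request" (16 blocks
	 * by default, tunable via /sys/fs/ext4/<dev>/mb_stream_req); the IO
	 * pattern itself can be anything, including small random writes
	 * into a large sparse file.
	 */
	if (size > sbi->s_mb_stream_request)
		ac->ac_flags |= EXT4_MB_STREAM_ALLOC;
	else
		ac->ac_flags |= EXT4_MB_HINT_GROUP_ALLOC; /* pack small files */
}

This is why writes into a big file system image take the stream path, and thus the (pre-patch) global goal, even when they are small and random.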
* [RFC PATCH 1/1] ext4: Try to better reuse recently freed space @ 2013-12-02 16:32 Lukáš Czerner 2013-12-04 5:21 ` Theodore Ts'o 2014-04-01 5:30 ` Theodore Ts'o 0 siblings, 2 replies; 21+ messages in thread From: Lukáš Czerner @ 2013-12-02 16:32 UTC (permalink / raw) To: linux-ext4 Hi all, this is the patch I sent a while ago to fix the issue I've seen with a global allocation goal. This might no longer apply to the current kernel and it might not be the best approach, but I use this example just to start a discussion about those allocation goals and how to use, or change them. I think that we agree that the long term fix would be to have a free extent map. But we might be able to do something quickly, unless someone commits to make the free extent map a reality :) Thanks! -Lukas Currently if the block allocator can not find the goal to allocate we would use global goal for stream allocation. However the global goal (s_mb_last_group and s_mb_last_start) will move further every time such allocation appears and never move backwards. This causes several problems in certain scenarios: - the goal will move further and further preventing us from reusing space which might have been freed since then. This is ok from the file system point of view because we will reuse that space eventually, however we're allocating block from slower parts of the spinning disk even though it might not be necessary. - The above also causes more serious problem for example for thinly provisioned storage (sparse images backed storage as well), because instead of reusing blocks which are already provisioned we would try to use new blocks. This would unnecessarily drain storage free blocks pool. - This will also cause blocks to be allocated further from the given goal than it's necessary. Consider for example truncating, or removing and rewriting the file in the loop. This workload will never reuse freed blocks until we continually claim and free all the block in the file system. Note that file systems like xfs, ext3, or btrfs does not have this problem. This is simply caused by the notion of global pool. Fix this by changing the global goal to be goal per inode. This will allow us to invalidate the goal every time the inode has been truncated, or newly created, so in those cases we would try to use the proper more specific goal which is based on inode position. 
Signed-off-by: Lukas Czerner <lczerner@redhat.com> --- fs/ext4/ext4.h | 7 ++++--- fs/ext4/inode.c | 8 ++++++++ fs/ext4/mballoc.c | 20 ++++++++------------ 3 files changed, 20 insertions(+), 15 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 6ed348d..4dffa92 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -917,6 +917,10 @@ struct ext4_inode_info { /* Precomputed uuid+inum+igen checksum for seeding inode checksums */ __u32 i_csum_seed; + + /* where last allocation was done - for stream allocation */ + unsigned long i_last_group; + unsigned long i_last_start; }; /* @@ -1242,9 +1246,6 @@ struct ext4_sb_info { unsigned int s_mb_order2_reqs; unsigned int s_mb_group_prealloc; unsigned int s_max_dir_size_kb; - /* where last allocation was done - for stream allocation */ - unsigned long s_mb_last_group; - unsigned long s_mb_last_start; /* stats for buddy allocator */ atomic_t s_bal_reqs; /* number of reqs with len > 1 */ diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 0188e65..07d0434 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3702,6 +3702,10 @@ void ext4_truncate(struct inode *inode) else ext4_ind_truncate(handle, inode); + /* Invalidate last allocation counters */ + ei->i_last_group = UINT_MAX; + ei->i_last_start = UINT_MAX; + up_write(&ei->i_data_sem); if (IS_SYNC(inode)) @@ -4060,6 +4064,10 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino) inode->i_generation = le32_to_cpu(raw_inode->i_generation); ei->i_block_group = iloc.block_group; ei->i_last_alloc_group = ~0; + + /* Invalidate last allocation counters */ + ei->i_last_group = UINT_MAX; + ei->i_last_start = UINT_MAX; /* * NOTE! The in-memory inode i_data array is in little-endian order * even on big-endian machines: we do NOT byteswap the block numbers! 
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index a9ff5e5..6c23666 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -1591,7 +1591,6 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex) static void ext4_mb_use_best_found(struct ext4_allocation_context *ac, struct ext4_buddy *e4b) { - struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); int ret; BUG_ON(ac->ac_b_ex.fe_group != e4b->bd_group); @@ -1622,10 +1621,8 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac, get_page(ac->ac_buddy_page); /* store last allocated for subsequent stream allocation */ if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) { - spin_lock(&sbi->s_md_lock); - sbi->s_mb_last_group = ac->ac_f_ex.fe_group; - sbi->s_mb_last_start = ac->ac_f_ex.fe_start; - spin_unlock(&sbi->s_md_lock); + EXT4_I(ac->ac_inode)->i_last_group = ac->ac_f_ex.fe_group; + EXT4_I(ac->ac_inode)->i_last_start = ac->ac_f_ex.fe_start; } } @@ -2080,13 +2077,12 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) ac->ac_2order = i - 1; } - /* if stream allocation is enabled, use global goal */ - if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) { - /* TBD: may be hot point */ - spin_lock(&sbi->s_md_lock); - ac->ac_g_ex.fe_group = sbi->s_mb_last_group; - ac->ac_g_ex.fe_start = sbi->s_mb_last_start; - spin_unlock(&sbi->s_md_lock); + /* if stream allocation is enabled and per inode goal is + * set, use it */ + if ((ac->ac_flags & EXT4_MB_STREAM_ALLOC) && + (EXT4_I(ac->ac_inode)->i_last_start != UINT_MAX)) { + ac->ac_g_ex.fe_group = EXT4_I(ac->ac_inode)->i_last_group; + ac->ac_g_ex.fe_start = EXT4_I(ac->ac_inode)->i_last_start; } /* Let's just scan groups to find more-less suitable blocks */ -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 1/1] ext4: Try to better reuse recently freed space 2013-12-02 16:32 [RFC PATCH 1/1] " Lukáš Czerner @ 2013-12-04 5:21 ` Theodore Ts'o 2014-04-01 5:30 ` Theodore Ts'o 1 sibling, 0 replies; 21+ messages in thread From: Theodore Ts'o @ 2013-12-04 5:21 UTC (permalink / raw) To: Lukáš Czerner; +Cc: linux-ext4 On Mon, Dec 02, 2013 at 05:32:06PM +0100, Lukáš Czerner wrote: > Hi all, > > this is the patch I send a while ago to fix the issue I've seen with > a global allocation goal. This might no longer apply to the current > kernel and it might not be the best approach, but I use this example > just to start a discussion about those allocation goals and how to > use, or change them. > > I think that we agree that the long term fix would be to have free > extent map. But we might be able to do something quickly, unless > someone commits to make the free extent map reality :) There was discussion from the previous thread here: http://patchwork.ozlabs.org/patch/257476/ Looking at this, I think the patch makes sense, but I think it would be good if we started thinking about what would be some good benchmarks so we can measure and compare various allocation strategies. Part of it would certainly be how long it takes to write or fallocate the files, but also how fragmented the files are. We would also want some indication of how friendly the file system is to flash and thin-provisioned systems. (Where re-using blocks sooner rather than later might be helpful if the user is only running fstrim once a week or some such.) - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 21+ messages in thread
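A quick, scriptable proxy for "how fragmented the files are" is the per-file extent count reported by the FIEMAP ioctl (the same interface filefrag uses). A minimal userspace sketch, with error handling trimmed:

/* fragcount.c - print the extent count of each file given on the
 * command line, as a rough fragmentation metric. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

static int count_extents(const char *path)
{
	struct fiemap fm;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;

	memset(&fm, 0, sizeof(fm));
	fm.fm_length = ~0ULL;		/* map the whole file */
	fm.fm_flags = FIEMAP_FLAG_SYNC;	/* flush delalloc first */
	fm.fm_extent_count = 0;		/* 0 => only return the count */

	if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) {
		close(fd);
		return -1;
	}
	close(fd);
	return fm.fm_mapped_extents;
}

int main(int argc, char **argv)
{
	for (int i = 1; i < argc; i++)
		printf("%s: %d extents\n", argv[i], count_extents(argv[i]));
	return 0;
}

Summing or histogramming this over a whole tree after a benchmark run gives one of the "holistic" layout metrics discussed later in the thread; free-space fragmentation can be inspected separately via /proc/fs/ext4/<dev>/mb_groups.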
* Re: [RFC PATCH 1/1] ext4: Try to better reuse recently freed space 2013-12-02 16:32 [RFC PATCH 1/1] " Lukáš Czerner 2013-12-04 5:21 ` Theodore Ts'o @ 2014-04-01 5:30 ` Theodore Ts'o 2014-04-01 11:44 ` Dmitry Monakhov ` (3 more replies) 1 sibling, 4 replies; 21+ messages in thread From: Theodore Ts'o @ 2014-04-01 5:30 UTC (permalink / raw) To: Andreas Dilger; +Cc: Lukáš Czerner, linux-ext4 On Mon, Dec 02, 2013 at 05:32:06PM +0100, Lukáš Czerner wrote: > Hi all, > > this is the patch I send a while ago to fix the issue I've seen with > a global allocation goal. This might no longer apply to the current > kernel and it might not be the best approach, but I use this example > just to start a discussion about those allocation goals and how to > use, or change them. > > I think that we agree that the long term fix would be to have free > extent map. But we might be able to do something quickly, unless > someone commits to make the free extent map reality :) Hi Andreas, We discussed possibly applying this patch last week at the ext4 workshop, possibly as early as for the 3.15 merge window: http://patchwork.ozlabs.org/patch/295956/ However, I'm guessing that Lustre has the workload which is most likely to regress if we were to simply apply this patch. But, it's likely it will improve things for many/most other ext4 workloads. We did talk about trying to assemble some block allocation performance tests so we can better measure proposed changes to the block allocator, but that's not something we have yet. However, this global goal is definitely causing problems for a number of use cases, including thinp and being flash friendly. Would you be willing to apply this patch and then run some benchmarks to see if Lustre would be impacted negatively if we were to apply this patch for the next development cycle (i.e., not for 3.15, but for the next merge window)? Thanks, - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 1/1] ext4: Try to better reuse recently freed space 2014-04-01 5:30 ` Theodore Ts'o @ 2014-04-01 11:44 ` Dmitry Monakhov 2014-04-01 14:47 ` Theodore Ts'o 2014-04-07 18:22 ` Andreas Dilger ` (2 subsequent siblings) 3 siblings, 1 reply; 21+ messages in thread From: Dmitry Monakhov @ 2014-04-01 11:44 UTC (permalink / raw) To: Theodore Ts'o, Andreas Dilger; +Cc: Lukáš Czerner, linux-ext4 On Tue, 1 Apr 2014 01:30:18 -0400, Theodore Ts'o <tytso@mit.edu> wrote: > On Mon, Dec 02, 2013 at 05:32:06PM +0100, Lukáš Czerner wrote: > > Hi all, > > > > this is the patch I send a while ago to fix the issue I've seen with > > a global allocation goal. This might no longer apply to the current > > kernel and it might not be the best approach, but I use this example > > just to start a discussion about those allocation goals and how to > > use, or change them. > > > > I think that we agree that the long term fix would be to have free > > extent map. But we might be able to do something quickly, unless > > someone commits to make the free extent map reality :) > > Hi Andreas, > > We discussed possibly applying this patch last week at the ext4 > workshop, possibly as early as for the 3.15 merge window: > > http://patchwork.ozlabs.org/patch/295956/ > > However, I'm guessing that Lustre has the workload which is most > likely to regress if we were to simply apply this patch. But, it's > likely it will improve things for many/most other ext4 workloads. > > We did talk about trying to assemble some block allocation performance > tests so we can better measure proposed changes to the block > allocator, but that's not something we have yet. However, this global BTW, where can I find this discussion? I would like to cooperate in that activity. Please CC me next time you discuss allocation performance measurements. At Parallels we run https://oss.oracle.com/~mason/compilebench/ as a load simulator. > goal is definitely causing problems for a number of use cases, > including thinp and being flash friendly. > > Would you be willing to apply this patch and then run some benchmarks > to see if Lustre would be impacted negatively if we were to apply this > patch for the next development cycle (i.e., not for 3.15, but for the > next merge window)? > > Thanks, > > - Ted > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 1/1] ext4: Try to better reuse recently freed space 2014-04-01 11:44 ` Dmitry Monakhov @ 2014-04-01 14:47 ` Theodore Ts'o 2014-04-01 16:35 ` Dmitry Monakhov 0 siblings, 1 reply; 21+ messages in thread From: Theodore Ts'o @ 2014-04-01 14:47 UTC (permalink / raw) To: Dmitry Monakhov; +Cc: Andreas Dilger, Lukáš Czerner, linux-ext4 On Tue, Apr 01, 2014 at 03:44:53PM +0400, Dmitry Monakhov wrote: > BTW where this can I find this discussion? I would like to cooperate > this that activity. Please CC me next time you will disscuss allocation > performance mesurments. At Parallels we run https://oss.oracle.com/~mason/compilebench/ > as load simulator. The discussion happened at the Ext4 developer's get together in Napa, California, colocated with the LSF/MM and the Collaboration Summit. You should go next year; it was a huge amount of fun, and there were a bunch of other Parallels people there who can tell you about the reception at the Jacuzzi Family Winery, etc. :-) I suspect there will be some future conversations at our weekly conference calls. Typically design stuff will happen there, but technical low-level details about things like patches will happen on the mailing list, so you'll be alerted when we start having specific patches to evaluate and as we start putting together a set of allocation benchmarks. If you are interested in participating on the conference calls, contact me off-line. If the current time (8AM US/Pacific ; 11 AM US/Eastern) isn't good for you, we can try to see if another time works for everyone. One of the discussion points that came up last week is that it would be good if we can come up with allocation tests that are fast to run. That might mean (for example) taking a workload such as compilebench, and changing it to use fallocate() or having a mount option which causes the actual data path writes to be skipped for files. We would then need to have some kind of metric to evaluate how "good" a particular file system layout ends up being at the end of the workload. Not just for a specific file, but for all of the files in some kind of holistic measurement of "goodness", as well as looking at how fragmented the free space ended up being. Exactly how we do this is still something that we need to figure out; if you have any suggestions, they would be most welcome! Cheers, - Ted ^ permalink raw reply [flat|nested] 21+ messages in thread
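Ted's "fast to run" idea is straightforward to prototype: use fallocate() so the allocator is exercised without paying for the data writes. A hypothetical harness (file names and sizes are made up for illustration, not an existing benchmark) that allocates a set of files and then frees only every other one — the allocate-then-partially-free pattern Andreas asks for in the next message — leaves behind exactly the fragmented free space a second pass has to cope with:

/* alloc_free.c - hypothetical allocator stress sketch:
 * fallocate() many files, then delete only half of them. */
#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#define NFILES 1000
#define FSIZE  (8 << 20)	/* 8 MiB per file */

int main(void)
{
	char name[64];

	for (int i = 0; i < NFILES; i++) {
		snprintf(name, sizeof(name), "f%04d", i);
		int fd = open(name, O_CREAT | O_WRONLY, 0644);
		if (fd < 0 || fallocate(fd, 0, 0, FSIZE) < 0) {
			perror(name);
			return 1;
		}
		close(fd);
	}

	/* Free only part of the space: delete every other file. */
	for (int i = 0; i < NFILES; i += 2) {
		snprintf(name, sizeof(name), "f%04d", i);
		unlink(name);
	}
	return 0;
}

Running a second allocation pass afterwards and comparing extent counts (e.g. with the FIEMAP counter sketched above) would show whether the allocator dives back into the freed holes or, with the old global goal, marches past them.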
* Re: [RFC PATCH 1/1] ext4: Try to better reuse recently freed space 2014-04-01 14:47 ` Theodore Ts'o @ 2014-04-01 16:35 ` Dmitry Monakhov 0 siblings, 0 replies; 21+ messages in thread From: Dmitry Monakhov @ 2014-04-01 16:35 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Andreas Dilger, Lukáš Czerner, linux-ext4 On Tue, 1 Apr 2014 10:47:08 -0400, "Theodore Ts'o" <tytso@mit.edu> wrote: > On Tue, Apr 01, 2014 at 03:44:53PM +0400, Dmitry Monakhov wrote: > > BTW where this can I find this discussion? I would like to cooperate > > this that activity. Please CC me next time you will disscuss allocation > > performance mesurments. At Parallels we run https://oss.oracle.com/~mason/compilebench/ > > as load simulator. > > The discussion happened at the Ext4 developer's get together in Napa, > California, colocated with the LSF/MM and the Collaboration Summit. > You should go next year; it was a huge amount of fun, and there were a > bunch of other Parallels people there who can tell you about the > reception at the Jacuzzi Family Winery, etc. :-) Hm... the truth is that I was there. I am the one who asked your opinion about mfsync (multi-file fsync), remember :) But I probably simply missed the allocation topic. > > I suspect there will be some future conversations at our weekly > conference calls. Typically design stuff will happen there, but > technical low-level details about things like patches will happen on > the mailing list, so you'll be alerted when we start having specific > patches to evaluate and as we start putting together a set of > allocation benchmarks. > > If you are interested in participating on the conference calls, > contact me off-line. If the current time (8AM US/Pacific ; 11 AM > US/Eastern) isn't good for you, we can try to see if another time > works for everyone. Yes, it would be nice. Please invite me to the next call. > > One of the discussion points that came up last week is that it would > be good if we can come up with allocation tests that are fast to run. > That might mean (for example) taking a workload such as compilebench, > and changing it to use fallocate() or having a mount option which > causes the actual data path writes to be skipped for files. We would > then need to have some kind of metric to evaluate how "good" a > particular file system layout ends up being at the end of the > workload. Not just for a specific file, but for all of the files in > some kind of holistic measurement of "goodness", as well as looking at > how fragmented the free space ended up being. Exactly how we do this > is still something that we need to figure out; if you have any > suggestions, they would be most welcome! > > Cheers, > > - Ted ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 1/1] ext4: Try to better reuse recently freed space 2014-04-01 5:30 ` Theodore Ts'o 2014-04-01 11:44 ` Dmitry Monakhov @ 2014-04-07 18:22 ` Andreas Dilger 2014-04-07 20:01 ` Andreas Dilger 2014-04-08 1:14 ` Andreas Dilger 3 siblings, 0 replies; 21+ messages in thread From: Andreas Dilger @ 2014-04-07 18:22 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Lukáš Czerner, linux-ext4 [-- Attachment #1: Type: text/plain, Size: 4660 bytes --] On Mar 31, 2014, at 11:30 PM, Theodore Ts'o <tytso@mit.edu> wrote: > On Mon, Dec 02, 2013 at 05:32:06PM +0100, Lukáš Czerner wrote: >> Hi all, >> >> this is the patch I send a while ago to fix the issue I've seen with >> a global allocation goal. This might no longer apply to the current >> kernel and it might not be the best approach, but I use this example >> just to start a discussion about those allocation goals and how to >> use, or change them. >> >> I think that we agree that the long term fix would be to have free >> extent map. But we might be able to do something quickly, unless >> someone commits to make the free extent map reality :) > > Hi Andreas, > > We discussed possibly applying this patch last week at the ext4 > workshop, possibly as early as for the 3.15 merge window: > > http://patchwork.ozlabs.org/patch/295956/ Sorry for the delay in replying, I was away the past couple of weeks on vacation, and am heading this week to the Lustre conference. > However, I'm guessing that Lustre has the workload which is most > likely to regress if we were to simply apply this patch. But, it's > likely it will improve things for many/most other ext4 workloads. I definitely agree that this will help with SSD storage, and thinp workloads that end up doing random IO to the underlying storage. The main reason for using the global goal is to maximize the chance of getting large contiguous allocations, since it does a round-robin traversal of the groups and maximizes the time that other blocks can be freed in each group. It also minimizes the churn of allocating large files in groups that may not have large contiguous extents (e.g. if small files/extents have just been freed in a group, but not all of the blocks in that group are freed). The global goal avoids not only the higher chance of finding free small chunks for the file, but also the CPU overhead of re-scanning groups that are partially full to find large free extents. > We did talk about trying to assemble some block allocation performance > tests so we can better measure proposed changes to the block > allocator, but that's not something we have yet. However, this global > goal is definitely causing problems for a number of use cases, > including thinp and being flash friendly. An important question is whether thinp and flash are the main uses for ext4? I could imagine ignoring the global goal if rotational == 0, or if some large fraction of the filesystem was just deleted, but this change means that each inode will have to re-scan the free groups to find large extents and I think this will have a long-term negative impact on file extent size. Note that the global goal is only used if the per-inode goal is already full (i.e. ext4_mb_regular_allocator->ext4_mb_find_by_goal() fails), so it will need to find a new allocation every time. I think typical benchmarks that allocate and then free all space in a new filesystem will not exercise this code properly. Something that allocates space but then only frees some of it, preferably multi-threaded. 
> Would you be willing to apply this patch and then run some benchmarks > to see if Lustre would be impacted negatively if we were to apply this > patch for the next development cycle (i.e., not for 3.15, but for the > next merge window)? Since most of the Lustre developers are travelling this week, it will be hard to get this patch suitably tested quickly. Also, Lustre servers are not running on 3.14 kernels yet, so the patch would need to be backported to a kernel that the Lustre server is currently running on. Some minor notes about the patch itself: - if there is a desire to land this patch quickly, it would be great to have this behaviour at least selectable via a mount option and/or a /sys/fs tuneable that turns it off and on. That would allow the patch to land, simplify testing, and let us disable it if it later shows regressions. - the i_last_group and i_last_start are unsigned long, but only need to be unsigned int like fe_group (ext4_group_t) and fe_start (ext4_grpblk_t). That saves 8 bytes per inode; the extra size didn't matter when these fields lived in the superblock. - there is no locking for the update of i_last_{group,start} if there are parallel writers on the same inode - rather than dropping the global goal entirely, it would be better to have a list of groups that have a relatively large amount of free space. I know we also discussed using an rbtree to locate free extents instead of per-group buddy bitmaps. That would probably avoid the need for the global goal. Cheers, Andreas [-- Attachment #2: Message signed with OpenPGP using GPGMail --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH 1/1] ext4: Try to better reuse recently freed space 2014-04-01 5:30 ` Theodore Ts'o ` (2 preceding siblings ...) 2014-04-07 20:01 ` Andreas Dilger @ 2014-04-08 1:14 ` Andreas Dilger 3 siblings, 0 replies; 21+ messages in thread From: Andreas Dilger @ 2014-04-08 1:14 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Lukáš Czerner, linux-ext4 [-- Attachment #1: Type: text/plain, Size: 4514 bytes --] On Mar 31, 2014, at 11:30 PM, Theodore Ts'o <tytso@mit.edu> wrote: > On Mon, Dec 02, 2013 at 05:32:06PM +0100, Lukáš Czerner wrote: >> Hi all, >> >> this is the patch I send a while ago to fix the issue I've seen with >> a global allocation goal. This might no longer apply to the current >> kernel and it might not be the best approach, but I use this example >> just to start a discussion about those allocation goals and how to >> use, or change them. >> >> I think that we agree that the long term fix would be to have free >> extent map. But we might be able to do something quickly, unless >> someone commits to make the free extent map reality :) > > Hi Andreas, > > We discussed possibly applying this patch last week at the ext4 > workshop, possibly as early as for the 3.15 merge window: > > http://patchwork.ozlabs.org/patch/295956/ Sorry for the delay in replying, I was away the past couple of weeks on vacation, and am heading this week to the Lustre conference. > However, I'm guessing that Lustre has the workload which is most > likely to regress if we were to simply apply this patch. But, it's > likely it will improve things for many/most other ext4 workloads. I definitely agree that this will help with SSD storage, and thinp workloads that end up doing random IO to the underlying storage. The main reason for using the global goal is to maximize the chance of getting large contiguous allocations, since it does a round-robin traversal of the groups and maximizes the time that other blocks can be freed in each group. It also minimizes the churn of allocating large files in groups that may not have large contiguous extents (e.g. if small files/extents have just been freed in a group, but not all of the blocks in that group are freed). The global goal avoids not only the higher chance of finding free small chunks for the file, but also the CPU overhead of re-scanning groups that are partially full to find large free extents. > We did talk about trying to assemble some block allocation performance > tests so we can better measure proposed changes to the block > allocator, but that's not something we have yet. However, this global > goal is definitely causing problems for a number of use cases, > including thinp and being flash friendly. An important question is whether thinp and flash are the main uses for ext4? I could imagine ignoring the global goal if rotational == 0, or if some large fraction of the filesystem was just deleted, but this change means that each inode will have to re-scan the free groups to find large extents and I think this will have a long-term negative impact on file extent size. Note that the global goal is only used if the per-inode goal is already full (i.e. ext4_mb_regular_allocator->ext4_mb_find_by_goal() fails), so it will need to find a new allocation every time. I think typical benchmarks that allocate and then free all space in a new filesystem will not exercise this code properly. Something that allocates space but then only frees some of it, preferably multi-threaded. 
> Would you be willing to apply this patch and then run some benchmarks > to see if Lustre would be impacted negatively if we were to apply this > patch for the next development cycle (i.e., not for 3.15, but for the > next merge window)? Since most of the Lustre developers are travelling this week, it will be hard to get this patch suitably tested quickly. Also, Lustre servers are not running on 3.14 kernels yet, so the patch would need to be backported to a kernel that the Lustre server is currently running on. Some minor notes about the patch itself: - if there is a desire to land this patch quickly, it would be great to have this behaviour at least selectable via a mount option and/or a /sys/fs tuneable that turns it off and on. That would allow the patch to land even if we haven't finished testing, and to be disabled if it later shows regressions. - the i_last_group and i_last_start are unsigned long, but only need to be unsigned int like fe_group (ext4_group_t) and fe_start (ext4_grpblk_t). That saves 8 bytes per inode; the extra size didn't matter when these fields lived in the superblock. - there is no locking for the update of i_last_{group,start} - rather than dropping the global goal entirely, it would be better to have a list of groups that have a relatively large amount of free space. I know we also discussed using an rbtree to locate the free extents. Cheers, Andreas [-- Attachment #2: Message signed with OpenPGP using GPGMail --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
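To make the locking note concrete: the race is a reader pairing an i_last_group from one allocation with an i_last_start from another. One possible shape of a fix — purely an illustrative sketch against the patch above, not a tested change, with the fields narrowed to unsigned int as the review suggests and a hypothetical i_goal_lock added — is to guard the pair so readers never see a torn combination:

/* Illustrative only: serialize the per-inode goal pair. */
struct ext4_inode_info {
	/* ... existing fields ... */
	spinlock_t	i_goal_lock;	/* protects i_last_group/i_last_start */
	unsigned int	i_last_group;
	unsigned int	i_last_start;	/* UINT_MAX == no goal cached */
};

static void ext4_mb_save_inode_goal(struct ext4_inode_info *ei,
				    unsigned int group, unsigned int start)
{
	spin_lock(&ei->i_goal_lock);
	ei->i_last_group = group;
	ei->i_last_start = start;
	spin_unlock(&ei->i_goal_lock);
}

/* Returns nonzero and fills *group/*start if a valid goal is cached. */
static int ext4_mb_load_inode_goal(struct ext4_inode_info *ei,
				   unsigned int *group, unsigned int *start)
{
	int valid;

	spin_lock(&ei->i_goal_lock);
	*group = ei->i_last_group;
	*start = ei->i_last_start;
	valid = (*start != UINT_MAX);
	spin_unlock(&ei->i_goal_lock);
	return valid;
}

spin_lock_init() for the new lock would go next to the other per-inode initializers (e.g. in ext4_alloc_inode()); whether the extra lock is worth it, versus tolerating an occasionally torn but quickly overwritten goal, is exactly the kind of trade-off the allocation benchmarks discussed earlier in the thread should settle.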
Thread overview: 21+ messages 2013-07-04 9:11 [RFC PATCH 0/1] ext4: Try to better reuse recently freed space Lukas Czerner 2013-07-04 9:11 ` [RFC PATCH 1/1] " Lukas Czerner 2013-07-04 15:09 ` Jan Kara 2013-07-04 15:32 ` Lukáš Czerner 2013-07-04 16:06 ` Jose_Mario_Gallegos 2013-07-08 7:38 ` [PATCH v2] " Lukas Czerner 2013-07-08 8:56 ` Jan Kara 2013-07-08 9:24 ` Lukáš Czerner 2013-07-08 11:59 ` Jan Kara 2013-07-08 21:27 ` Andreas Dilger 2013-07-10 11:30 ` Lukáš Czerner 2013-07-10 11:18 ` Lukáš Czerner -- strict thread matches above, loose matches on Subject: below -- 2013-12-02 16:32 [RFC PATCH 1/1] " Lukáš Czerner 2013-12-04 5:21 ` Theodore Ts'o 2014-04-01 5:30 ` Theodore Ts'o 2014-04-01 11:44 ` Dmitry Monakhov 2014-04-01 14:47 ` Theodore Ts'o 2014-04-01 16:35 ` Dmitry Monakhov 2014-04-07 18:22 ` Andreas Dilger 2014-04-07 20:01 ` Andreas Dilger 2014-04-08 1:14 ` Andreas Dilger