public inbox for linux-ext4@vger.kernel.org
From: Badari Pulavarty <pbadari@gmail.com>
To: cmm@us.ibm.com
Cc: Jan Kara <jack@suse.cz>, linux-ext4@vger.kernel.org, sandeen@redhat.com
Subject: Re: Delayed allocation and page_lock vs transaction start ordering
Date: Wed, 16 Apr 2008 12:55:18 -0700
Message-ID: <1208375718.17986.7.camel@badari-desktop>
In-Reply-To: <1208370260.3603.4.camel@localhost.localdomain>


On Wed, 2008-04-16 at 11:24 -0700, Mingming Cao wrote:
> On Wed, 2008-04-16 at 12:35 +0200, Jan Kara wrote:
> > On Tue 15-04-08 16:33:17, Mingming Cao wrote:
> > > On Tue, 2008-04-15 at 16:28 -0700, Mingming Cao wrote:
> > > > On Tue, 2008-04-15 at 11:08 -0700, Mingming Cao wrote:
> > > > > On Tue, 2008-04-15 at 18:14 +0200, Jan Kara wrote:
> > > > > >   Hi,
> > > > > > 
> > > > > >   I've ported my patch inverting the lock ordering of page_lock and
> > > > > > transaction start to ext4 (on top of ext4 patch queue). Everything except
> > > > > > delayed allocation is converted (the patch is below for interested
> > > > > > readers). The question is how to proceed with delayed allocation. Its
> > > > > > current implementation in VFS is designed to work well with the old
> > > > > > ordering (page lock first, then start a transaction). We could bend it to
> > > > > > work with the new locking ordering but I really see no point since ext4 is
> > > > > > the only user. 
> > > > > 
> > > > > I think the plan is to port the changes to ext2/3/JFS and support delayed
> > > > > allocation on those filesystems.
> > > > > 
> > > > > > Also XFS has AFAIK ordering first start transaction, then
> > > > > > lock pages so if we should ever merge delayed alloc implementations the new
> > > > > > ordering would make it easier.
> > > > > >   So what do people think here? Do you agree with reimplementing current
> > > > > > mpage_da_... functions?
> > > > > 
> > > > > It is worth a try, but I could not see how to bend delayed allocation to
> > > > > work with the new ordering :( With delayed allocation, ext4 gets into
> > > > > writepage() directly with the page locked, but we need to start a
> > > > > transaction to do block allocation... :(
> > > > 
> > > > Looked again; it seems possible to reverse the order with delayed
> > > > allocation. With ext4_da_writepages() we could start the journal before
> > > > calling mpage_da_writepages() (which will lock the pages), instead of
> > > > starting the journal inside ext4_da_get_block_write(). That way we get
> > > > the locking order right. We just need to take care to get the estimated
> > > > credits right.
> > > > 
> > > > How about this? (untested, just thrown out for comment)
> > > 
> > > Seems I sent out an old version; this version compiles.
> >   Thanks for the patch. Some comments are below.
> > 
> > > ---
> > >  fs/ext4/inode.c |   53 ++++++++++++++++++++++++++++++++++++++++-------------
> > >  1 file changed, 40 insertions(+), 13 deletions(-)
> > > 
> > > Index: linux-2.6.25-rc9/fs/ext4/inode.c
> > > ===================================================================
> > > --- linux-2.6.25-rc9.orig/fs/ext4/inode.c	2008-04-15 15:40:33.000000000 -0700
> > > +++ linux-2.6.25-rc9/fs/ext4/inode.c	2008-04-15 16:32:10.000000000 -0700
> > > @@ -1437,18 +1437,12 @@ static int ext4_da_get_block_prep(struct
> > >  static int ext4_da_get_block_write(struct inode *inode, sector_t iblock,
> > >  				   struct buffer_head *bh_result, int create)
> > >  {
> > > -	int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
> > > +	int ret;
> > >  	unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
> > >  	loff_t disksize = EXT4_I(inode)->i_disksize;
> > >  	handle_t *handle = NULL;
> > >  
> > > -	if (create) {
> > > -		handle = ext4_journal_start(inode, needed_blocks);
> > > -		if (IS_ERR(handle)) {
> > > -			ret = PTR_ERR(handle);
> > > -			goto out;
> > > -		}
> > > -	}
> > > +	handle = ext4_journal_current_handle();
> >   Maybe we could assert that handle != NULL? When using delayed allocation,
> > a transaction should always be started.
> > 
> Agreed.
> 
> > >  	ret = ext4_get_blocks_wrap(handle, inode, iblock, max_blocks,
> > >  				   bh_result, create, 0);
> > > @@ -1483,17 +1477,51 @@ static int ext4_da_get_block_write(struc
> > >  		ret = 0;
> > >  	}
> > >  
> > > -out:
> > > -	if (handle && !IS_ERR(handle))
> > > -		ext4_journal_stop(handle);
> > > -
> > >  	return ret;
> > >  }
> > >  
> > > +/*
> > > + * For now just follow the DIO way to estimate the max credits
> > > + * needed to write out EXT4_MAX_BUF_BLOCKS pages.
> > > + * todo: need to calculate the max credits need for
> > > + * extent based files, currently the DIO credits is based on
> > > + * indirect-blocks mapping way.
> > > + *
> > > + * Probably should have a generic way to calculate credits
> > > + * for DIO, writepages, and truncate
> > > + */
> > > +#define EXT4_MAX_BUF_BLOCKS	DIO_MAX_BLOCKS
> > > +#define EXT4_MAX_BUF_CREDITS	DIO_CREDITS
> > > +
> > >  static int ext4_da_writepages(struct address_space *mapping,
> > >  				struct writeback_control *wbc)
> > >  {
> > > -	return mpage_da_writepages(mapping, wbc, ext4_da_get_block_write);
> > > +	struct inode *inode = mapping->host;
> > > +	handle_t *handle = NULL;
> > > +	int needed_blocks;
> > > +	int ret;
> > > +
> > > +	/*
> > > +	 * Estimate the worse case needed credits to write out
> > > +	 * EXT4_MAX_BUF_BLOCKS pages
> > > +	 */
> > > +	needed_blocks = EXT4_MAX_BUF_CREDITS;
> > > +
> > > +	/* start the transaction with credits*/
> > > +	handle = ext4_journal_start(inode, needed_blocks);
> > > +	if (IS_ERR(handle)) {
> > > +		ret = PTR_ERR(handle);
> > > +		return ret;
> > > +	}
> > > +
> > > +	/* set the max pages could be write-out at a time */
> > > +	wbc->range_end = wbc->range_start +
> > > +			EXT4_MAX_BUF_BLOCKS << PAGE_CACHE_SHIFT - 1;
> >   I think limiting mpage_da_writepages through nr_to_write is better than
> > through range_end. That way you don't count clean pages...
> > 
> 
> You are right. 
> 
> > > +
> > > +	ret = mpage_da_writepages(mapping, wbc, ext4_da_get_block_write);
> > > +	ext4_journal_stop(handle);
> >   But here we can't just stop. We have to write everything the original
> > caller asked for (at least in WB_SYNC_ALL mode). But the question is where
> > to resume, because scanning the whole range again is kind of excessive and
> > prone to livelock with another process dirtying the file via mmap. Maybe if
> > we slightly modified write_cache_pages() to always store in writeback_index
> > where it finished, we could use this value.
> 
> Thanks for pointing this out.
> How about this? 
> ---
>  fs/ext4/inode.c     |   70 ++++++++++++++++++++++++++++++++++++++++++----------
>  mm/page-writeback.c |    2 -
>  2 files changed, 58 insertions(+), 14 deletions(-)
> 
> Index: linux-2.6.25-rc9/fs/ext4/inode.c
> ===================================================================
> --- linux-2.6.25-rc9.orig/fs/ext4/inode.c	2008-04-16 09:59:00.000000000 -0700
> +++ linux-2.6.25-rc9/fs/ext4/inode.c	2008-04-16 11:23:12.000000000 -0700
> @@ -1437,18 +1437,13 @@ static int ext4_da_get_block_prep(struct
>  static int ext4_da_get_block_write(struct inode *inode, sector_t iblock,
>  				   struct buffer_head *bh_result, int create)
>  {
> -	int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
> +	int ret;
>  	unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
>  	loff_t disksize = EXT4_I(inode)->i_disksize;
>  	handle_t *handle = NULL;
>  
> -	if (create) {
> -		handle = ext4_journal_start(inode, needed_blocks);
> -		if (IS_ERR(handle)) {
> -			ret = PTR_ERR(handle);
> -			goto out;
> -		}
> -	}
> +	handle = ext4_journal_current_handle();
> +	J_ASSERT(handle != NULL || create == 0);
>  
>  	ret = ext4_get_blocks_wrap(handle, inode, iblock, max_blocks,
>  				   bh_result, create, 0);
> @@ -1483,17 +1478,66 @@ static int ext4_da_get_block_write(struc
>  		ret = 0;
>  	}
>  
> -out:
> -	if (handle && !IS_ERR(handle))
> -		ext4_journal_stop(handle);
> -
>  	return ret;
>  }
>  
> +/*
> + * For now just follow the DIO way to estimate the max credits
> + * needed to write out EXT4_MAX_WRITEBACK_PAGES.
> + * todo: need to calculate the max credits need for
> + * extent based files, currently the DIO credits is based on
> + * indirect-blocks mapping way.
> + *
> + * Probably should have a generic way to calculate credits
> + * for DIO, writepages, and truncate
> + */
> +#define EXT4_MAX_WRITEBACK_PAGES	DIO_MAX_BLOCKS
> +#define EXT4_MAX_WRITEBACK_CREDITS	DIO_CREDITS
> +
>  static int ext4_da_writepages(struct address_space *mapping,
>  				struct writeback_control *wbc)
>  {
> -	return mpage_da_writepages(mapping, wbc, ext4_da_get_block_write);
> +	struct inode *inode = mapping->host;
> +	handle_t *handle = NULL;
> +	int needed_blocks;
> +	int ret = 0;
> +	unsigned range_cyclic;
> +	long to_write;
> +
> +	/*
> +	 * Estimate the worst-case credits needed to write out
> +	 * EXT4_MAX_WRITEBACK_PAGES pages
> +	 */
> +	needed_blocks = EXT4_MAX_WRITEBACK_CREDITS;
> +
> +	to_write = wbc->nr_to_write;
> +	range_cyclic = wbc->range_cyclic;
> +	wbc->range_cyclic = 1;
> +
> +	while (!ret && to_write) {
> +		/* start a new transaction*/
> +		handle = ext4_journal_start(inode, needed_blocks);
> +		if (IS_ERR(handle)) {
> +			ret = PTR_ERR(handle);
> +			goto out_writepages;
> +		}
> +		/*
> +		 * set the max dirty pages could be write at a time
> +		 * to fit into the reserved transaction credits
> +		 */
> +		if (wbc->nr_to_write > EXT4_MAX_WRITEBACK_PAGES)
> +			wbc->nr_to_write = EXT4_MAX_WRITEBACK_PAGES;
> +		to_write -= wbc->nr_to_write;
> +
> +		ret = mpage_da_writepages(mapping, wbc, ext4_da_get_block_write);
> +		ext4_journal_stop(handle);
> +		to_write += wbc->nr_to_write;
> +	}

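You need to set wbc->nr_to_write inside the loop before calling
mpage_da_writepages(), so that each subsequent iteration starts from a
fresh value. As written, it is only ever clamped downward, so after the
first pass it still holds whatever the previous call left over.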
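
A minimal user-space sketch of the loop with that reset in place, just to
illustrate the bookkeeping. It assumes a stand-in fake_writepages() that
consumes wbc->nr_to_write the way mpage_da_writepages() is expected to.
struct fake_wbc, fake_writepages(), MAX_WRITEBACK_PAGES and
chunked_writeback() are hypothetical names for illustration only, and no
journal handle is involved here.

```c
#include <assert.h>

/* Hypothetical stand-in for EXT4_MAX_WRITEBACK_PAGES. */
#define MAX_WRITEBACK_PAGES 64

/* Hypothetical, stripped-down stand-in for struct writeback_control. */
struct fake_wbc {
	long nr_to_write;	/* pages still allowed in the current call */
};

/*
 * Hypothetical stand-in for mpage_da_writepages(): writes up to
 * wbc->nr_to_write of the remaining dirty pages and decrements
 * wbc->nr_to_write by the number actually written.
 */
long fake_writepages(struct fake_wbc *wbc, long *dirty)
{
	long n;

	assert(wbc->nr_to_write >= 0 && *dirty >= 0);
	n = wbc->nr_to_write < *dirty ? wbc->nr_to_write : *dirty;
	*dirty -= n;
	wbc->nr_to_write -= n;
	return n;
}

/*
 * Write back up to wbc->nr_to_write pages in chunks of at most
 * MAX_WRITEBACK_PAGES. Returns the number of pages written and
 * leaves the unused budget in wbc->nr_to_write.
 */
long chunked_writeback(struct fake_wbc *wbc, long dirty)
{
	long to_write = wbc->nr_to_write;
	long written = 0;

	while (to_write > 0 && dirty > 0) {
		/* Reset the per-chunk budget on every iteration. */
		wbc->nr_to_write = to_write < MAX_WRITEBACK_PAGES ?
				   to_write : MAX_WRITEBACK_PAGES;
		to_write -= wbc->nr_to_write;

		/* A transaction would be started and stopped around this. */
		written += fake_writepages(wbc, &dirty);

		/* Give back whatever this chunk did not use. */
		to_write += wbc->nr_to_write;
	}
	wbc->nr_to_write = to_write;
	return written;
}
```

With a budget of 200 pages and 150 dirty pages, this runs three chunks
(64, 64, 22) and hands 50 unused pages of budget back to the caller.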

> +
> +out_writepages:
> +	wbc->nr_to_write = to_write;
> +	wbc->range_cyclic = range_cyclic;
> +	return ret;
>  }
>  
>  static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
> Index: linux-2.6.25-rc9/mm/page-writeback.c
> ===================================================================
> --- linux-2.6.25-rc9.orig/mm/page-writeback.c	2008-04-16 11:00:20.000000000 -0700
> +++ linux-2.6.25-rc9/mm/page-writeback.c	2008-04-16 11:07:59.000000000 -0700
> @@ -816,7 +816,7 @@ int write_cache_pages(struct address_spa
>  	pagevec_init(&pvec, 0);
>  	if (wbc->range_cyclic) {
>  		index = mapping->writeback_index; /* Start from prev offset */
> -		end = -1;
> +		end = wbc->range_end >> PAGE_CACHE_SHIFT;

Hmm. There are other callers of write_cache_pages() that use
"range_cyclic". Did you check them to make sure they set range_end
correctly?
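
To make the concern concrete, here is a small user-space sketch of how the
end index would be derived before and after the one-line change, assuming a
caller that sets range_cyclic but leaves range_end at 0. SKETCH_PAGE_SHIFT,
struct sketch_wbc and both helper functions are hypothetical names; only
the two "end = ..." expressions are taken from the patch.

```c
#include <limits.h>

/* Hypothetical stand-in for PAGE_CACHE_SHIFT (4 KiB pages assumed). */
#define SKETCH_PAGE_SHIFT 12

/* Hypothetical subset of struct writeback_control. */
struct sketch_wbc {
	long long range_start;
	long long range_end;
	unsigned range_cyclic;
};

/* Before the change: cyclic writeback scans to the end of the file. */
unsigned long end_index_original(const struct sketch_wbc *wbc)
{
	if (wbc->range_cyclic)
		return ULONG_MAX;	/* "end = -1" in an unsigned page index */
	return (unsigned long)(wbc->range_end >> SKETCH_PAGE_SHIFT);
}

/* After the change: cyclic writeback also honours range_end. */
unsigned long end_index_patched(const struct sketch_wbc *wbc)
{
	return (unsigned long)(wbc->range_end >> SKETCH_PAGE_SHIFT);
}
```

If any existing caller really does leave range_end at 0, the patched
expression would give it an end index of 0 instead of "no limit".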

Thanks,
Badari

