Re: Delayed allocation and page_lock vs transaction start ordering

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mingming Cao <cmm@us.ibm.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org, sandeen@redhat.com
Subject: Re: Delayed allocation and page_lock vs transaction start ordering
Date: Wed, 16 Apr 2008 11:24:20 -0700	[thread overview]
Message-ID: <1208370260.3603.4.camel@localhost.localdomain> (raw)
In-Reply-To: <20080416103531.GC6116@duck.suse.cz>

On Wed, 2008-04-16 at 12:35 +0200, Jan Kara wrote:
> On Tue 15-04-08 16:33:17, Mingming Cao wrote:
> > On Tue, 2008-04-15 at 16:28 -0700, Mingming Cao wrote:
> > > On Tue, 2008-04-15 at 11:08 -0700, Mingming Cao wrote:
> > > > On Tue, 2008-04-15 at 18:14 +0200, Jan Kara wrote:
> > > > >   Hi,
> > > > > 
> > > > >   I've ported my patch inversing locking ordering of page_lock and
> > > > > transaction start to ext4 (on top of ext4 patch queue). Everything except
> > > > > delayed allocation is converted (the patch is below for interested
> > > > > readers). The question is how to proceed with delayed allocation. Its
> > > > > current implementation in VFS is designed to work well with the old
> > > > > ordering (page lock first, then start a transaction). We could bend it to
> > > > > work with the new locking ordering but I really see no point since ext4 is
> > > > > the only user. 
> > > > 
> > > > I think the plan is port the changes to ext2/3/JFS and support delayed
> > > > allocation on those filesystems. 
> > > > 
> > > > > Also XFS has AFAIK ordering first start transaction, then
> > > > > lock pages so if we should ever merge delayed alloc implementations the new
> > > > > ordering would make it easier.
> > > > >   So what do people think here? Do you agree with reimplementing current
> > > > > mpage_da_... functions?
> > > > 
> > > > It worth a try, but I could not see how to bend delayed allocation to
> > > > work the new ordering:( With delayed allocation Ext4 gets into
> > > > writepage() directly with page locked, but we need to start transaction
> > > > to do block allocation...:(
> > > 
> > > Looked again it seems possible to reservse the order with delayed
> > > allocation. with ext3_da_writepgaes() we could start the journal before
> > > calling mpage_da_writepages()(which will lock the pages), instead of
> > > start the journal inside ext4_da_get_block_write(). So that we could get
> > > the locking order right. Just need to taking care of the estimated
> > > credits right.
> > > 
> > > How about this? (untested, just throw out for comment)
> > 
> > Seems sent out an old version, this version compiles
>   Thanks for the patch. Some comments are below.
> 
> > ---
> >  fs/ext4/inode.c |   53 ++++++++++++++++++++++++++++++++++++++++-------------
> >  1 file changed, 40 insertions(+), 13 deletions(-)
> > 
> > Index: linux-2.6.25-rc9/fs/ext4/inode.c
> > ===================================================================
> > --- linux-2.6.25-rc9.orig/fs/ext4/inode.c	2008-04-15 15:40:33.000000000 -0700
> > +++ linux-2.6.25-rc9/fs/ext4/inode.c	2008-04-15 16:32:10.000000000 -0700
> > @@ -1437,18 +1437,12 @@ static int ext4_da_get_block_prep(struct
> >  static int ext4_da_get_block_write(struct inode *inode, sector_t iblock,
> >  				   struct buffer_head *bh_result, int create)
> >  {
> > -	int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
> > +	int ret;
> >  	unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
> >  	loff_t disksize = EXT4_I(inode)->i_disksize;
> >  	handle_t *handle = NULL;
> >  
> > -	if (create) {
> > -		handle = ext4_journal_start(inode, needed_blocks);
> > -		if (IS_ERR(handle)) {
> > -			ret = PTR_ERR(handle);
> > -			goto out;
> > -		}
> > -	}
> > +	handle = ext4_journal_current_handle();
>   Maybe we could assert that handle != NULL? When using delayed allocation,
> a transaction should always be started.
> 
Agreed.

> >  	ret = ext4_get_blocks_wrap(handle, inode, iblock, max_blocks,
> >  				   bh_result, create, 0);
> > @@ -1483,17 +1477,51 @@ static int ext4_da_get_block_write(struc
> >  		ret = 0;
> >  	}
> >  
> > -out:
> > -	if (handle && !IS_ERR(handle))
> > -		ext4_journal_stop(handle);
> > -
> >  	return ret;
> >  }
> >  
> > +/*
> > + * For now just follow the DIO way to estimate the max credits
> > + * needed to write out EXT4_MAX_BUF_BLOCKS pages.
> > + * todo: need to calculate the max credits need for
> > + * extent based files, currently the DIO credits is based on
> > + * indirect-blocks mapping way.
> > + *
> > + * Probably should have a generic way to calculate credits
> > + * for DIO, writepages, and truncate
> > + */
> > +#define EXT4_MAX_BUF_BLOCKS	DIO_MAX_BLOCKS
> > +#define EXT4_MAX_BUF_CREDITS	DIO_CREDITS
> > +
> >  static int ext4_da_writepages(struct address_space *mapping,
> >  				struct writeback_control *wbc)
> >  {
> > -	return mpage_da_writepages(mapping, wbc, ext4_da_get_block_write);
> > +	struct inode *inode = mapping->host;
> > +	handle_t *handle = NULL;
> > +	int needed_blocks;
> > +	int ret;
> > +
> > +	/*
> > +	 * Estimate the worse case needed credits to write out
> > +	 * EXT4_MAX_BUF_BLOCKS pages
> > +	 */
> > +	needed_blocks = EXT4_MAX_BUF_CREDITS;
> > +
> > +	/* start the transaction with credits*/
> > +	handle = ext4_journal_start(inode, needed_blocks);
> > +	if (IS_ERR(handle)) {
> > +		ret = PTR_ERR(handle);
> > +		return ret;
> > +	}
> > +
> > +	/* set the max pages could be write-out at a time */
> > +	wbc->range_end = wbc->range_start +
> > +			EXT4_MAX_BUF_BLOCKS << PAGE_CACHE_SHIFT - 1;
>   I think limiting mpage_da_writepages through nr_to_write is better than
> through range_end. That way you don't count clean pages...
> 

You are right. 

> > +
> > +	ret = mpage_da_writepages(mapping, wbc, ext4_da_get_block_write);
> > +	ext4_journal_stop(handle);
>   But here we can't just stop. We have to write everything original caller
> has asked about (at least in WB_SYNC_ALL mode). But the question is where
> to resume because scanning the whole range again is kind-of excessive and
> prone do livelock with other process dirtying the file via mmap. Maybe if
> we slightly modified write_cache_pages() to always store in writeback_index
> where they finished, we could use this value.

Thanks for pointing this out.
How about this? 
---
 fs/ext4/inode.c     |   70 ++++++++++++++++++++++++++++++++++++++++++----------
 mm/page-writeback.c |    2 -
 2 files changed, 58 insertions(+), 14 deletions(-)

Index: linux-2.6.25-rc9/fs/ext4/inode.c
===================================================================
--- linux-2.6.25-rc9.orig/fs/ext4/inode.c	2008-04-16 09:59:00.000000000 -0700
+++ linux-2.6.25-rc9/fs/ext4/inode.c	2008-04-16 11:23:12.000000000 -0700
@@ -1437,18 +1437,13 @@ static int ext4_da_get_block_prep(struct
 static int ext4_da_get_block_write(struct inode *inode, sector_t iblock,
 				   struct buffer_head *bh_result, int create)
 {
-	int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
+	int ret;
 	unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
 	loff_t disksize = EXT4_I(inode)->i_disksize;
 	handle_t *handle = NULL;
 
-	if (create) {
-		handle = ext4_journal_start(inode, needed_blocks);
-		if (IS_ERR(handle)) {
-			ret = PTR_ERR(handle);
-			goto out;
-		}
-	}
+	J_ASSERT(handle != NULL || create == 0);
+	handle = ext4_journal_current_handle();
 
 	ret = ext4_get_blocks_wrap(handle, inode, iblock, max_blocks,
 				   bh_result, create, 0);
@@ -1483,17 +1478,66 @@ static int ext4_da_get_block_write(struc
 		ret = 0;
 	}
 
-out:
-	if (handle && !IS_ERR(handle))
-		ext4_journal_stop(handle);
-
 	return ret;
 }
 
+/*
+ * For now just follow the DIO way to estimate the max credits
+ * needed to write out EXT4_MAX_WRITEBACK_PAGES.
+ * todo: need to calculate the max credits need for
+ * extent based files, currently the DIO credits is based on
+ * indirect-blocks mapping way.
+ *
+ * Probably should have a generic way to calculate credits
+ * for DIO, writepages, and truncate
+ */
+#define EXT4_MAX_WRITEBACK_PAGES	DIO_MAX_BLOCKS
+#define EXT4_MAX_WRITEBACK_CREDITS	DIO_CREDITS
+
 static int ext4_da_writepages(struct address_space *mapping,
 				struct writeback_control *wbc)
 {
-	return mpage_da_writepages(mapping, wbc, ext4_da_get_block_write);
+	struct inode *inode = mapping->host;
+	handle_t *handle = NULL;
+	int needed_blocks;
+	int ret = 0;
+	unsigned range_cyclic;
+	long to_write;
+
+	/*
+	 * Estimate the worse case needed credits to write out
+	 * EXT4_MAX_BUF_BLOCKS pages
+	 */
+	needed_blocks = EXT4_MAX_WRITEBACK_CREDITS;
+
+	to_write = wbc->nr_to_write;
+	range_cyclic = wbc->range_cyclic;
+	wbc->range_cyclic = 1;
+
+	while (!ret && to_write) {
+		/* start a new transaction*/
+		handle = ext4_journal_start(inode, needed_blocks);
+		if (IS_ERR(handle)) {
+			ret = PTR_ERR(handle);
+			goto out_writepages;
+		}
+		/*
+		 * set the max dirty pages could be write at a time
+		 * to fit into the reserved transaction credits
+		 */
+		if (wbc->nr_to_write > EXT4_MAX_WRITEBACK_PAGES)
+			wbc->nr_to_write = EXT4_MAX_WRITEBACK_PAGES;
+		to_write -= wbc->nr_to_write;
+
+		ret = mpage_da_writepages(mapping, wbc, ext4_da_get_block_write);
+		ext4_journal_stop(handle);
+		to_write +=wbc->nr_to_write;
+	}
+
+out_writepages:
+	wbc->nr_to_write = to_write;
+	wbc->range_cyclic = range_cyclic;
+	return ret;
 }
 
 static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
Index: linux-2.6.25-rc9/mm/page-writeback.c
===================================================================
--- linux-2.6.25-rc9.orig/mm/page-writeback.c	2008-04-16 11:00:20.000000000 -0700
+++ linux-2.6.25-rc9/mm/page-writeback.c	2008-04-16 11:07:59.000000000 -0700
@@ -816,7 +816,7 @@ int write_cache_pages(struct address_spa
 	pagevec_init(&pvec, 0);
 	if (wbc->range_cyclic) {
 		index = mapping->writeback_index; /* Start from prev offset */
-		end = -1;
+		end = wbc->range_end >> PAGE_CACHE_SHIFT;
 	} else {
 		index = wbc->range_start >> PAGE_CACHE_SHIFT;
 		end = wbc->range_end >> PAGE_CACHE_SHIFT;

next prev parent reply	other threads:[~2008-04-16 18:24 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-04-15 16:14 Delayed allocation and page_lock vs transaction start ordering Jan Kara
2008-04-15 17:58 ` Badari Pulavarty
2008-04-16  9:26   ` Jan Kara
2008-04-15 18:08 ` Mingming Cao
2008-04-15 23:28   ` Mingming Cao
2008-04-15 23:33     ` Mingming Cao
2008-04-16 10:35       ` Jan Kara
2008-04-16 18:24         ` Mingming Cao [this message]
2008-04-16 19:55           ` Badari Pulavarty
2008-04-16  9:38   ` Jan Kara
2008-04-18 18:54     ` Andreas Dilger
2008-04-18 19:38       ` Mingming Cao
2008-04-21 17:13       ` Jan Kara
2008-05-21  8:21 ` Aneesh Kumar K.V
2008-05-26 17:21   ` Jan Kara
2008-05-26 18:00     ` Aneesh Kumar K.V
2008-05-27 12:43       ` Jan Kara
2008-05-27 15:11         ` Aneesh Kumar K.V
2008-05-28  9:33           ` Jan Kara
2008-05-28  9:43             ` Aneesh Kumar K.V
2008-05-28 10:33               ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1208370260.3603.4.camel@localhost.localdomain \
    --to=cmm@us.ibm.com \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=sandeen@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.