public inbox for linux-ext4@vger.kernel.org
From: Mingming Cao <cmm@us.ibm.com>
To: Theodore Tso <tytso@mit.edu>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>,
	ext4 development <linux-ext4@vger.kernel.org>
Subject: Re: ENOSPC returned during writepages
Date: Wed, 20 Aug 2008 16:22:15 -0700
Message-ID: <1219274535.7895.55.camel@mingming-laptop>
In-Reply-To: <1219265808.7895.14.camel@mingming-laptop>


On Wed, 2008-08-20 at 13:56 -0700, Mingming Cao wrote:
> On Wed, 2008-08-20 at 07:53 -0400, Theodore Tso wrote:
> > On Wed, Aug 20, 2008 at 04:16:44PM +0530, Aneesh Kumar K.V wrote:
> > > > mpage_da_map_blocks block allocation failed for inode 323784 at logical
> > > > offset 313 with max blocks 11 with error -28
> > > > This should not happen.!! Data will be lost
> > 
> > We don't actually lose the data if free blocks are subsequently made
> > available, correct?
> > 
> > > I tried this patch. There are still multiple ways we can get wrong free
> > > block count. The patch reduced the number of errors. So we are doing
> > > better with patch. But I guess we can't use the percpu_counter based
> > > free block accounting with delalloc. Without delalloc it is ok even if
> > > we find some wrong free blocks count. The actual block allocation will fail in
> > > that case and we handle it perfectly fine. With delalloc we cannot
> > > afford to fail the block allocation. Should we look at a free block
> > > accounting rewrite using a simple ext4_fsblk_t and a spin lock?
> > 
> > It would be a shame if we did given that the whole point of the percpu
> > counter was to avoid a scalability bottleneck.  Perhaps we could take
> > a filesystem-level spinlock only when the number of free blocks as
> > reported by the percpu_counter falls below some critical level?
> > 
> 
> Agreed, and perhaps we should fall back to non-delalloc mode if the
> number of free blocks falls below some critical level?
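
For illustration only, here is a rough user-space sketch of the idea Ted describes above: trust the cheap approximate count while free space is plentiful, and pay for an exact sum only near the watermark. This is a single-threaded model, not kernel code. The names model_counter, MODEL_NCPU, MODEL_BATCH and the model_* helpers are made up for this sketch, and the real percpu_counter needs locking that the model leaves out.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NCPU  4   /* hypothetical CPU count */
#define MODEL_BATCH 32  /* hypothetical per-CPU batch size */

struct model_counter {
	long long global;            /* cheap, possibly stale total */
	long long local[MODEL_NCPU]; /* per-CPU deltas not yet folded in */
};

/* Apply a delta on one CPU; fold it into the global total once it
 * reaches the batch size in either direction. */
static void model_add(struct model_counter *c, int cpu, long long delta)
{
	assert(cpu >= 0 && cpu < MODEL_NCPU);
	c->local[cpu] += delta;
	if (c->local[cpu] >= MODEL_BATCH || c->local[cpu] <= -MODEL_BATCH) {
		c->global += c->local[cpu];
		c->local[cpu] = 0;
	}
}

/* Fast read: ignores the unflushed per-CPU deltas. */
static long long model_read_fast(const struct model_counter *c)
{
	return c->global;
}

/* Slow read: walks every CPU, so the result is exact in this model. */
static long long model_sum_exact(const struct model_counter *c)
{
	long long sum = c->global;

	for (int cpu = 0; cpu < MODEL_NCPU; cpu++)
		sum += c->local[cpu];
	return sum;
}

/* Use the fast read unless the margin is smaller than the worst-case
 * drift, in which case fall back to the exact sum. */
static bool model_has_free_blocks(const struct model_counter *c,
				  long long need)
{
	long long free_blocks = model_read_fast(c);
	long long max_drift = (long long)MODEL_NCPU * MODEL_BATCH;

	if (free_blocks - need < max_drift)
		free_blocks = model_sum_exact(c);
	return free_blocks >= need;
}
```

With a global value of 200 and each of the four model CPUs holding an unflushed delta of -31, the fast read still reports 200 while the exact sum is 76, which is the kind of error that matters once delalloc has promised blocks it cannot get.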

How about this?

ext4: fall back to non delalloc mode if filesystem is almost full
From: Mingming Cao <cmm@us.ibm.com>

When the filesystem is close to full (free blocks below the watermark
NR_CPUS * 4) and there are not enough free blocks to reserve for
delayed allocation, this patch tries to fall back to non-delayed
allocation mode instead of returning ENOSPC to the user.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
 fs/ext4/ext4.h  |    2 -
 fs/ext4/inode.c |   61 ++++++++++++++++++++++++++++++++++++++++++++------------
 fs/ext4/namei.c |    4 +--
 3 files changed, 51 insertions(+), 16 deletions(-)

Index: linux-2.6.27-rc3/fs/ext4/inode.c
===================================================================
--- linux-2.6.27-rc3.orig/fs/ext4/inode.c	2008-08-20 15:20:10.000000000 -0700
+++ linux-2.6.27-rc3/fs/ext4/inode.c	2008-08-20 16:13:48.000000000 -0700
@@ -2391,6 +2391,25 @@
 	return ret;
 }
 
+/*
+ * If the filesystem is almost full and delalloc cannot reserve
+ * enough free blocks to guard against a later ENOSPC, fall back
+ * to non-delalloc mode.
+ */
+static int ext4_write_begin_nondelalloc(struct file *file,
+				struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata)
+{
+	struct inode *inode = mapping->host;
+
+	/* turn off delalloc for this inode */
+	ext4_set_aops(inode, 0);
+
+	return mapping->a_ops->write_begin(file, mapping, pos, len,
+					   flags, pagep, fsdata);
+}
+
 static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
 				loff_t pos, unsigned len, unsigned flags,
 				struct page **pagep, void **fsdata)
@@ -2435,8 +2454,14 @@
 		page_cache_release(page);
 	}
 
-	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
-		goto retry;
+	if (ret == -ENOSPC) {
+		if (ext4_should_retry_alloc(inode->i_sb, &retries))
+			goto retry;
+		else
+			ret = ext4_write_begin_nondelalloc(file, mapping, pos,
+							   len, flags, pagep,
+							   fsdata);
+	}
 out:
 	return ret;
 }
@@ -3008,16 +3033,26 @@
 	.is_partially_uptodate  = block_is_partially_uptodate,
 };
 
-void ext4_set_aops(struct inode *inode)
+#define	EXT4_MIN_FREE_BLOCKS	(NR_CPUS*4)
+
+void ext4_set_aops(struct inode *inode, int delalloc)
 {
-	if (ext4_should_order_data(inode) &&
-		test_opt(inode->i_sb, DELALLOC))
-		inode->i_mapping->a_ops = &ext4_da_aops;
-	else if (ext4_should_order_data(inode))
+	if (test_opt(inode->i_sb, DELALLOC)) {
+		if (ext4_has_free_blocks(EXT4_SB(inode->i_sb),
+			 EXT4_MIN_FREE_BLOCKS) < EXT4_MIN_FREE_BLOCKS)
+			delalloc = 0;
+
+		if (delalloc) {
+			inode->i_mapping->a_ops = &ext4_da_aops;
+			return;
+		} else
+			printk(KERN_INFO "filesystem is close to full, "
+				"delayed allocation is turned off for "
+				"inode %lu\n", inode->i_ino);
+	}
+
+	if (ext4_should_order_data(inode))
 		inode->i_mapping->a_ops = &ext4_ordered_aops;
-	else if (ext4_should_writeback_data(inode) &&
-		 test_opt(inode->i_sb, DELALLOC))
-		inode->i_mapping->a_ops = &ext4_da_aops;
 	else if (ext4_should_writeback_data(inode))
 		inode->i_mapping->a_ops = &ext4_writeback_aops;
 	else
@@ -4011,7 +4046,7 @@
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext4_file_inode_operations;
 		inode->i_fop = &ext4_file_operations;
-		ext4_set_aops(inode);
+		ext4_set_aops(inode, 1);
 	} else if (S_ISDIR(inode->i_mode)) {
 		inode->i_op = &ext4_dir_inode_operations;
 		inode->i_fop = &ext4_dir_operations;
@@ -4020,7 +4055,7 @@
 			inode->i_op = &ext4_fast_symlink_inode_operations;
 		else {
 			inode->i_op = &ext4_symlink_inode_operations;
-			ext4_set_aops(inode);
+			ext4_set_aops(inode, 1);
 		}
 	} else {
 		inode->i_op = &ext4_special_inode_operations;
@@ -4783,7 +4818,7 @@
 		EXT4_I(inode)->i_flags |= EXT4_JOURNAL_DATA_FL;
 	else
 		EXT4_I(inode)->i_flags &= ~EXT4_JOURNAL_DATA_FL;
-	ext4_set_aops(inode);
+	ext4_set_aops(inode, 1);
 
 	jbd2_journal_unlock_updates(journal);
 
Index: linux-2.6.27-rc3/fs/ext4/ext4.h
===================================================================
--- linux-2.6.27-rc3.orig/fs/ext4/ext4.h	2008-08-20 15:41:36.000000000 -0700
+++ linux-2.6.27-rc3/fs/ext4/ext4.h	2008-08-20 15:41:56.000000000 -0700
@@ -1070,7 +1070,7 @@
 extern void ext4_truncate (struct inode *);
 extern void ext4_set_inode_flags(struct inode *);
 extern void ext4_get_inode_flags(struct ext4_inode_info *);
-extern void ext4_set_aops(struct inode *inode);
+extern void ext4_set_aops(struct inode *inode, int delalloc);
 extern int ext4_writepage_trans_blocks(struct inode *);
 extern int ext4_meta_trans_blocks(struct inode *, int nrblocks, int idxblocks);
 extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
Index: linux-2.6.27-rc3/fs/ext4/namei.c
===================================================================
--- linux-2.6.27-rc3.orig/fs/ext4/namei.c	2008-08-20 15:42:13.000000000 -0700
+++ linux-2.6.27-rc3/fs/ext4/namei.c	2008-08-20 15:42:41.000000000 -0700
@@ -1738,7 +1738,7 @@
 	if (!IS_ERR(inode)) {
 		inode->i_op = &ext4_file_inode_operations;
 		inode->i_fop = &ext4_file_operations;
-		ext4_set_aops(inode);
+		ext4_set_aops(inode, 1);
 		err = ext4_add_nondir(handle, dentry, inode);
 	}
 	ext4_journal_stop(handle);
@@ -2210,7 +2210,7 @@
 
 	if (l > sizeof (EXT4_I(inode)->i_data)) {
 		inode->i_op = &ext4_symlink_inode_operations;
-		ext4_set_aops(inode);
+		ext4_set_aops(inode, 1);
 		/*
 		 * page_symlink() calls into ext4_prepare/commit_write.
 		 * We have a transaction open.  All is sweetness.  It also sets
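
To make the intended behaviour easier to review, here is a minimal user-space model of the selection logic in ext4_set_aops(), assuming the intent is that delalloc is dropped once free blocks fall below the watermark. Everything prefixed model_ or MODEL_ is hypothetical and exists only for this sketch; MODEL_AOPS_OTHER stands in for whatever the final else branch selects, which the hunk above does not show.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the address_space_operations tables. */
enum model_aops {
	MODEL_AOPS_DELALLOC,
	MODEL_AOPS_ORDERED,
	MODEL_AOPS_WRITEBACK,
	MODEL_AOPS_OTHER,	/* final else branch, not shown in the hunk */
};

enum model_data_mode {
	MODEL_DATA_ORDERED,
	MODEL_DATA_WRITEBACK,
	MODEL_DATA_OTHER,
};

#define MODEL_NR_CPUS		4	/* assumed value for the sketch */
#define MODEL_MIN_FREE_BLOCKS	(MODEL_NR_CPUS * 4)

struct model_fs {
	bool delalloc_mount_opt;	/* models test_opt(sb, DELALLOC) */
	unsigned long long free_blocks;	/* models the free block count */
	enum model_data_mode mode;
};

/* Pick the aops for an inode. The caller passes want_delalloc = false
 * to force the non-delalloc path, as the ENOSPC fallback does. */
static enum model_aops model_set_aops(const struct model_fs *fs,
				      bool want_delalloc)
{
	assert(fs);

	if (fs->delalloc_mount_opt) {
		/* Below the watermark: do not promise blocks we may not get. */
		if (fs->free_blocks < MODEL_MIN_FREE_BLOCKS)
			want_delalloc = false;
		if (want_delalloc)
			return MODEL_AOPS_DELALLOC;
	}

	switch (fs->mode) {
	case MODEL_DATA_ORDERED:
		return MODEL_AOPS_ORDERED;
	case MODEL_DATA_WRITEBACK:
		return MODEL_AOPS_WRITEBACK;
	default:
		return MODEL_AOPS_OTHER;
	}
}
```

In this model a filesystem mounted with delalloc and plenty of free blocks gets the delalloc aops, while the same filesystem with fewer than MODEL_MIN_FREE_BLOCKS free, or a caller passing want_delalloc = false, gets the ordered or writeback aops according to its data mode.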


