Re: 2.6.33-rc1: kernel BUG at fs/ext4/inode.c:1063 (sparc)

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Dmitry Monakhov <dmonakhov@openvz.org>
To: tytso@mit.edu
Cc: Alexander Beregalov <a.beregalov@gmail.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-ext4@vger.kernel.org, Jens Axboe <jens.axboe@oracle.com>,
	dmitry.torokhov@gmail.com, Jan Kara <jack@suse.cz>,
	aanisimov@inbox.ru, pl4nkton@googlemail.com
Subject: Re: 2.6.33-rc1: kernel BUG at fs/ext4/inode.c:1063 (sparc)
Date: Wed, 30 Dec 2009 16:18:09 +0300	[thread overview]
Message-ID: <87k4w4vbvy.fsf@openvz.org> (raw)
In-Reply-To: 20091230053720.GK4429@thunk.org

tytso@mit.edu writes:

> On Sun, Dec 27, 2009 at 10:51:59PM -0500, tytso@MIT.EDU wrote:
>> OK, i've been able to reproduce the problem using xfsqa test #74
>> (fstest) when an ext3 file system is mounted the ext4 file system
>> driver.  I was then able to bisect it down to commit d21cd8f6, which
>> was introduced between 2.6.33-rc1 and 2.6.33-rc2, as part of
>> quota/ext4 patch series pushed by Jan.
>
> OK, here's a patch which I think should avoid the BUG in
> fs/ext4/inode.c.  It should fix the regression, but in the long run we
> need to pretty seriously rethink how we account for the need for
> potentially new meta-data blocks when doing delayed allocation.
>
> The remaining problem with this machinery is that
> ext4_da_update_reserve_space() and ext4_da_release_space() is that
> they both try to calculate how many metadata blocks will potentially
> required by calculating ext4_calc_metadata_amount() based on the
> number of delayed allocation blocks found in i_reserved_data_blocks.
> The problem is that ext4_calc_metadata_amount() assumes that the
> number of blocks passed to it is contiguous, and what might be left
> remaining to be written in the page cache could be anything but
> contiguous.  This is a problem which has always been there, so it's
> not a regression per se; just a design flaw.
Hello, I've finally able to reproduce the issue. I'm agree with your
diagnose. But while looking in to code i've found some questions
see late in the message.
>
> The patch below should fixes the regression caused by commit d21cd8f,
> but we need to look much more closely to find a better way of
> accounting for the potential need for metadata for inodes facing
> delayed allocation.  Could people who are having problems with the BUG
> in line 1063 of fs/ext4/inode.c try this patch?
>
> Thanks!!
>
>     	      	       		    	   - Ted
>
>
> commit 48b71e562ecd35ab12f6b6420a92fb3c9145da92
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Wed Dec 30 00:04:04 2009 -0500
>
>     ext4: Patch up how we claim metadata blocks for quota purposes
>     
>     Commit d21cd8f triggered a BUG in the function
>     ext4_da_update_reserve_space() found in fs/ext4/inode.c, which was
>     caused by fact that ext4_calc_metadata_amount() can over-estimate how
>     many metadata blocks will be needed, especially when using direct
>     block-mapped files.  Work around this by not claiming any excess
>     metadata blocks than we are prepared to claim at this point.
>     
>     Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 3e3b454..d6e84b4 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1058,14 +1058,23 @@ static void ext4_da_update_reserve_space(struct inode *inode, int used)
>  	mdb_free = EXT4_I(inode)->i_reserved_meta_blocks - mdb;
>  
>  	if (mdb_free) {
> -		/* Account for allocated meta_blocks */
> +		/* 
> +		 * Account for allocated meta_blocks; it is possible
> +		 * for us to have allocated more meta blocks than we
> +		 * are prepared to free at this point.  This is
> +		 * because ext4_calc_metadata_amount can over-estimate
> +		 * how many blocks are still needed.  So we may not be
> +		 * able to claim all of the allocated meta blocks
> +		 * right away.  The accounting will work out in the end...
> +		 */
>  		mdb_claim = EXT4_I(inode)->i_allocated_meta_blocks;
> -		BUG_ON(mdb_free < mdb_claim);
> +		if (mdb_free < mdb_claim)
> +			mdb_claim = mdb_free;
>  		mdb_free -= mdb_claim;
>  
>  		/* update fs dirty blocks counter */
>  		percpu_counter_sub(&sbi->s_dirtyblocks_counter, mdb_free);
> -		EXT4_I(inode)->i_allocated_meta_blocks = 0;
> +		EXT4_I(inode)->i_allocated_meta_blocks -= mdb_claim;
>  		EXT4_I(inode)->i_reserved_meta_blocks = mdb;
>  	}
>  
> @@ -1845,7 +1854,7 @@ repeat:
>  static void ext4_da_release_space(struct inode *inode, int to_free)
>  {
>  	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> -	int total, mdb, mdb_free, release;
> +	int total, mdb, mdb_free, mdb_claim, release;
>  
>  	if (!to_free)
>  		return;		/* Nothing to release, exit */
> @@ -1874,6 +1883,16 @@ static void ext4_da_release_space(struct inode *inode, int to_free)
>  	BUG_ON(mdb > EXT4_I(inode)->i_reserved_meta_blocks);
>  	mdb_free = EXT4_I(inode)->i_reserved_meta_blocks - mdb;
>  
> +	if (mdb_free) {
> +		/* Account for allocated meta_blocks */
> +		mdb_claim = EXT4_I(inode)->i_allocated_meta_blocks;
> +		if (mdb_free < mdb_claim)
> +			mdb_claim = mdb_free;
> +		mdb_free -= mdb_claim;
> +
> +		EXT4_I(inode)->i_allocated_meta_blocks -= mdb_claim;
> +	}
> +
Seems what this is not enough.
Just imagine, we may have following call-trace:

 userspace pwrite(fd, d, 1000, off)
 ->ext4_da_reserve_space(inode, 1000)
   ->dq_reserve_space(1000 + md_needed)
 userspace ftruncate(fd, off) /* "off"  is the same as in pwrite call */
 ->ext4_da_invalidatepage()
   ->ext4_da_page_release_reservation()
    ->ext4_da_release_space()
<<< And we decrease ->i_allocated_meta_blocks only if (mdb_free > 0)
 userspace close(fd)
So reserved metadata blocks will leak. I'm able to reproduce it like this:
 quotacheck -cu /mnt
 quotaon /mnt
 fsstres -p 16 -d /mnt -l999999999 -n99999999&
 sleep 180
 killall -9 fsstress
 sync; sync;
 cp /mnt/aquota.user > q1
 quotaoff /mnt
 quotacheck -cu /mnt/ # recaculate real quota usage.
 cp /mnt/aquota.user > q2
 diff -up q1 q2 # in my case i've found 1 block leaked.

IMHO we may drop i_allocated_meta_block in ext4_release_file()
But while looking in to this function i've found another question
about locking
static int ext4_release_file(struct inode *inode, struct file *filp)
{
	if (EXT4_I(inode)->i_state & EXT4_STATE_DA_ALLOC_CLOSE) {
		ext4_alloc_da_blocks(inode);
		EXT4_I(inode)->i_state &= ~EXT4_STATE_DA_ALLOC_CLOSE;
<<< Seems what i_state modification must being protected by i_mutex,
 but currently caller don't have to hold it.
.....                
	}

>  	release = to_free + mdb_free;
>  
>  	/* update fs dirty blocks counter for truncate case */
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

next prev parent reply	other threads:[~2009-12-30 13:18 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-12-24 22:28 2.6.33-rc1: kernel BUG at fs/ext4/inode.c:1063 (sparc) Alexander Beregalov
2009-12-24 22:49 ` Alexander Beregalov
2009-12-25 12:31   ` Dmitry Monakhov
2009-12-25 12:31     ` Dmitry Monakhov
2009-12-25 19:33     ` Alexander Beregalov
2009-12-25 19:33       ` Alexander Beregalov
2009-12-25 23:47       ` Dmitry Monakhov
2009-12-25 23:47         ` Dmitry Monakhov
2009-12-27 20:32         ` Alexander Beregalov
2009-12-27 20:32           ` Alexander Beregalov
2009-12-27 21:38           ` Dmitry Torokhov
2009-12-27 21:38             ` Dmitry Torokhov
2009-12-27 22:52           ` tytso
2009-12-27 23:02             ` Alexander Beregalov
2009-12-27 23:02               ` Alexander Beregalov
2009-12-28  3:51               ` tytso
2009-12-30  5:37                 ` tytso
2009-12-30  5:37                   ` tytso
2009-12-30 13:18                   ` Dmitry Monakhov [this message]
2009-12-30 17:45                     ` tytso
2009-12-30 17:48                     ` tytso
2009-12-24 23:05 ` tytso
2009-12-24 23:15   ` tytso

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87k4w4vbvy.fsf@openvz.org \
    --to=dmonakhov@openvz.org \
    --cc=a.beregalov@gmail.com \
    --cc=aanisimov@inbox.ru \
    --cc=dmitry.torokhov@gmail.com \
    --cc=jack@suse.cz \
    --cc=jens.axboe@oracle.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pl4nkton@googlemail.com \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.