From: Joseph Qi <joseph.qi@huawei.com>
To: "Theodore Ts'o" <tytso@mit.edu>
Cc: <linux-ext4@vger.kernel.org>
Subject: Re: Issue in ext4 rename
Date: Fri, 3 Apr 2015 17:57:25 +0800 [thread overview]
Message-ID: <551E6405.8050600@huawei.com> (raw)
In-Reply-To: <20150402140258.GC6873@thunk.org>
Hi Ted,
Thanks very much for your quick and detailed reply.
Yes, currently it will behave as RO, or PANIC or CONT based on the
mounted options.
You suggested a way to make sure the allocation cannot fail.
I am wondering if we can omit this handle when commit, for example,
introducing a way that invalids the handle in jbd2.
On 2015/4/2 22:02, Theodore Ts'o wrote:
> On Thu, Apr 02, 2015 at 06:49:07PM +0800, Joseph Qi wrote:
>> Hi all,
>> In ext4_rename_delete, it only logs a warning if ext4_delete_entry
>> fails.
>> IMO, it may lead to an inode with two entries (old and new), thus
>> filesystem will be inconsistent.
>> The case is described below:
>> ext4_rename
>> --> ext4_journal_start
>> --> ext4_add_entry (new)
>> --> ext4_rename_delete (old)
>> --> ext4_delete_entry
>> --> ext4_journal_get_write_access
>> *failed* because of -ENOMEM
>> --> ext4_journal_stop
>> Does anyone have an idea to resolve this issue?
>
> I'm guessing you must be using one of the kernel patches or
> pre-release kernels that is allowing GFP_NOFS allocations to fail.
> Currently in this case, we call ext4_std_error() which will declare
> the file system as inconsistent, and either mark the file system
> read/only, panic the system, or, if the error mode is set to
> "continue" (what I nick name the "don't worry, be happy mode"), the
> error gets ignored. What I recommend for companies that have a large
> number of disks and don't want to panic the entire system when a disk
> gets marked bad is to have monitoring software which notices when a
> disk gets marked inconsistent (either by scraping dmesg or by sending
> a notification out via a netlink socket[1]), and then instructing the
> cluster file system to declare the disk bad, and to eventually arrange
> to the file system fsck'ed.
>
> [1] At Google we have a patch which does this; I believe a version of
> the patchd did get sent out to the ext4 list, but the person who
> worked on it never had time to get it properly cleaned up so it could
> get upstreamed, and we got lost in debates about the proper way to
> handle such notifications, should they be done in the VFS, or
> conflated with quota errors, etc.) And at some point during the
> interface paint-shedding, the debate stalled out.
>
>
> In any case, there was a huge debate at the LSF/MM about this, where
> file system engineers tried to explain to VM folks why in some cases
> backing out of a memory failure is close to impossible, unless you
> want to add a transaction rollback system ala an RDBMS (and suffer the
> complexity and performance penalties of said RDBMS transaction
> rollback mechanism). You can read more about this at:
> https://lwn.net/Articles/636017/ and https://lwn.net/Articles/636797/.
>
> In the short term my plan was to try to create a wrapper for all
> kmalloc and slab allocation requests which would allow us to track
> memory used, pass in GFP_NOFAIL where necessary, and to loop in cases
> where GFP_NOFAIL requests started failing (because like Dave Chinner,
> I trust VM folks *this* much -->.<---). In the jbd2 layer, this would
> have to be done via some kind of optional callback system, since I
> don't want to force ocfs2 to have to use this scheme if they don't
> want to.
>
> In the very short term, if you can't figure out how to fix or rollback
> the patch which caused the GFP_NOFS allocations to start failing, you
> could simply replace all instances of GFP_NOFS with
> GFP_NOFS|GFP_NOFAIL in fs/jbd2 and fs/ext4.
>
> Regards,
>
> - Ted
>
> .
>
next prev parent reply other threads:[~2015-04-03 9:57 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-04-02 10:49 Issue in ext4 rename Joseph Qi
2015-04-02 14:02 ` Theodore Ts'o
2015-04-03 9:57 ` Joseph Qi [this message]
2015-04-03 15:06 ` Theodore Ts'o
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=551E6405.8050600@huawei.com \
--to=joseph.qi@huawei.com \
--cc=linux-ext4@vger.kernel.org \
--cc=tytso@mit.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.