From: "Yan, Zheng" <zheng.z.yan@intel.com>
To: Alexandre Oliva <oliva@gnu.org>, Gregory Farnum <greg@inktank.com>
Cc: Sage Weil <sage@inktank.com>, "Yan, Zheng" <ukernel@gmail.com>,
ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: [PATCH] mds: handle setxattr ceph.parent
Date: Thu, 19 Dec 2013 21:27:38 +0800
Message-ID: <52B2F44A.1050109@intel.com>
In-Reply-To: <or4n656zij.fsf@livre.home>
On 12/19/2013 04:00 PM, Alexandre Oliva wrote:
> On Dec 18, 2013, Gregory Farnum <greg@inktank.com> wrote:
>
>> This probably wouldn't be too hard to get working properly,
>
> For some value of properly ;-)
>
> The current state of affairs is that the parent attribute only gets
> updated when the log segment is about to be expired. Worst case, using
> the proposed setxattr extension will force it to be updated earlier.
> How could that end up being a bad thing? It's not like we even use the
> parent attribute for anything while the inode remains in the mds
> journal.
>
> So we have the following possibilities of divergence:
>
> a) the inode is created or moved, and then someone calls
> setxattr(parent), and the file remains in place until the inode gets
> expired from the journal. the parent attribute will be updated at the
> time of the setxattr request, but it won't ever be used before the inode
> gets expired from the journal, at which point it would have been updated
> to the same value.
>
> b) the inode is absent from the journal, and someone calls
> setxattr(parent), and then moves the inode to a different location. the
> parent attribute will be updated (a nop unless the attribute is missing
> or wrong) at the time of the setxattr request, and then the move
> operation will cause the attribute to be overwritten at the time the
> inode is about to be expired from the journal
>
> c) the inode is moved, then setxattr(parent)ed, then moved again, before
> the initial move gets expired from the journal. the setxattr will be
> performed at the time it is requested, and it will be correct at that
> point; when the first inode move is expired from the journal, the parent
> attribute may or may not be updated (I'm not sure), but if it is, then
> we're back to the original behavior, and anyway, this incorrect value
> won't ever be used as long as the subsequent move remains in the journal
>
>
> Did I miss any case?
>
I think you are right. Setting the parent xattr directly won't compromise
the backtrace.
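The argument in cases a) to c) can be checked with a toy model. This is
only a sketch under stated assumptions, not the MDS code: the Inode class,
its method names and the paths are all hypothetical. It assumes the worst
case from c), where expiring a journal entry overwrites the parent
attribute with the possibly stale parent recorded in that entry, and it
assumes the stored attribute is consulted only when the inode has no
entries left in the journal.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Inode:
    """Toy inode: 'location' is the true parent, 'backtrace' the stored xattr."""
    location: str
    backtrace: Optional[str] = None
    journal: List[str] = field(default_factory=list)  # parents recorded by pending events

    @classmethod
    def created(cls, parent: str) -> "Inode":
        # A freshly created inode exists only in the journal, with no xattr yet.
        return cls(location=parent, journal=[parent])

    def move(self, new_parent: str) -> None:
        self.location = new_parent
        self.journal.append(new_parent)

    def setxattr_parent(self) -> None:
        # The proposed extension: write the current parent immediately.
        self.backtrace = self.location

    def expire_oldest(self) -> None:
        # Worst-case assumption: expiry writes the parent recorded in the
        # expired entry, even if a later move has made it stale.
        self.backtrace = self.journal.pop(0)

    def lookup_parent(self) -> Optional[str]:
        # Assumption: the xattr is used only once the journal holds nothing.
        return self.journal[-1] if self.journal else self.backtrace


def consistent_after_each_step(inode: Inode, steps) -> bool:
    """Apply each step and check that lookups always yield the true parent."""
    for name, *args in steps:
        getattr(inode, name)(*args)
        if inode.lookup_parent() != inode.location:
            return False
    return True


def case_a() -> bool:
    inode = Inode.created("/a")
    return consistent_after_each_step(
        inode, [("setxattr_parent",), ("expire_oldest",)])


def case_b() -> bool:
    inode = Inode(location="/a", backtrace="/a")
    return consistent_after_each_step(
        inode, [("setxattr_parent",), ("move", "/b"), ("expire_oldest",)])


def case_c() -> bool:
    inode = Inode(location="/a", backtrace="/a")
    return consistent_after_each_step(
        inode,
        [("move", "/b"), ("setxattr_parent",), ("move", "/c"),
         ("expire_oldest",), ("expire_oldest",)])


if __name__ == "__main__":
    print(case_a(), case_b(), case_c())
```

In case c) the stored attribute is briefly stale after the first expiry,
but the later move is still in the journal at that point, so the stale
value is never the one consulted.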
>
> Now, I've just run into another scenario in which this parent-setting
> is useful. I had to resort to --reset-journal (for reasons unknown), but
> any files and directories created recently, whose create operations
> hadn't been expired from the journal yet, won't get a parent attribute
> from ceph unless I actually moved them about to force an update. This
> means caps on them won't recover properly until I find out what they are
> and perform corrective action.
>
> Moving a bunch of objects is somewhat tricky, because if the mds
> restarts just at the wrong time, the move operation will seem to fail
> because the new mds won't recover that transaction correctly, precisely
> because the object is absent from the journal and missing the parent
> attribute. This sort of problem will often get a client stuck, or
> signal an error that may or may not indicate the operation failed.
>
> Plus, if I have to do that move dance on a large number of objects, odds
> are the mds will get slow enough that a standby-replay mds will decide
> it's dead and take over, and then fail to recover the ongoing
> operations. See where I'm going? :-)
>
> Having some means to update the internal bookkeeping parent attribute
> without actually touching the inodes, not even their ctimes, is a plus
> for this case.
>
>
> So now it's not just really old ceph nodes and a wish to have accurate
> information in the parent nodes, it's recovering from a --reset-journal
> required by some other failure I couldn't figure out.
Next time you encounter log corruption, please open a new issue at http://tracker.ceph.com/
Regards
Yan, Zheng
>
> (hmm... if I have 2*N replicas of PGs in the metadata pool and demand N
> replicas to be up for the PG to be deemed complete, if I shut down the N
> replicas that are up after they get an update and bring up the other N
> replicas, they will know they're out of date, right? IIUC that's what
> the down state is about, although I'm not sure where the OSDs get the
> info from to decide to enter that state; I've always assumed it was from
> pg versions known by the monitors)
>