* [RFC PATCH 00/10] ceph: fix long stalls during fsync and write_inode
@ 2016-11-04 11:34 Jeff Layton
  2016-11-04 11:34 ` [RFC PATCH 01/10] ceph: fix minor typo in unsafe_request_wait Jeff Layton
                   ` (9 more replies)
  0 siblings, 10 replies; 30+ messages in thread
From: Jeff Layton @ 2016-11-04 11:34 UTC
  To: ceph-devel; +Cc: idryomov, zyan, sage

This is the companion kernel patchset to this ceph userland pull req:

    https://github.com/ceph/ceph/pull/11710

The problem is that fsync can be very slow on ceph, as it waits for a
cap flush ack. Cap flushes however are generally considered by the MDS
to be background activity, so they don't entail a journal flush on their
own.

The idea here is to add a new flag to cap requests to inform the MDS
that the client is waiting on the reply and that it shouldn't delay it.

In addition, this adds support for the birthtime and change attribute in
cephfs. This is necessary since the new sync flag comes after those
fields.
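
To illustrate why the sync flag pulls in the btime and change attribute: the tail of a cap message is an ordered run of fields, each introduced at some struct version, so an encoder cannot emit a later field without first emitting every earlier one. The following is a minimal sketch in C, assuming the field order and sizes that patch 07 of this series shows for versions 2 through 7. The helper names (`put_le32`, `put_le64`, `encode_tail_v7`) and the `INLINE_NONE_PLACEHOLDER` constant are hypothetical and are not kernel APIs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's CEPH_INLINE_NONE marker. */
#define INLINE_NONE_PLACEHOLDER UINT64_MAX

/* Append a little-endian 32-bit value and advance the cursor. */
static void put_le32(unsigned char **p, uint32_t v)
{
	for (int i = 0; i < 4; i++)
		*(*p)++ = (unsigned char)(v >> (8 * i));
}

/* Append a little-endian 64-bit value and advance the cursor. */
static void put_le64(unsigned char **p, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		*(*p)++ = (unsigned char)(v >> (8 * i));
}

/*
 * Encode the versioned tail through version 7, in the order the patch
 * shows. Returns the number of bytes written; buf must hold at least
 * 36 bytes.
 */
static size_t encode_tail_v7(unsigned char *buf, uint64_t oldest_flush_tid)
{
	unsigned char *p = buf;

	put_le32(&p, 0);                       /* flock buffer size (version 2) */
	put_le64(&p, INLINE_NONE_PLACEHOLDER); /* inline version (version 4) */
	put_le32(&p, 0);                       /* inline data size */
	put_le32(&p, 0);                       /* osd_epoch_barrier (version 5) */
	put_le64(&p, oldest_flush_tid);        /* oldest_flush_tid (version 6) */
	put_le32(&p, (uint32_t)-1);            /* caller_uid (version 7) */
	put_le32(&p, (uint32_t)-1);            /* caller_gid (version 7) */

	return (size_t)(p - buf);
}
```

These seven fields take 36 bytes. The `extra_len` expression in patch 07 sums to 57, so the remaining 21 bytes (4 + 8 + 8 + 1) cover whatever follows caller_gid in the version 9 layout, which the quoted diff in this thread does not show.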

In current mainline ceph, the btime and change_attribute share a feature
flag with addr2 support. In order to test this, I had to move addr2 to a
new feature flag since the kernel doesn't have that support yet.

For now, this is just an RFC set until the userland parts are in place.

Jeff Layton (10):
  ceph: fix minor typo in unsafe_request_wait
  ceph: move xattr initialzation before the encoding past the
    ceph_mds_caps
  ceph: initialize i_version to 0 in new ceph inodes
  ceph: save off btime and change_attr when we get an InodeStat
  ceph: handle btime and change_attr updates in cap messages
  ceph: define new argument structure for send_cap_msg
  ceph: update cap message struct version to 9
  ceph: add sync parameter to send_cap_msg
  ceph: plumb "sync" parameter into __send_cap
  ceph: turn on btime and change_attr support

 fs/ceph/caps.c                     | 307 ++++++++++++++++++++++---------------
 fs/ceph/inode.c                    |  11 +-
 fs/ceph/mds_client.c               |  10 ++
 fs/ceph/mds_client.h               |   2 +
 fs/ceph/snap.c                     |   3 +
 fs/ceph/super.c                    |   3 +-
 fs/ceph/super.h                    |   5 +
 include/linux/ceph/ceph_features.h |   2 +
 8 files changed, 219 insertions(+), 124 deletions(-)

-- 
2.7.4


* Re: [RFC PATCH 07/10] ceph: update cap message struct version to 9
@ 2016-11-07 21:21 Sage Weil
  2016-11-07 21:51 ` Jeff Layton
  2016-11-07 23:15 ` Gregory Farnum
  0 siblings, 2 replies; 30+ messages in thread
From: Sage Weil @ 2016-11-07 21:21 UTC
  To: Jeff Layton
  Cc: Gregory Farnum, Yan, Zheng, ceph-devel, Ilya Dryomov, Zheng Yan


On Mon, 7 Nov 2016, Jeff Layton wrote:
> On Mon, 2016-11-07 at 20:09 +0000, Sage Weil wrote:
> > On Mon, 7 Nov 2016, Gregory Farnum wrote:
> > > On Mon, Nov 7, 2016 at 10:39 AM, Jeff Layton <jlayton@redhat.com> wrote:
> > > > On Mon, 2016-11-07 at 14:36 +0000, Sage Weil wrote:
> > > >> On Mon, 7 Nov 2016, Jeff Layton wrote:
> > > >> > On Mon, 2016-11-07 at 14:05 +0000, Sage Weil wrote:
> > > >> > > On Mon, 7 Nov 2016, Jeff Layton wrote:
> > > >> > > > On Mon, 2016-11-07 at 16:43 +0800, Yan, Zheng wrote:
> > > >> > > > > On Fri, Nov 4, 2016 at 8:57 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > > >> > > > > >
> > > >> > > > > > On Fri, 2016-11-04 at 07:34 -0400, Jeff Layton wrote:
> > > >> > > > > > >
> > > >> > > > > > > The userland ceph has MClientCaps at struct version 9. This brings the
> > > >> > > > > > > kernel up to the same version.
> > > >> > > > > > >
> > > >> > > > > > > With this change, we have to start tracking the btime and change_attr,
> > > >> > > > > > > so that the client can pass back sane values in cap messages. The
> > > >> > > > > > > client doesn't care about the btime at all, so this is just passed
> > > >> > > > > > > around, but the change_attr is used when ceph is exported via NFS.
> > > >> > > > > > >
> > > >> > > > > > > For now, the new "sync" parm is left at 0, to preserve the existing
> > > >> > > > > > > behavior of the client.
> > > >> > > > > > >
> > > >> > > > > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > >> > > > > > > ---
> > > >> > > > > > >  fs/ceph/caps.c | 33 +++++++++++++++++++++++++--------
> > > >> > > > > > >  1 file changed, 25 insertions(+), 8 deletions(-)
> > > >> > > > > > >
> > > >> > > > > > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > >> > > > > > > index 6e99866b1946..452f5024589f 100644
> > > >> > > > > > > --- a/fs/ceph/caps.c
> > > >> > > > > > > +++ b/fs/ceph/caps.c
> > > >> > > > > > > @@ -991,9 +991,9 @@ struct cap_msg_args {
> > > >> > > > > > >       struct ceph_mds_session *session;
> > > >> > > > > > >       u64                     ino, cid, follows;
> > > >> > > > > > >       u64                     flush_tid, oldest_flush_tid, size, max_size;
> > > >> > > > > > > -     u64                     xattr_version;
> > > >> > > > > > > +     u64                     xattr_version, change_attr;
> > > >> > > > > > >       struct ceph_buffer      *xattr_buf;
> > > >> > > > > > > -     struct timespec         atime, mtime, ctime;
> > > >> > > > > > > +     struct timespec         atime, mtime, ctime, btime;
> > > >> > > > > > >       int                     op, caps, wanted, dirty;
> > > >> > > > > > >       u32                     seq, issue_seq, mseq, time_warp_seq;
> > > >> > > > > > >       kuid_t                  uid;
> > > >> > > > > > > @@ -1026,13 +1026,13 @@ static int send_cap_msg(struct cap_msg_args *arg)
> > > >> > > > > > >
> > > >> > > > > > >       /* flock buffer size + inline version + inline data size +
> > > >> > > > > > >        * osd_epoch_barrier + oldest_flush_tid */
> > > >> > > > > > > -     extra_len = 4 + 8 + 4 + 4 + 8;
> > > >> > > > > > > +     extra_len = 4 + 8 + 4 + 4 + 8 + 4 + 4 + 4 + 8 + 8 + 1;
> > > >> > > > > > >       msg = ceph_msg_new(CEPH_MSG_CLIENT_CAPS, sizeof(*fc) + extra_len,
> > > >> > > > > > >                          GFP_NOFS, false);
> > > >> > > > > > >       if (!msg)
> > > >> > > > > > >               return -ENOMEM;
> > > >> > > > > > >
> > > >> > > > > > > -     msg->hdr.version = cpu_to_le16(6);
> > > >> > > > > > > +     msg->hdr.version = cpu_to_le16(9);
> > > >> > > > > > >       msg->hdr.tid = cpu_to_le64(arg->flush_tid);
> > > >> > > > > > >
> > > >> > > > > > >       fc = msg->front.iov_base;
> > > >> > > > > > > @@ -1068,17 +1068,30 @@ static int send_cap_msg(struct cap_msg_args *arg)
> > > >> > > > > > >       }
> > > >> > > > > > >
> > > >> > > > > > >       p = fc + 1;
> > > >> > > > > > > -     /* flock buffer size */
> > > >> > > > > > > +     /* flock buffer size (version 2) */
> > > >> > > > > > >       ceph_encode_32(&p, 0);
> > > >> > > > > > > -     /* inline version */
> > > >> > > > > > > +     /* inline version (version 4) */
> > > >> > > > > > >       ceph_encode_64(&p, arg->inline_data ? 0 : CEPH_INLINE_NONE);
> > > >> > > > > > >       /* inline data size */
> > > >> > > > > > >       ceph_encode_32(&p, 0);
> > > >> > > > > > > -     /* osd_epoch_barrier */
> > > >> > > > > > > +     /* osd_epoch_barrier (version 5) */
> > > >> > > > > > >       ceph_encode_32(&p, 0);
> > > >> > > > > > > -     /* oldest_flush_tid */
> > > >> > > > > > > +     /* oldest_flush_tid (version 6) */
> > > >> > > > > > >       ceph_encode_64(&p, arg->oldest_flush_tid);
> > > >> > > > > > >
> > > >> > > > > > > +     /* caller_uid/caller_gid (version 7) */
> > > >> > > > > > > +     ceph_encode_32(&p, (u32)-1);
> > > >> > > > > > > +     ceph_encode_32(&p, (u32)-1);
> > > >> > > > > >
> > > >> > > > > > A bit of self-review...
> > > >> > > > > >
> > > >> > > > > > Not sure if we want to set the above to something else -- maybe 0 or to
> > > >> > > > > > current's creds? That may not always make sense though (during e.g.
> > > >> > > > > > writeback).
> > > >> > > > > >
> > > >> > > >
> > > >> > > > Looking further, I'm not quite sure I understand why we send creds at
> > > >> > > > all in cap messages. Can you clarify where that matters?
> > > >> > > >
> > > >> > > > The way I look at it, caps are something like a more granular NFS
> > > >> > > > delegation or SMB oplock.
> > > >> > > >
> > > >> > > > In that light, a cap flush is just the client sending updated attrs for
> > > >> > > > the exclusive caps that it has already been granted. Is there a
> > > >> > > > situation where we would ever want to refuse that update?
> > > >> > >
> > > >> > > A chmod or chown can be done locally if you have excl caps and flushed
> > > >> > > back to the MDS via a caps message.  We need to verify the user has
> > > >> > > permission to make the change.
> > > >> > >
> > > >> >
> > > >> > My take is that once the MDS has delegated Ax to the client, then it's
> > > >> > effectively trusting the client to handle permissions enforcement
> > > >> > correctly. I don't see why we should second guess that.
> > > >> >
> > > >> > > > Note that nothing ever checks the return code for _do_cap_update in the
> > > >> > > > userland code. If the permissions check fails, then we'll end up
> > > >> > > > silently dropping the updated attrs on the floor.
> > > >> > >
> > > >> > > Yeah.  This was mainly for expediency... the protocol assumes that flushes
> > > >> > > don't fail.  Given that the client does its own permissions check, I
> > > >> > > think the way to improve this is to have it prevent the flush in the first
> > > >> > > place, so that it's only nefarious clients that are affected (and who
> > > >> > > cares if they get confused).  I don't think we have a particularly good
> > > >> > > way to tell the client it can't, say, sudo chown 0:0 a file, though.
> > > >> > >
> > > >> >
> > > >> > Sorry, I don't quite follow. How would we prevent the flush from a
> > > >> > nefarious client (which is not something we can really control)?
> > > >> >
> > > >> > In any case...ISTM that the permissions check in _do_cap_update ought to
> > > >> > be replaced by a cephx key check. IOW, what we really want to know is
> > > >> > whether the client is truly the one to which we delegated the caps. If
> > > >> > so, then we sort of have to trust that it's doing the right thing with
> > > >> > respect to permissions checking here.
> > > >>
> > > >> The capability can say "you are allowed to be uid 1000 or uid 1020." We
> > > >> want to delegate the EXCL caps to the client so that a create + chmod +
> > > >> chown + write can all happen efficiently, but we still need to ensure that
> > > >> the values they set are legal (a permitted uid/gid combo).
> > > >>
> > > >> A common example would be user workstations that are allowed access to
> > > >> /home/user and restricted via their mds caps to their uid/gid.  We need to
> > > >> prevent them from doing a 'sudo chown 0:0 foo'...
> > > >>
> > > >>
> > > >
> > > >
> > > > On what basis do you make such a decision though? For instance, NFS does
> > > > root-squashing which is (generally) a per-export+per-client thing.
> > > > It sounds like you're saying that ceph has different semantics here?
> > > >
> > > > (cc'ing Greg here)
> > > 
> > > As Sage says, we definitely avoid the root squash semantics. We
> > > discussed them last year and concluded they were an inappropriate
> > > match for Ceph's permission model.
> > > 
> > > >
> > > > Also, chown (at least under POSIX) is reserved for superuser only, and
> > > > now that I look, I think this check in MDSAuthCaps::is_capable may be
> > > > wrong:
> > > >
> > > >       // chown/chgrp
> > > >       if (mask & MAY_CHOWN) {
> > > >         if (new_uid != caller_uid ||   // you can't chown to someone else
> > > >             inode_uid != caller_uid) { // you can't chown from someone else
> > > >           continue;
> > > >         }
> > > >       }
> > > >
> > > > Shouldn't this just be a check for whether the caller_uid is 0 (or
> > > > whatever the correct check for the equivalent to the kernel's CAP_CHOWN
> > > > would be)?
> > 
> > Oops, I skipped over this part ^
> >  
> > > Without context, this does look a little weird — does it allow *any*
> > > change, given caller_uid needs to match both new and inode uid?
> > > Sort of the common case would be that the admin cap gets hit toward
> > > the beginning of the function and just allows it without ever reaching
> > > this point.
> > 
> > Yeah, the check is a bit weird.  It looks like
> > 
> > 1- A normal cap that specifies a uid can't ever change the uid.  This 
> > conditional could be simplified/clarified...
> > 
> > 2- If you have a pair of caps, like
> > 
> >   allow * uid=1, allow * uid=2
> > 
> > we still don't let you chown between uid 1 and 2.  Well, not as caller_uid 
> > 1 or 2 (which is fine), but
> > 
> > 3- Jeff is right, we don't allow root to chown between allowed uids.  
> > Like if you had
> > 
> >   allow * uid=0
> > 
> > shouldn't that let you chown anything?  I didn't really consider this 
> > case since most users would just do
> > 
> >   allow *
> > 
> > which can do anything (including chown).  But probably the 'allow * uid=0' 
> > case should be handled properly.
> > 
> > sage
> 
> It still seems to me like that should just be a check for superuser
> status. Something like:
> 
>       if (mask & MAY_CHOWN) {
>         // only root can chown
>         if (i->match.uid != 0 || caller_uid != 0) {
>           continue;
>         }
>       }
> 
> i.e. only allow chown if the capability has a uid of 0 and the
> caller_uid is also 0.
> 
> I don't think we want to ever grant an unprivileged user the ability to
> chown, do we?

Ah, yep.  Except that the Locker.cc caller needs to be fixed to only ask 
for MAY_CHOWN if the uid is changing.  Right now it's only passing 
MAY_WRITE which looks wrong too...

sage
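
The three positions in this exchange can be put side by side in a small sketch: the check currently quoted from MDSAuthCaps::is_capable, the root-only check Jeff proposes, and Sage's point that the caller should ask for MAY_CHOWN only when the uid is actually changing. This is an illustration in plain C, not the userland C++ code. The function names and the numeric values of `MAY_WRITE` and `MAY_CHOWN` are hypothetical, and keeping MAY_WRITE in the mask alongside MAY_CHOWN is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mask bits; the real values live in the ceph userland tree. */
#define MAY_WRITE 0x02u
#define MAY_CHOWN 0x10u

/*
 * The check as quoted above: the caller may "chown" only if the new owner
 * and the current owner are both the caller, so the uid can never change.
 */
static bool chown_ok_current(uint32_t caller_uid, uint32_t inode_uid,
			     uint32_t new_uid)
{
	return new_uid == caller_uid && inode_uid == caller_uid;
}

/*
 * The proposed check: allow chown only when the matching grant is for
 * uid 0 and the caller is uid 0.
 */
static bool chown_ok_proposed(uint32_t grant_uid, uint32_t caller_uid)
{
	return grant_uid == 0 && caller_uid == 0;
}

/*
 * The caller-side fix: request MAY_CHOWN only when the uid is changing.
 * Retaining MAY_WRITE here is an assumption of this sketch.
 */
static unsigned int required_mask(uint32_t inode_uid, uint32_t new_uid)
{
	unsigned int mask = MAY_WRITE;

	if (new_uid != inode_uid)
		mask |= MAY_CHOWN;
	return mask;
}
```

Under the current check, root holding a `uid=0` grant cannot move a file from uid 1000 to uid 0, because neither comparison passes. Under the proposed check that request succeeds, while an unprivileged caller is always refused.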

Thread overview: 30+ messages
2016-11-04 11:34 [RFC PATCH 00/10] ceph: fix long stalls during fsync and write_inode Jeff Layton
2016-11-04 11:34 ` [RFC PATCH 01/10] ceph: fix minor typo in unsafe_request_wait Jeff Layton
2016-11-04 11:34 ` [RFC PATCH 02/10] ceph: move xattr initialzation before the encoding past the ceph_mds_caps Jeff Layton
2016-11-04 11:34 ` [RFC PATCH 03/10] ceph: initialize i_version to 0 in new ceph inodes Jeff Layton
2016-11-04 11:34 ` [RFC PATCH 04/10] ceph: save off btime and change_attr when we get an InodeStat Jeff Layton
2016-11-04 11:34 ` [RFC PATCH 05/10] ceph: handle btime and change_attr updates in cap messages Jeff Layton
2016-11-04 11:34 ` [RFC PATCH 06/10] ceph: define new argument structure for send_cap_msg Jeff Layton
2016-11-04 11:34 ` [RFC PATCH 07/10] ceph: update cap message struct version to 9 Jeff Layton
2016-11-04 12:57   ` Jeff Layton
2016-11-07  8:43     ` Yan, Zheng
2016-11-07 11:21       ` Jeff Layton
2016-11-07 14:05         ` Sage Weil
2016-11-07 14:22           ` Jeff Layton
2016-11-07 14:36             ` Sage Weil
2016-11-07 18:39               ` Jeff Layton
2016-11-07 19:15                 ` Sage Weil
2016-11-07 19:53                 ` Gregory Farnum
2016-11-07 20:09                   ` Sage Weil
2016-11-07 21:16                     ` Jeff Layton
2016-11-04 11:34 ` [RFC PATCH 08/10] ceph: add sync parameter to send_cap_msg Jeff Layton
2016-11-07  8:32   ` Yan, Zheng
2016-11-07 10:51     ` Jeff Layton
2016-11-04 11:34 ` [RFC PATCH 09/10] ceph: plumb "sync" parameter into __send_cap Jeff Layton
2016-11-04 11:34 ` [RFC PATCH 10/10] ceph: turn on btime and change_attr support Jeff Layton
  -- strict thread matches above, loose matches on Subject: below --
2016-11-07 21:21 [RFC PATCH 07/10] ceph: update cap message struct version to 9 Sage Weil
2016-11-07 21:51 ` Jeff Layton
2016-11-07 23:15 ` Gregory Farnum
2016-11-07 23:21   ` Sage Weil
2016-11-11 12:45     ` Jeff Layton
2016-11-11 14:48       ` Sage Weil
