Re: [PATCH 2/2] ceph: quota: fix quota subdir mounts

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Luis Henriques <lhenriques@suse.com>
To: "Yan\, Zheng" <ukernel@gmail.com>
Cc: "Yan\, Zheng" <zyan@redhat.com>, Sage Weil <sage@redhat.com>,
	Ilya Dryomov <idryomov@gmail.com>,
	ceph-devel <ceph-devel@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Hendrik Peyerl <hpeyerl@plusline.net>
Subject: Re: [PATCH 2/2] ceph: quota: fix quota subdir mounts
Date: Mon, 11 Mar 2019 11:43:07 +0000	[thread overview]
Message-ID: <87h8c9fxh0.fsf@suse.com> (raw)
In-Reply-To: <CAAM7YAnDvg0y8qkdGjO-EEdC6s5dMeHHF-PgeAxPqN_LPSSrEA@mail.gmail.com> (Zheng Yan's message of "Mon, 11 Mar 2019 11:12:19 +0800")

"Yan, Zheng" <ukernel@gmail.com> writes:

> On Sat, Mar 9, 2019 at 12:30 AM Luis Henriques <lhenriques@suse.com> wrote:
>>
>> The CephFS kernel client does not enforce quotas set in a directory that isn't
>> visible from the mount point.  For example, given the path '/dir1/dir2', if quotas
>> are set in 'dir1' and the filesystem is mounted with
>>
>>   mount -t ceph <server>:<port>:/dir1/ /mnt
>>
>> then the client won't be able to access 'dir1' inode, even if 'dir2' belongs to
>> a quota realm that points to it.
>>
>> This patch fixes this issue by simply doing an MDS LOOKUPINO operation for
>> unknown inodes.  Any inode reference obtained this way will be added to a list
>> in ceph_mds_client, and will only be released when the filesystem is umounted.
>>
>> Link: https://tracker.ceph.com/issues/38482
>> Reported-by: Hendrik Peyerl <hpeyerl@plusline.net>
>> Signed-off-by: Luis Henriques <lhenriques@suse.com>
>> ---
>>  fs/ceph/mds_client.c | 14 +++++++
>>  fs/ceph/mds_client.h |  2 +
>>  fs/ceph/quota.c      | 91 +++++++++++++++++++++++++++++++++++++++-----
>>  fs/ceph/super.h      |  2 +
>>  4 files changed, 99 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>> index 163fc74bf221..72c5ce5e4209 100644
>> --- a/fs/ceph/mds_client.c
>> +++ b/fs/ceph/mds_client.c
>> @@ -3656,6 +3656,8 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
>>         mdsc->max_sessions = 0;
>>         mdsc->stopping = 0;
>>         atomic64_set(&mdsc->quotarealms_count, 0);
>> +       INIT_LIST_HEAD(&mdsc->quotarealms_inodes_list);
>> +       spin_lock_init(&mdsc->quotarealms_inodes_lock);
>>         mdsc->last_snap_seq = 0;
>>         init_rwsem(&mdsc->snap_rwsem);
>>         mdsc->snap_realms = RB_ROOT;
>> @@ -3726,9 +3728,21 @@ static void wait_requests(struct ceph_mds_client *mdsc)
>>   */
>>  void ceph_mdsc_pre_umount(struct ceph_mds_client *mdsc)
>>  {
>> +       struct ceph_inode_info *ci;
>> +
>>         dout("pre_umount\n");
>>         mdsc->stopping = 1;
>>
>> +       spin_lock(&mdsc->quotarealms_inodes_lock);
>> +       while(!list_empty(&mdsc->quotarealms_inodes_list)) {
>> +               ci = list_first_entry(&mdsc->quotarealms_inodes_list,
>> +                                     struct ceph_inode_info,
>> +                                     i_quotarealms_inode_item);
>> +               list_del(&ci->i_quotarealms_inode_item);
>> +               iput(&ci->vfs_inode);
>
> iput while holding spinlock is not good

Yep, that's definitely not a good idea.  And maybe this loop could even
race with a lookup currently in progress and a new inode could be added
to the list after the cleanup.  I didn't verified this race could really
occur, because an easy fix is to simply move this loop into
ceph_mdsc_destroy() where we don't really need the spinlock anymore.

<snip>

>> +restart:
>>         down_read(&mdsc->snap_rwsem);
>> -       old_realm = get_quota_realm(mdsc, old);
>> -       new_realm = get_quota_realm(mdsc, new);
>> +       old_realm = get_quota_realm(mdsc, old, true);
>> +       /* This needs to be atomic, so we need to hold snap_rwsem */
>
> I don't understand this comment. get_quota_realm() unlock snap_rwsem
> no matter the 'retry' parameter is true or not/

Right, maybe the parameter name and the comment are a bit misleading.

get_quota_realm may need to do an inode lookup and then retry to get the
quota realm.  If this lookup is required, it has to drop the rwsem.

However, in ceph_quota_is_same_realm we need to lookup 2 quota realms
"atomically", i.e. with the rwsem held.  If get_quota_realm needs to
drop it, it will do the MDS inode lookup anyway but instead of retrying
to get the quota realm it will return -EAGAIN (because 'retry' will be
'false').  This allows for ceph_quota_is_same_realm to restart the
operation itself, and retry to get both quota realms without
get_quota_realm dropping the rwsem.

Does it make sense?  I agree the design isn't great :-/
I tried to describe this behavior in get_quota_realm comment, but it's
probably not good enough.

>> +       new_realm = get_quota_realm(mdsc, new, false);
>> +       if (PTR_ERR(new_realm) == -EAGAIN) {
>> +               up_read(&mdsc->snap_rwsem);
>> +               if (old_realm)
>> +                       ceph_put_snap_realm(mdsc, old_realm);
>> +               goto restart;
>> +       }
>>         is_same = (old_realm == new_realm);
>>         up_read(&mdsc->snap_rwsem);
>>
>> @@ -166,6 +230,7 @@ static bool check_quota_exceeded(struct inode *inode, enum quota_check_op op,
>>                 return false;
>>
>>         down_read(&mdsc->snap_rwsem);
>> +restart:
>>         realm = ceph_inode(inode)->i_snap_realm;
>>         if (realm)
>>                 ceph_get_snap_realm(mdsc, realm);
>> @@ -176,9 +241,15 @@ static bool check_quota_exceeded(struct inode *inode, enum quota_check_op op,
>>                 spin_lock(&realm->inodes_with_caps_lock);
>>                 in = realm->inode ? igrab(realm->inode) : NULL;
>>                 spin_unlock(&realm->inodes_with_caps_lock);
>> -               if (!in)
>> -                       break;
>> -
>> +               if (!in) {
>
> maybe we should  distinguish ‘realm->inode is null' from 'igrab fails'

Sure, that makes sense.

Cheers,
-- 
Luis


>
>> +                       up_read(&mdsc->snap_rwsem);
>> +                       in = lookup_quotarealm_inode(mdsc, inode->i_sb, realm);
>> +                       down_read(&mdsc->snap_rwsem);
>> +                       if (IS_ERR(in))
>> +                               break;
>> +                       ceph_put_snap_realm(mdsc, realm);
>> +                       goto restart;
>> +               }
>>                 ci = ceph_inode(in);
>>                 spin_lock(&ci->i_ceph_lock);
>>                 if (op == QUOTA_CHECK_MAX_FILES_OP) {
>> @@ -314,7 +385,7 @@ bool ceph_quota_update_statfs(struct ceph_fs_client *fsc, struct kstatfs *buf)
>>         bool is_updated = false;
>>
>>         down_read(&mdsc->snap_rwsem);
>> -       realm = get_quota_realm(mdsc, d_inode(fsc->sb->s_root));
>> +       realm = get_quota_realm(mdsc, d_inode(fsc->sb->s_root), true);
>>         up_read(&mdsc->snap_rwsem);
>>         if (!realm)
>>                 return false;
>> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
>> index ce51e98b08ec..cc7766aeb73b 100644
>> --- a/fs/ceph/super.h
>> +++ b/fs/ceph/super.h
>> @@ -375,6 +375,8 @@ struct ceph_inode_info {
>>         struct list_head i_snap_realm_item;
>>         struct list_head i_snap_flush_item;
>>
>> +       struct list_head i_quotarealms_inode_item;
>> +
>>         struct work_struct i_wb_work;  /* writeback work */
>>         struct work_struct i_pg_inv_work;  /* page invalidation work */
>>
>

     prev parent reply	other threads:[~2019-03-11 11:43 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-08 16:27 [PATCH 0/2] fix quota subdir mounts Luis Henriques
2019-03-08 16:27 ` [PATCH 1/2] ceph: factor out ceph_lookup_inode() Luis Henriques
2019-03-08 16:27 ` [PATCH 2/2] ceph: quota: fix quota subdir mounts Luis Henriques
2019-03-11  3:12   ` Yan, Zheng
2019-03-11 11:43     ` Luis Henriques [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87h8c9fxh0.fsf@suse.com \
    --to=lhenriques@suse.com \
    --cc=ceph-devel@vger.kernel.org \
    --cc=hpeyerl@plusline.net \
    --cc=idryomov@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sage@redhat.com \
    --cc=ukernel@gmail.com \
    --cc=zyan@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox