From: Luis Henriques <lhenriques@suse.com>
To: "Yan\, Zheng" <ukernel@gmail.com>
Cc: "Yan\, Zheng" <zyan@redhat.com>, Sage Weil <sage@redhat.com>,
Ilya Dryomov <idryomov@gmail.com>,
ceph-devel <ceph-devel@vger.kernel.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Hendrik Peyerl <hpeyerl@plusline.net>
Subject: Re: [PATCH 2/2] ceph: quota: fix quota subdir mounts
Date: Mon, 11 Mar 2019 11:43:07 +0000 [thread overview]
Message-ID: <87h8c9fxh0.fsf@suse.com> (raw)
In-Reply-To: <CAAM7YAnDvg0y8qkdGjO-EEdC6s5dMeHHF-PgeAxPqN_LPSSrEA@mail.gmail.com> (Zheng Yan's message of "Mon, 11 Mar 2019 11:12:19 +0800")
"Yan, Zheng" <ukernel@gmail.com> writes:
> On Sat, Mar 9, 2019 at 12:30 AM Luis Henriques <lhenriques@suse.com> wrote:
>>
>> The CephFS kernel client does not enforce quotas set in a directory that isn't
>> visible from the mount point. For example, given the path '/dir1/dir2', if quotas
>> are set in 'dir1' and the filesystem is mounted with
>>
>> mount -t ceph <server>:<port>:/dir1/ /mnt
>>
>> then the client won't be able to access 'dir1' inode, even if 'dir2' belongs to
>> a quota realm that points to it.
>>
>> This patch fixes this issue by simply doing an MDS LOOKUPINO operation for
>> unknown inodes. Any inode reference obtained this way will be added to a list
>> in ceph_mds_client, and will only be released when the filesystem is umounted.
>>
>> Link: https://tracker.ceph.com/issues/38482
>> Reported-by: Hendrik Peyerl <hpeyerl@plusline.net>
>> Signed-off-by: Luis Henriques <lhenriques@suse.com>
>> ---
>> fs/ceph/mds_client.c | 14 +++++++
>> fs/ceph/mds_client.h | 2 +
>> fs/ceph/quota.c | 91 +++++++++++++++++++++++++++++++++++++++-----
>> fs/ceph/super.h | 2 +
>> 4 files changed, 99 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>> index 163fc74bf221..72c5ce5e4209 100644
>> --- a/fs/ceph/mds_client.c
>> +++ b/fs/ceph/mds_client.c
>> @@ -3656,6 +3656,8 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
>> mdsc->max_sessions = 0;
>> mdsc->stopping = 0;
>> atomic64_set(&mdsc->quotarealms_count, 0);
>> + INIT_LIST_HEAD(&mdsc->quotarealms_inodes_list);
>> + spin_lock_init(&mdsc->quotarealms_inodes_lock);
>> mdsc->last_snap_seq = 0;
>> init_rwsem(&mdsc->snap_rwsem);
>> mdsc->snap_realms = RB_ROOT;
>> @@ -3726,9 +3728,21 @@ static void wait_requests(struct ceph_mds_client *mdsc)
>> */
>> void ceph_mdsc_pre_umount(struct ceph_mds_client *mdsc)
>> {
>> + struct ceph_inode_info *ci;
>> +
>> dout("pre_umount\n");
>> mdsc->stopping = 1;
>>
>> + spin_lock(&mdsc->quotarealms_inodes_lock);
>> + while(!list_empty(&mdsc->quotarealms_inodes_list)) {
>> + ci = list_first_entry(&mdsc->quotarealms_inodes_list,
>> + struct ceph_inode_info,
>> + i_quotarealms_inode_item);
>> + list_del(&ci->i_quotarealms_inode_item);
>> + iput(&ci->vfs_inode);
>
> iput while holding spinlock is not good
Yep, that's definitely not a good idea. And maybe this loop could even
race with a lookup currently in progress and a new inode could be added
to the list after the cleanup. I didn't verified this race could really
occur, because an easy fix is to simply move this loop into
ceph_mdsc_destroy() where we don't really need the spinlock anymore.
<snip>
>> +restart:
>> down_read(&mdsc->snap_rwsem);
>> - old_realm = get_quota_realm(mdsc, old);
>> - new_realm = get_quota_realm(mdsc, new);
>> + old_realm = get_quota_realm(mdsc, old, true);
>> + /* This needs to be atomic, so we need to hold snap_rwsem */
>
> I don't understand this comment. get_quota_realm() unlock snap_rwsem
> no matter the 'retry' parameter is true or not/
Right, maybe the parameter name and the comment are a bit misleading.
get_quota_realm may need to do an inode lookup and then retry to get the
quota realm. If this lookup is required, it has to drop the rwsem.
However, in ceph_quota_is_same_realm we need to lookup 2 quota realms
"atomically", i.e. with the rwsem held. If get_quota_realm needs to
drop it, it will do the MDS inode lookup anyway but instead of retrying
to get the quota realm it will return -EAGAIN (because 'retry' will be
'false'). This allows for ceph_quota_is_same_realm to restart the
operation itself, and retry to get both quota realms without
get_quota_realm dropping the rwsem.
Does it make sense? I agree the design isn't great :-/
I tried to describe this behavior in get_quota_realm comment, but it's
probably not good enough.
>> + new_realm = get_quota_realm(mdsc, new, false);
>> + if (PTR_ERR(new_realm) == -EAGAIN) {
>> + up_read(&mdsc->snap_rwsem);
>> + if (old_realm)
>> + ceph_put_snap_realm(mdsc, old_realm);
>> + goto restart;
>> + }
>> is_same = (old_realm == new_realm);
>> up_read(&mdsc->snap_rwsem);
>>
>> @@ -166,6 +230,7 @@ static bool check_quota_exceeded(struct inode *inode, enum quota_check_op op,
>> return false;
>>
>> down_read(&mdsc->snap_rwsem);
>> +restart:
>> realm = ceph_inode(inode)->i_snap_realm;
>> if (realm)
>> ceph_get_snap_realm(mdsc, realm);
>> @@ -176,9 +241,15 @@ static bool check_quota_exceeded(struct inode *inode, enum quota_check_op op,
>> spin_lock(&realm->inodes_with_caps_lock);
>> in = realm->inode ? igrab(realm->inode) : NULL;
>> spin_unlock(&realm->inodes_with_caps_lock);
>> - if (!in)
>> - break;
>> -
>> + if (!in) {
>
> maybe we should distinguish ‘realm->inode is null' from 'igrab fails'
Sure, that makes sense.
Cheers,
--
Luis
>
>> + up_read(&mdsc->snap_rwsem);
>> + in = lookup_quotarealm_inode(mdsc, inode->i_sb, realm);
>> + down_read(&mdsc->snap_rwsem);
>> + if (IS_ERR(in))
>> + break;
>> + ceph_put_snap_realm(mdsc, realm);
>> + goto restart;
>> + }
>> ci = ceph_inode(in);
>> spin_lock(&ci->i_ceph_lock);
>> if (op == QUOTA_CHECK_MAX_FILES_OP) {
>> @@ -314,7 +385,7 @@ bool ceph_quota_update_statfs(struct ceph_fs_client *fsc, struct kstatfs *buf)
>> bool is_updated = false;
>>
>> down_read(&mdsc->snap_rwsem);
>> - realm = get_quota_realm(mdsc, d_inode(fsc->sb->s_root));
>> + realm = get_quota_realm(mdsc, d_inode(fsc->sb->s_root), true);
>> up_read(&mdsc->snap_rwsem);
>> if (!realm)
>> return false;
>> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
>> index ce51e98b08ec..cc7766aeb73b 100644
>> --- a/fs/ceph/super.h
>> +++ b/fs/ceph/super.h
>> @@ -375,6 +375,8 @@ struct ceph_inode_info {
>> struct list_head i_snap_realm_item;
>> struct list_head i_snap_flush_item;
>>
>> + struct list_head i_quotarealms_inode_item;
>> +
>> struct work_struct i_wb_work; /* writeback work */
>> struct work_struct i_pg_inv_work; /* page invalidation work */
>>
>
prev parent reply other threads:[~2019-03-11 11:43 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-03-08 16:27 [PATCH 0/2] fix quota subdir mounts Luis Henriques
2019-03-08 16:27 ` [PATCH 1/2] ceph: factor out ceph_lookup_inode() Luis Henriques
2019-03-08 16:27 ` [PATCH 2/2] ceph: quota: fix quota subdir mounts Luis Henriques
2019-03-11 3:12 ` Yan, Zheng
2019-03-11 11:43 ` Luis Henriques [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87h8c9fxh0.fsf@suse.com \
--to=lhenriques@suse.com \
--cc=ceph-devel@vger.kernel.org \
--cc=hpeyerl@plusline.net \
--cc=idryomov@gmail.com \
--cc=linux-kernel@vger.kernel.org \
--cc=sage@redhat.com \
--cc=ukernel@gmail.com \
--cc=zyan@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox