Subject: Re: [RFC PATCH 2/2] ceph: test basic ceph.quota.max_bytes quota
References: <20190402103428.21435-1-lhenriques@suse.com> <20190402103428.21435-3-lhenriques@suse.com> <20190402210931.GV23020@dastard> <87d0m3e81f.fsf@suse.com> <874l7fdy5s.fsf@suse.com> <20190403214708.GA26298@dastard> <87tvfecbv5.fsf@suse.com> <20190412011559.GE1695@dread.disaster.area>
From: "Yan, Zheng"
Message-ID: <740207e9-b4ef-e4b4-4097-9ece2ac189a7@redhat.com>
Date: Fri, 12 Apr 2019 11:37:55 +0800
In-Reply-To: <20190412011559.GE1695@dread.disaster.area>
To: Dave Chinner, Luis Henriques
Cc: Nikolay Borisov, fstests@vger.kernel.org, ceph-devel@vger.kernel.org

On 4/12/19 9:15 AM, Dave Chinner wrote:
> On Thu, Apr 04, 2019 at 11:18:22AM +0100, Luis Henriques wrote:
>> Dave Chinner writes:
>>
>>> On Wed, Apr 03, 2019 at 02:19:11PM +0100, Luis Henriques wrote:
>>>> Nikolay Borisov writes:
>>>>> On 3.04.19 г. 12:45 ч., Luis Henriques wrote:
>>>>>> Dave Chinner writes:
>>>>>>> Makes no sense to me. xfs_io does a write() loop internally with
>>>>>>> this pwrite command of 4kB writes - the default buffer size. If you
>>>>>>> want xfs_io to loop doing 1MB sized pwrite() calls, then all you
>>>>>>> need is this:
>>>>>>>
>>>>>>>     $XFS_IO_PROG -f -c "pwrite -w -B 1m 0 ${size}m" $file | _filter_xfs_io
>>>>>>>
>>>>>>
>>>>>> Thank you for your review, Dave. I'll make sure the next revision of
>>>>>> these tests will include all your comments implemented... except for
>>>>>> this one.
>>>>>>
>>>>>> The reason I'm using a loop for writing a file is due to the nature of
>>>>>> the (very!) loose definition of quotas in CephFS. Basically, clients
>>>>>> will likely write some amount of data over the configured limit because
>>>>>> the servers they are communicating with to write the data (the OSDs)
>>>>>> have no idea about the concept of quotas (or files even); the filesystem
>>>>>> view in the cluster is managed at a different level, with the help of
>>>>>> the MDS and the client itself.
>>>>>>
>>>>>> So, the loop in this function is simply to allow the metadata associated
>>>>>> with the file to be updated while we're writing the file. If I use a
>>>>>
>>>>> But the metadata will be modified while writing the file even with a
>>>>> single invocation of xfs_io.
>>>>
>>>> No, that's not true. It would be too expensive to keep the metadata
>>>> server updated while writing to a file. So, making sure there's
>>>> actually an open/close to the file (plus the fsync in pwrite) helps
>>>> making sure the metadata is flushed into the MDS.
>>>
>>> /me sighs.
>>>
>>> So you want:
>>>
>>> loop until ${size}MB written:
>>>     write 1MB
>>>     fsync
>>>       -> flush data to server
>>>       -> flush metadata to server
>>>
>>> i.e. this one liner:
>>>
>>> xfs_io -f -c "pwrite -D -B 1m 0 ${size}m" /path/to/file
>>
>> Unfortunately, that doesn't do what I want either :-/
>> (and I guess you meant '-b 1m', not '-B 1m', right?)
>
> Yes. But I definitely did mean "-D" so that RWF_DSYNC was used with
> each 1MB write.
>
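
For anyone following along, here is a rough sketch of the two write
patterns being compared; the file path, size, and helper usage below are
illustrative only and not taken from the actual patch:

    # Pattern used by the test under review: one xfs_io run per 1MB
    # chunk, so every chunk gets its own open/close (and therefore a
    # chance for the client to release caps and report the new file
    # size to the MDS).
    file=/mnt/cephfs/quota_file        # illustrative path
    size=20                            # MB to write, illustrative
    for ((i = 0; i < size; i++)); do
            $XFS_IO_PROG -f -c "pwrite -w ${i}m 1m" $file >/dev/null
    done

    # Dave's suggestion: a single invocation, one open/close, with each
    # 1MB buffer written under RWF_DSYNC.
    $XFS_IO_PROG -f -c "pwrite -D -b 1m 0 ${size}m" $file >/dev/null
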
>> [ Zheng: please feel free to correct me if I'm saying something really
>> stupid below. ]
>>
>> So, one of the key things in my loop is the open/close operations. When
>> a file is closed in cephfs the capabilities (that's ceph jargon for what
>> sort of operations a client is allowed to perform on an inode) will
>> likely be released and that's when the metadata server will get the
>> updated file size. Before that, the client is allowed to modify the
>> file size if it has acquired the capabilities for doing so.
>
> So you are saying that O_DSYNC writes on ceph do not force file
> size metadata changes to the metadata server to be made stable?
>
>> OTOH, a pwrite operation will eventually get the -EDQUOT even with the
>> one-liner above because the client itself will realize it has exceeded a
>> certain threshold set by the MDS and will eventually update the server
>> with the new file size.
>
> Sure, but if the client crashes without having sent the updated file
> size to the server as part of an extending O_DSYNC write, then how
> is it recovered when the client reconnects to the server and
> accesses the file again?

For a DSYNC write, the client has already written the data to the object
store. If the client crashes, the MDS puts the file into a 'recovering'
state and probes the file size by checking the object store. Access to
the file is blocked during recovery.

Regards
Yan, Zheng

>
>> However that won't happen at a deterministic
>> file size. For example, if quota is 10m and we're writing 20m, we may
>> get the error after writing 15m.
>>
>> Does this make sense?
>
> Only makes sense to me if O_DSYNC is ignored by the ceph client...
>
>> So, I guess I *could* use your one-liner in the test, but I would need
>> to slightly change the test logic -- I would need to write enough data
>> to the file to make sure I would get the -EDQUOT but I wouldn't be able
>> to actually check the file size as it will not be constant.
>>
>>> Fundamentally, if you find yourself writing a loop around xfs_io to
>>> break up a sequential IO stream into individual chunks, then you are
>>> most likely doing something xfs_io can already do. And if xfs_io
>>> cannot do it, then the right thing to do is to modify xfs_io to be
>>> able to do it and then use xfs_io....
>>
>> Got it! But I guess it wouldn't make sense to change xfs_io for this
>> specific scenario where I want several open-write-close cycles.
>
> That's how individual NFS client writes appear to the filesystem under
> the NFS server. I've previously considered adding an option in
> xfs_io to mimic this open-write-close loop per buffer so it's easy
> to exercise such behaviours, but never actually required it to
> reproduce the problems I was chasing. So it's definitely something
> that xfs_io /could/ do if necessary.
>
> Cheers,
>
> Dave.
>
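
To make the test-logic change mentioned above concrete (checking only for
-EDQUOT instead of a fixed final file size), here is a rough sketch; the
quota size, file path, and error matching are illustrative only and not
taken from the actual patch:

    # Quota on the test directory is assumed to be 10MB; write well past
    # it in a single xfs_io invocation and only check that the write
    # eventually fails with EDQUOT, since the exact file size at failure
    # is not deterministic on CephFS.
    limit=10                              # MB, illustrative
    file=$SCRATCH_MNT/quota_file          # illustrative path
    $XFS_IO_PROG -f -c "pwrite -D -b 1m 0 $((limit * 2))m" $file 2>&1 | \
            grep -q "Disk quota exceeded" || echo "write did not hit EDQUOT"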