public inbox for fstests@vger.kernel.org
From: Luis Henriques <lhenriques@suse.com>
To: "Yan, Zheng" <zyan@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>,
	Nikolay Borisov <nborisov@suse.com>,
	fstests@vger.kernel.org, ceph-devel@vger.kernel.org
Subject: Re: [RFC PATCH 2/2] ceph: test basic ceph.quota.max_bytes quota
Date: Fri, 12 Apr 2019 12:04:28 +0100
Message-ID: <87imvjpjr7.fsf@suse.com>
In-Reply-To: <740207e9-b4ef-e4b4-4097-9ece2ac189a7@redhat.com> (Zheng Yan's message of "Fri, 12 Apr 2019 11:37:55 +0800")

"Yan, Zheng" <zyan@redhat.com> writes:

> On 4/12/19 9:15 AM, Dave Chinner wrote:
>> On Thu, Apr 04, 2019 at 11:18:22AM +0100, Luis Henriques wrote:
>>> Dave Chinner <david@fromorbit.com> writes:
>>>
>>>> On Wed, Apr 03, 2019 at 02:19:11PM +0100, Luis Henriques wrote:
>>>>> Nikolay Borisov <nborisov@suse.com> writes:
>>>>>> On 3.04.19 at 12:45, Luis Henriques wrote:
>>>>>>> Dave Chinner <david@fromorbit.com> writes:
>>>>>>>> Makes no sense to me. xfs_io does a write() loop internally with
>>>>>>>> this pwrite command of 4kB writes - the default buffer size. If you
>>>>>>>> want xfs_io to loop doing 1MB sized pwrite() calls, then all you
>>>>>>>> need is this:
>>>>>>>>
>>>>>>>> 	$XFS_IO_PROG -f -c "pwrite -w -B 1m 0 ${size}m" $file | _filter_xfs_io
>>>>>>>>
>>>>>>>
>>>>>>> Thank you for your review, Dave.  I'll make sure the next revision of
>>>>>>> these tests will include all your comments implemented... except for
>>>>>>> this one.
>>>>>>>
>>>>>>> The reason I'm using a loop for writing a file is due to the nature of
>>>>>>> the (very!) loose definition of quotas in CephFS.  Basically, clients
>>>>>>> will likely write some amount of data over the configured limit because
>>>>>>> the servers they are communicating with to write the data (the OSDs)
>>>>>>> have no idea about the concept of quotas (or files even); the filesystem
>>>>>>> view in the cluster is managed at a different level, with the help of
>>>>>>> the MDS and the client itself.
>>>>>>>
>>>>>>> So, the loop in this function is simply to allow the metadata associated
>>>>>>> with the file to be updated while we're writing the file.  If I use a
>>>>>>
>>>>>> But the metadata will be modified while writing the file even with a
>>>>>> single invocation of xfs_io.
>>>>>
>>>>> No, that's not true.  It would be too expensive to keep the metadata
>>>>> server updated while writing to a file.  So, making sure there's
>>>>> actually an open/close to the file (plus the fsync in pwrite) helps
>>>>> make sure the metadata is flushed to the MDS.
>>>>
>>>> /me sighs.
>>>>
>>>> So you want:
>>>>
>>>> 	loop until ${size}MB written:
>>>> 		write 1MB
>>>> 		fsync
>>>> 		  -> flush data to server
>>>> 		  -> flush metadata to server
>>>>
>>>> i.e. this one liner:
>>>>
>>>> xfs_io -f -c "pwrite -D -B 1m 0 ${size}m" /path/to/file
>>>
>>> Unfortunately, that doesn't do what I want either :-/
>>> (and I guess you meant '-b 1m', not '-B 1m', right?)
>>
>> Yes. But I definitely did mean "-D" so that RWF_DSYNC was used with
>> each 1MB write.
>>
>>> [ Zheng: please feel free to correct me if I'm saying something really
>>>    stupid below. ]
>>>
>>> So, one of the key things in my loop is the open/close operations.  When
>>> a file is closed in cephfs the capabilities (that's ceph jargon for what
>>> sort of operations a client is allowed to perform on an inode) will
>>> likely be released and that's when the metadata server will get the
>>> updated file size.  Before that, the client is allowed to modify the
>>> file size if it has acquired the capabilities for doing so.
>>
>> So you are saying that O_DSYNC writes on ceph do not force file
>> size metadata changes to the metadata server to be made stable?
>>
>>> OTOH, a pwrite operation will eventually get the -EDQUOT even with the
>>> one-liner above because the client itself will realize it has exceeded a
>>> certain threshold set by the MDS and will eventually update the server
>>> with the new file size.
>>
>> Sure, but if the client crashes without having sent the updated file
>> size to the server as part of an extending O_DSYNC write, then how
>> is it recovered when the client reconnects to the server and
>> accesses the file again?
>
>
> For a DSYNC write, the client has already written the data to the object
> store. If the client crashes, the MDS will set the file to the 'recovering'
> state and probe the file size by checking the object store. Access to the
> file is blocked during recovery.

Thank you for chiming in, Zheng.

>
> Regards
> Yan, Zheng
>
>>
>>> However that won't happen at a deterministic
>>> file size.  For example, if quota is 10m and we're writing 20m, we may
>>> get the error after writing 15m.
>>>
>>> Does this make sense?
>>
>> Only makes sense to me if O_DSYNC is ignored by the ceph client...
>>
>>> So, I guess I *could* use your one-liner in the test, but I would need
>>> to slightly change the test logic -- I would need to write enough data
>>> to the file to make sure I would get the -EDQUOT but I wouldn't be able
>>> to actually check the file size as it will not be constant.
>>>
>>>> Fundamentally, if you find yourself writing a loop around xfs_io to
>>>> break up a sequential IO stream into individual chunks, then you are
>>>> most likely doing something xfs_io can already do. And if xfs_io
>>>> cannot do it, then the right thing to do is to modify xfs_io to be
>>>> able to do it and then use xfs_io....
>>>
>>> Got it!  But I guess it wouldn't make sense to change xfs_io for this
>>> specific scenario where I want several open-write-close cycles.
>>
>> That's how individual NFS client writes appear to the filesystem under
>> the NFS server. I've previously considered adding an option in
>> xfs_io to mimic this open-write-close loop per buffer so it's easy
>> to exercise such behaviours, but never actually required it to
>> reproduce the problems I was chasing. So it's definitely something
>> that xfs_io /could/ do if necessary.

Ok, since there seem to be other use cases for this, I agree it may be
worth adding that option then.  I'll see if I can come up with a patch
for that.
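
For reference, here is a minimal sketch of the open-write-close cycle
discussed above, as it could be done today with a loop around xfs_io.
This is not the code from the patch: the function name write_in_chunks
is hypothetical, and the dd fallback is only an assumption so that the
sketch also runs on systems where xfs_io is not installed.

```shell
# Hypothetical helper (not from the patch under review): write a file
# in 1MB chunks, one open/write/sync/close cycle per chunk.
# Usage: write_in_chunks <file> <size in MB>
write_in_chunks()
{
	file=$1
	size_mb=$2
	i=0

	while [ "$i" -lt "$size_mb" ]; do
		if command -v xfs_io >/dev/null 2>&1; then
			# -f creates the file if it does not exist;
			# pwrite -w calls fdatasync once the write is done.
			# Each invocation opens and closes the file.
			xfs_io -f -c "pwrite -w ${i}m 1m" "$file" \
				>/dev/null || return 1
		else
			# Assumed fallback: write chunk i in place
			# (notrunc) and fsync the file before closing it.
			dd if=/dev/zero of="$file" bs=1M count=1 \
				seek="$i" conv=notrunc,fsync \
				2>/dev/null || return 1
		fi
		i=$((i + 1))
	done
}
```

Calling "write_in_chunks /path/to/file 20" would then issue twenty
separate open-write-close cycles, which is what gives the MDS a chance
to see the updated file size between chunks.  The single-invocation
alternative suggested above, with the buffer size flag corrected, is
xfs_io -f -c "pwrite -D -b 1m 0 ${size}m" /path/to/file; it keeps the
file open for the whole write.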

Cheers,
-- 
Luis

