[Ocfs2-devel] [PATCH 0/8] ocfs2: fix ocfs2 direct io code patch to support sparse file and data ordering semantics

From: Joseph Qi <joseph.qi@huawei.com>
To: ocfs2-devel@oss.oracle.com
Subject: [Ocfs2-devel] [PATCH 0/8] ocfs2: fix ocfs2 direct io code patch to support sparse file and data ordering semantics
Date: Mon, 14 Dec 2015 18:36:23 +0800
Message-ID: <566E9BA7.5040609@huawei.com>
In-Reply-To: <566E542B.4060605@oracle.com>

Hi Ryan,

On 2015/12/14 13:31, Ryan Ding wrote:
> Hi Joseph,
> 
> On 12/10/2015 06:36 PM, Joseph Qi wrote:
>> Hi Ryan,
>>
>> On 2015/12/10 16:48, Ryan Ding wrote:
>>> Hi Joseph,
>>>
>>> Thanks for your comments, please see my reply:
>>>
>>> On 12/10/2015 03:54 PM, Joseph Qi wrote:
>>>> Hi Ryan,
>>>>
>>>> On 2015/10/12 14:34, Ryan Ding wrote:
>>>>> Hi Joseph,
>>>>>
>>>>> On 10/08/2015 02:13 PM, Joseph Qi wrote:
>>>>>> Hi Ryan,
>>>>>>
>>>>>> On 2015/10/8 11:12, Ryan Ding wrote:
>>>>>>> Hi Joseph,
>>>>>>>
>>>>>>> On 09/28/2015 06:20 PM, Joseph Qi wrote:
>>>>>>>> Hi Ryan,
>>>>>>>> I have gone through this patch set and done a simple performance test
>>>>>>>> using direct dd. It indeed brings a significant performance improvement.
>>>>>>>>              Before      After
>>>>>>>> bs=4K    1.4 MB/s    5.0 MB/s
>>>>>>>> bs=256k  40.5 MB/s   56.3 MB/s
>>>>>>>>
>>>>>>>> My questions are:
>>>>>>>> 1) Your solution still uses the orphan dir to keep the inode and allocation
>>>>>>>> consistent, am I right? From our tests, it is the most complicated part
>>>>>>>> and has many race cases to be taken into consideration. So I wonder if this
>>>>>>>> can be restructured.
>>>>>>> I have not come up with a better way to do this. I think the only reason direct io uses the orphan dir is to prevent space from being lost when the system crashes during an appending direct write. But maybe 'fsck -f' can do that job. Is it necessary to use the orphan dir?
>>>>>> The idea is taken from ext4, but since ocfs2 is a cluster filesystem,
>>>>>> it is much more complicated than in ext4.
>>>>>> And fsck can only be used offline, whereas the orphan dir allows
>>>>>> recovery online. So I don't think fsck can replace it in all cases.
>>>>>>
>>>>>>>> 2) Rather than using normal block direct io, you introduce a way to use
>>>>>>>> write begin/end from buffer io. IMO, if it is to behave like direct
>>>>>>>> io, it should be committed to disk by forcing a journal commit. But
>>>>>>>> journal commits consume much time. Why does it bring a performance
>>>>>>>> improvement instead?
>>>>>>> I use buffer io to write only the zero pages. The actual data payload is written as direct io. I think there is no need to force a commit, because direct means "Try to minimize cache effects of the I/O to and from this file."; it does not mean "write all data & metadata to disk before write returns".
>>>> I think we cannot mix zero pages with direct io here, which will lead
>>>> to direct io data being overwritten by zero pages.
>>>> For example, take an ocfs2 volume with block size 4K and cluster size 4K.
>>>> First I create a file of size 5K; it will be allocated 2
>>>> clusters (8K), with the last 3K not zeroed (no need at this time).
>>> I think the last 3K will be zeroed no matter whether you use direct io or buffer io to create the 5K file.
>>>> Then I seek to offset 9K and do a direct write of 1K, then go back to 4K and
>>>> do a direct write of 5K. Here we have to zero the allocated space to avoid
>>>> dirty data. But since direct write data goes to disk directly while the zero
>>>> pages depend on journal commit, the direct write data will be overwritten and
>>>> the file gets corrupted.
>>> do_blockdev_direct_IO() will zero the unwritten area within the block size (in this case, 6K~8K) when the get_block callback returns a map with the buffer_new flag set. This zero operation also uses direct io.
>>> So the buffer io zero operation in my design does not come into play at all in this case. It only zeroes the area beyond the block size but within the cluster size. For example, with block size 4KB and cluster size 1MB, a 4KB direct write will trigger zeroing through buffer pages of 1MB-4KB=1020KB.
>>> I think your question is whether these zero buffer pages will conflict with a later direct io write to the same area. In fact no conflict exists, because before a direct write all conflicting buffer pages are flushed to disk first (in __generic_file_write_iter()).
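
The arithmetic in Ryan's example can be sketched as follows. This is a minimal Python illustration, not ocfs2 code: the helper name `cluster_zero_ranges` and its parameters are hypothetical, and it assumes that the clusters touched by the write are newly allocated, so everything in them outside the written blocks still has to be zeroed.

```python
def cluster_zero_ranges(write_off, write_len, block_size, cluster_size):
    """Return (start, end) byte ranges inside the touched clusters that lie
    outside the blocks covered by the write.

    Hypothetical helper for illustration only; it models the arithmetic of
    the example above, not the ocfs2 implementation.
    """
    if write_len <= 0:
        raise ValueError("write_len must be positive")
    if cluster_size % block_size != 0:
        raise ValueError("cluster_size must be a multiple of block_size")

    write_end = write_off + write_len
    # Round the write outwards to block boundaries. Zeroing inside a
    # partially written block is the part Ryan attributes to the direct
    # io path itself, so it is not counted here.
    blk_start = (write_off // block_size) * block_size
    blk_end = -(-write_end // block_size) * block_size
    # Round outwards again to cluster boundaries.
    clu_start = (write_off // cluster_size) * cluster_size
    clu_end = -(-write_end // cluster_size) * cluster_size

    ranges = []
    if clu_start < blk_start:
        ranges.append((clu_start, blk_start))
    if blk_end < clu_end:
        ranges.append((blk_end, clu_end))
    return ranges


KB = 1024
ranges = cluster_zero_ranges(0, 4 * KB, block_size=4 * KB, cluster_size=1024 * KB)
zeroed_kb = sum(end - start for start, end in ranges) // KB
print(ranges, zeroed_kb)  # one range covering 1020 KB
```

With block size and cluster size both 4K, the same helper returns an empty list for the 1K write at offset 9K, which matches the statement that no buffer io zeroing happens in that case.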
>> How can it make sure the zero pages are flushed to disk first? In
>> ocfs2_direct_IO, it calls ocfs2_dio_get_block which uses write_begin
>> and write_end, and then __blockdev_direct_IO.
>> I've backported your patch set to kernel 3.0 and tested with vhd-util,
>> and the test fails. The test case is below.
>> 1) create a 1G dynamic vhd file; the actual size is 5K.
>> # vhd-util create -n test.vhd -s 1024
>> 2) resize it to 4G; the actual size becomes 11K
>> # vhd-util resize -n test.vhd -s 4096 -j test.log
>> 3) hexdump the data, say hexdump1
>> 4) umount to commit journal and mount again, and hexdump the data again,
>> say hexdump2, which is not equal to hexdump1.
>> I am not sure whether this is related to the kernel version, which
>> indeed has many differences due to refactoring.
> I have backported it to kernel 3.8 and ran the script below (I think it's the same as your test):
> 
>     mount /dev/dm-1 /mnt
>     pushd /mnt/
>     rm test.vhd -f
>     vhd-util create -n test.vhd -s 1024
>     vhd-util resize -n test.vhd -s 4096 -j test.log
>     hexdump test.vhd > ~/test.hex.1
>     popd
>     umount /mnt/
>     mount /dev/dm-1 /mnt/
>     hexdump /mnt/test.vhd > ~/test.hex.2
>     umount /mnt
> 
> Block size & cluster size are both 4K.
> It shows there is no difference between test.hex.1 and test.hex.2. I think this issue is related to a specific kernel version, so which version is your kernel? Please provide the backport patches if you wish :)
I am using kernel 3.0.93. But I think it has nothing to do with the kernel
version. Within one direct io, if you use buffers to zero first and then do
the direct write, you cannot guarantee the order. In other words, the direct
io data may go to disk first and the zero buffers afterwards. That's why I am
using blkdev_issue_zeroout to do this in my patches.
And I am using jbd2_journal_force_commit to get the metadata to disk at
the same time, which makes performance poorer than yours. It can be
removed if direct io's semantics do not require it.
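
To make the ordering concern concrete, here is a toy Python model (an editorial sketch, not ocfs2 or kernel code). The class `ToyDisk` and its methods are hypothetical names. The model assumes that buffered zeroing reaches disk only at a later flush, while direct writes and a synchronous zero-out (in the spirit of blkdev_issue_zeroout) reach disk immediately. Whether the late-flush ordering can actually occur with Ryan's patches is exactly the point under discussion in this thread.

```python
class ToyDisk:
    """Toy model of on-disk bytes plus deferred (page cache) writes."""

    def __init__(self, size):
        self.disk = bytearray(b"\xff" * size)  # 0xff stands for stale data
        self.dirty = []  # pending buffered writes as (offset, bytes)

    def buffered_zero(self, off, length):
        # Buffered zeroing only dirties the cache; nothing reaches disk yet.
        self.dirty.append((off, bytes(length)))

    def flush(self):
        # Writeback: every pending buffered write reaches disk now.
        for off, data in self.dirty:
            self.disk[off:off + len(data)] = data
        self.dirty.clear()

    def direct_zero(self, off, length):
        # Synchronous zeroing that bypasses the cache.
        self.disk[off:off + length] = bytes(length)

    def direct_write(self, off, data):
        # Direct io bypasses the cache and lands on disk immediately.
        self.disk[off:off + len(data)] = data


def late_flush():
    d = ToyDisk(8)
    d.buffered_zero(0, 8)
    d.direct_write(2, b"DATA")
    d.flush()  # zero pages hit disk after the data: data is lost
    return bytes(d.disk)


def early_flush():
    d = ToyDisk(8)
    d.buffered_zero(0, 8)
    d.flush()  # zero pages hit disk before the data: data survives
    d.direct_write(2, b"DATA")
    return bytes(d.disk)


def sync_zero():
    d = ToyDisk(8)
    d.direct_zero(0, 8)  # no deferred state, so ordering is fixed
    d.direct_write(2, b"DATA")
    return bytes(d.disk)


for scenario in (late_flush, early_flush, sync_zero):
    print(scenario.__name__, scenario())
```

In this model only `late_flush` loses the written bytes; the other two orderings keep them, which is the property both the flush-before-write argument and the synchronous zero-out approach rely on.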

> 
> Thanks,
> Ryan
>>
>> Thanks,
>> Joseph
>>
>>> BTW, there are a lot of test cases in ltp (Linux Test Project) that test operations like buffer write, direct write, lseek... (a mix of these operations). This patch set has passed all of them. :)
>>>>>> So this is protected by "UNWRITTEN" flag, right?
>>>>>>
>>>>>>>> 3) Do you have a test in case of lack of memory?
>>>>>>> I tested it in a system with 2GB memory. Is that enough?
>>>>>> What I mean is doing many direct io jobs when system free memory is
>>>>>> low.
>>>>> I understand what you mean, but did not find a better way to test it. If free memory is too low, the process cannot even be started. If free memory is sufficient, the test has no meaning.
>>>>> So I tried to collect the memory usage during io and compared it with buffer io. The result is:
>>>>> 1. start 100 dd processes doing 4KB direct writes:
>>>>> [root@hnode3 ~]# cat /proc/meminfo | grep -E "^Cached|^Dirty|^MemFree|^MemTotal|^Buffers|^Writeback:"
>>>>> MemTotal:        2809788 kB
>>>>> MemFree:           21824 kB
>>>>> Buffers:           55176 kB
>>>>> Cached:          2513968 kB
>>>>> Dirty:               412 kB
>>>>> Writeback:            36 kB
>>>>>
>>>>> 2. start 100 dd processes doing 4KB buffer writes:
>>>>> [root@hnode3 ~]# cat /proc/meminfo | grep -E "^Cached|^Dirty|^MemFree|^MemTotal|^Buffers|^Writeback:"
>>>>> MemTotal:        2809788 kB
>>>>> MemFree:           22476 kB
>>>>> Buffers:           15696 kB
>>>>> Cached:          2544892 kB
>>>>> Dirty:            320136 kB
>>>>> Writeback:        146404 kB
>>>>>
>>>>> You can see from the 'Dirty' and 'Writeback' fields that far less memory is used than with buffer io. So I think your concern no longer applies. :-)
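
The comparison can be put into numbers with a small Python sketch (editorial, under the assumption that Dirty plus Writeback is a fair proxy for memory still waiting to reach disk). The helper names `meminfo_kb` and `in_flight_kb` are hypothetical; the values are copied from the two /proc/meminfo captures above.

```python
import re


def meminfo_kb(text, field):
    """Return the value in kB of one /proc/meminfo field from captured text."""
    match = re.search(rf"^{re.escape(field)}:\s+(\d+) kB\s*$", text, re.MULTILINE)
    if match is None:
        raise KeyError(field)
    return int(match.group(1))


def in_flight_kb(text):
    # Memory still waiting to reach disk: dirty pages plus pages under writeback.
    return meminfo_kb(text, "Dirty") + meminfo_kb(text, "Writeback")


# Values copied from the direct write and buffer write captures.
DIRECT = """\
Dirty:               412 kB
Writeback:            36 kB
"""
BUFFERED = """\
Dirty:            320136 kB
Writeback:        146404 kB
"""

print(in_flight_kb(DIRECT), in_flight_kb(BUFFERED))  # 448 466540
```

That is roughly 0.4 MB in flight for the direct write run against roughly 456 MB for the buffer write run.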
>>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>> Thanks,
>>>>>> Joseph
>>>>>>
>>>>>>> Thanks,
>>>>>>> Ryan

Thread overview: 22+ messages
2015-09-11  8:19 [Ocfs2-devel] [PATCH 0/8] ocfs2: fix ocfs2 direct io code patch to support sparse file and data ordering semantics Ryan Ding
2015-09-11  8:19 ` [Ocfs2-devel] [PATCH 1/8] ocfs2: add ocfs2_write_type_t type to identify the caller of write Ryan Ding
2015-09-11  8:19 ` [Ocfs2-devel] [PATCH 2/8] ocfs2: use c_new to indicate newly allocated extents Ryan Ding
2015-09-11  8:19 ` [Ocfs2-devel] [PATCH 3/8] ocfs2: test target page before change it Ryan Ding
2015-09-11  8:19 ` [Ocfs2-devel] [PATCH 4/8] ocfs2: do not change i_size in write_end for direct io Ryan Ding
2015-09-11  8:19 ` [Ocfs2-devel] [PATCH 5/8] ocfs2: return the physical address in ocfs2_write_cluster Ryan Ding
2015-09-11  8:19 ` [Ocfs2-devel] [PATCH 6/8] ocfs2: record UNWRITTEN extents when populate write desc Ryan Ding
2015-09-11  8:19 ` [Ocfs2-devel] [PATCH 7/8] ocfs2: fix sparse file & data ordering issue in direct io Ryan Ding
2015-09-11  8:19 ` [Ocfs2-devel] [PATCH 8/8] ocfs2: code clean up for " Ryan Ding
2015-09-28 10:20 ` [Ocfs2-devel] [PATCH 0/8] ocfs2: fix ocfs2 direct io code patch to support sparse file and data ordering semantics Joseph Qi
2015-10-08  3:12   ` Ryan Ding
2015-10-08  6:13     ` Joseph Qi
2015-10-08  7:13       ` Ryan Ding
2015-10-12  6:34       ` Ryan Ding
2015-12-10  7:54         ` Joseph Qi
2015-12-10  8:48           ` Ryan Ding
2015-12-10 10:36             ` Joseph Qi
2015-12-14  5:31               ` Ryan Ding
2015-12-14 10:36                 ` Joseph Qi [this message]
2015-12-16  1:39                   ` Ryan Ding
2015-12-16  2:26                     ` Joseph Qi
2015-12-16  3:12                       ` Ryan Ding