From: Boaz Harrosh <bharrosh@panasas.com>
To: Peng Tao <bergwolf@gmail.com>
Cc: linuxnfs <linux-nfs@vger.kernel.org>, Benny Halevy <bhalevy@tonian.com>
Subject: Re: pnfs LD partial sector write
Date: Thu, 26 Jul 2012 17:12:47 +0300 [thread overview]
Message-ID: <5011505F.1020508@panasas.com> (raw)
In-Reply-To: <CA+a=Yy7b54qE3v8Yjq6-z3zjqrPH3Vt1madnVZk2aysYcOdbAA@mail.gmail.com>
On 07/26/2012 12:12 PM, Peng Tao wrote:
> On Thu, Jul 26, 2012 at 3:47 PM, Boaz Harrosh <bharrosh@panasas.com> wrote:
>> On 07/26/2012 05:43 AM, Peng Tao wrote:
>>
>>> Another thing is, this further complicates direct writes, where I
>>> cannot use pagecache to ensure proper locking for concurrent writers
>>> in the same BLOCK, and sector-aligned partial BLOCK DIO writes need to
>>> be serialized internally. IOW, the same code cannot be reused by DIO
>>> writes. sigh...
>>>
>>
>>
>> One last thing. Applications who use direct IO know to allocate
>> and issue sector aligned requests both at offset and length.
>> That's a Kernel requirement. It is not for NFS, but even so.
>>
>> Just refuse sector unaligned DIO and revert to MDS.
>>
>> With sector aligned IO you directly DIO to DIO pages,
>> problem solved.
>>
>> If you need the COW of partial blocks, you still use
>> page-cache pages, which is fine because they do not
>> intersect any of the DIO.
>>
> I certainly thought about it, but it doesn't work for AIO DIO case.
> Assuming BLOCK size is 8K, process A write to 0~4095 bytes of file foo
> with AIO DIO, at the same time process B write to 4096~8191 with AIO
> DIO at the same time. If kernel ever tries to reply on page cache to
> cope with invalid extent, it ends up with data corruption.
>
> This is a common problem for any extent based file system to deal with
> partial BLOCK (_NOT SECTOR_) AIODIO writes. If you wonder why, take a
> look at ext4_unaligned_aio() and all the ext4 AIODIO locking
> mechanisms... And that's the reason I bailed out non-block aligned AIO
> in previous DIO alignment patches. I think I should just keep the
> AIODIO bailout logic since adding locking method is slowing down
> writers while they can go locklessly through MDS. I will revive the
> bailout patches after fixing the buffer IO things.
>
There is an easy locking solution for DIO which will not cost much
for DIO and will cost nothing for buffered IO. You use the page-cache
page lock.
What you do is grab the zero-page of each block lock before/during writing to
any block. So for your example above they will all be serialized by page-zero
lock.
Of course you need like before to flush the page-cache pages before DIO and
invalidate all pages (NotUpToDate). You keep at least one page in page-cache
per block, but during DIO it will always be in Not-Up-To-Date empty state.
Then if needed, like example above the first time COW you still do through
page-cache
*
* That said I think your solution for only allowing BLOCK aligned DIO is good
* Applications should learn. They should however find out what BLOCK size is.
*
You could keep the proper info at the DM device you create for each device_id
See here: http://people.redhat.com/msnitzer/docs/io-limits.txt
The "logical_block_size" should be the proper BLOCK size above.
But we will need to talk about how to associate files with device_id's.
Perhaps at /proc or some IOCTL. There was also talks going on in the
Luster/cluster filesystems people about defining a FILE_LAYOUT xattr, which
will return these layout information in a generic way, for the likes of tar,
backup, and sorts.
> Cheers,
> Tao
Cheers
Boaz
next prev parent reply other threads:[~2012-07-26 14:13 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-07-25 7:31 pnfs LD partial sector write Peng Tao
2012-07-25 10:28 ` Boaz Harrosh
2012-07-25 10:45 ` Boaz Harrosh
2012-07-25 14:43 ` Peng Tao
2012-07-25 20:29 ` Boaz Harrosh
2012-07-26 2:43 ` Peng Tao
2012-07-26 7:29 ` Boaz Harrosh
2012-07-26 8:25 ` Peng Tao
2012-07-26 12:16 ` Boaz Harrosh
2012-07-26 13:57 ` Peng Tao
2012-07-26 14:30 ` Boaz Harrosh
2012-07-26 15:30 ` Peng Tao
2012-07-26 15:44 ` Boaz Harrosh
2012-07-26 7:47 ` Boaz Harrosh
2012-07-26 9:12 ` Peng Tao
2012-07-26 14:12 ` Boaz Harrosh [this message]
2012-07-26 15:07 ` Peng Tao
2012-07-26 16:00 ` Boaz Harrosh
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=5011505F.1020508@panasas.com \
--to=bharrosh@panasas.com \
--cc=bergwolf@gmail.com \
--cc=bhalevy@tonian.com \
--cc=linux-nfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox