All of lore.kernel.org
 help / color / mirror / Atom feed
From: Boaz Harrosh <bharrosh@panasas.com>
To: Peng Tao <bergwolf@gmail.com>
Cc: linuxnfs <linux-nfs@vger.kernel.org>, Benny Halevy <bhalevy@tonian.com>
Subject: Re: pnfs LD partial sector write
Date: Thu, 26 Jul 2012 17:12:47 +0300	[thread overview]
Message-ID: <5011505F.1020508@panasas.com> (raw)
In-Reply-To: <CA+a=Yy7b54qE3v8Yjq6-z3zjqrPH3Vt1madnVZk2aysYcOdbAA@mail.gmail.com>

On 07/26/2012 12:12 PM, Peng Tao wrote:

> On Thu, Jul 26, 2012 at 3:47 PM, Boaz Harrosh <bharrosh@panasas.com> wrote:
>> On 07/26/2012 05:43 AM, Peng Tao wrote:
>>
>>> Another thing is, this further complicates direct writes, where I
>>> cannot use pagecache to ensure proper locking for concurrent writers
>>> in the same BLOCK, and sector-aligned partial BLOCK DIO writes need to
>>> be serialized internally. IOW, the same code cannot be reused by DIO
>>> writes. sigh...
>>>
>>
>>
>> One last thing. Applications who use direct IO know to allocate
>> and issue sector aligned requests both at offset and length.
>> That's a Kernel requirement. It is not for NFS, but even so.
>>
>> Just refuse sector unaligned DIO and revert to MDS.
>>
>> With sector aligned IO you directly DIO to DIO pages,
>> problem solved.
>>
>> If you need the COW of partial blocks, you still use
>> page-cache pages, which is fine because they do not
>> intersect any of the DIO.
>>
> I certainly thought about it, but it doesn't work for AIO DIO case.
> Assuming BLOCK size is 8K, process A write to 0~4095 bytes of file foo
> with AIO DIO, at the same time process B write to 4096~8191 with AIO
> DIO at the same time. If kernel ever tries to reply on page cache to
> cope with invalid extent, it ends up with data corruption.
> 
> This is a common problem for any extent based file system to deal with
> partial BLOCK (_NOT SECTOR_) AIODIO writes. If you wonder why, take a
> look at ext4_unaligned_aio() and all the ext4 AIODIO locking
> mechanisms... And that's the reason I bailed out non-block aligned AIO
> in previous DIO alignment patches. I think I should just keep the
> AIODIO bailout logic since adding locking method is slowing down
> writers while they can go locklessly through MDS. I will revive the
> bailout patches after fixing the buffer IO things.
> 


There is an easy locking solution for DIO which will not cost much
for DIO and will cost nothing for buffered IO. You use the page-cache
page lock.

What you do is grab the zero-page of each block lock before/during writing to
any block. So for your example above they will all be serialized by page-zero
lock.

Of course you need like before to flush the page-cache pages before DIO and
invalidate all pages (NotUpToDate). You keep at least one page in page-cache
per block, but during DIO it will always be in Not-Up-To-Date empty state.

Then if needed, like example above the first time COW you still do through
page-cache

*
* That said I think your solution for only allowing BLOCK aligned DIO is good
* Applications should learn. They should however find out what BLOCK size is.
*

You could keep the proper info at the DM device you create for each device_id
See here: http://people.redhat.com/msnitzer/docs/io-limits.txt
The "logical_block_size" should be the proper BLOCK size above.

But we will need to talk about how to associate files with device_id's.
Perhaps at /proc or some IOCTL. There was also talks going on in the
Luster/cluster filesystems people about defining a FILE_LAYOUT xattr, which
will return these layout information in a generic way, for the likes of tar,
backup, and sorts.

> Cheers,
> Tao


Cheers
Boaz

  reply	other threads:[~2012-07-26 14:13 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-07-25  7:31 pnfs LD partial sector write Peng Tao
2012-07-25 10:28 ` Boaz Harrosh
2012-07-25 10:45   ` Boaz Harrosh
2012-07-25 14:43   ` Peng Tao
2012-07-25 20:29     ` Boaz Harrosh
2012-07-26  2:43       ` Peng Tao
2012-07-26  7:29         ` Boaz Harrosh
2012-07-26  8:25           ` Peng Tao
2012-07-26 12:16             ` Boaz Harrosh
2012-07-26 13:57               ` Peng Tao
2012-07-26 14:30                 ` Boaz Harrosh
2012-07-26 15:30                   ` Peng Tao
2012-07-26 15:44                     ` Boaz Harrosh
2012-07-26  7:47         ` Boaz Harrosh
2012-07-26  9:12           ` Peng Tao
2012-07-26 14:12             ` Boaz Harrosh [this message]
2012-07-26 15:07               ` Peng Tao
2012-07-26 16:00                 ` Boaz Harrosh

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5011505F.1020508@panasas.com \
    --to=bharrosh@panasas.com \
    --cc=bergwolf@gmail.com \
    --cc=bhalevy@tonian.com \
    --cc=linux-nfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.