From: Eric Sandeen <sandeen@sandeen.net>
To: Dave Chinner <david@fromorbit.com>
Cc: Eric Sandeen <sandeen@redhat.com>, xfs-oss <xfs@oss.sgi.com>
Subject: Re: [PATCH] xfs: allow logical-sector sized O_DIRECT for any fs sector size
Date: Fri, 17 Jan 2014 11:35:30 -0600 [thread overview]
Message-ID: <52D969E2.7030308@sandeen.net> (raw)
In-Reply-To: <20140116232132.GT3431@dastard>
On 1/16/14, 5:21 PM, Dave Chinner wrote:
> On Wed, Jan 15, 2014 at 04:52:05PM -0600, Eric Sandeen wrote:
>> On 1/15/14, 4:38 PM, Dave Chinner wrote:
>>> On Wed, Jan 15, 2014 at 11:59:45AM -0600, Eric Sandeen wrote:
>>>> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
>>>> index 33ad9a7..1f3431f 100644
>>>> --- a/fs/xfs/xfs_ioctl.c
>>>> +++ b/fs/xfs/xfs_ioctl.c
>>>> @@ -1587,7 +1587,12 @@ xfs_file_ioctl(
>>>> XFS_IS_REALTIME_INODE(ip) ?
>>>> mp->m_rtdev_targp : mp->m_ddev_targp;
>>>>
>>>> - da.d_mem = da.d_miniosz = 1 << target->bt_sshift;
>>>> + /*
>>>> + * Report device physical sector size as "optimal" minimum,
>>>> + * unless blocksize is smaller than that.
>>>> + */
>>>> + da.d_miniosz = min(target->bt_pssize, target->bt_bsize);
>>>
>>> Just grab the filesystem block size from the xfs_mount:
>>>
>>> da.d_miniosz = min(target->bt_pssize, mp->m_sb.sb_blocksize);
>>>
>>>> + da.d_mem = da.d_miniosz;
>>>
>>> I'd suggest that this should be PAGE_SIZE - it's for memory buffer
>>> alignment, not IO alignment, so using the IO alignment just seems
>>> wrong to me...
>>
>> Ok. Was just sticking close to what we had before.
>>
>> So:
>> da.d_miniosz = min(target->bt_pssize, mp->m_sb.sb_blocksize);
>> da.d_mem = PAGE_SIZE;
>>
>> ? Then we can have a minimum IO size of 512, and a memory alignment of
>> 4k; isn't that a little odd?
>>
>> (IOWs we could do 512-aligned memory before, right? What's the downside,
>> or the value in changing it now?)
>
> We can do arbitrary byte aligned buffers if I understand
> get_user_pages() correctly - it just maps the page under the buffer
> into the kernel address space and then the bio is pointed at it.
> AFAICT, the reason for the "memory buffer needs 512 byte alignment"
> is simply that this is the minimum IO size supported.
Actually, it's fs/direct-io.c which enforces this (not sure why I couldn't
find that before), in do_blockdev_direct_IO(); the *enforced* minimum
memory alignment is the bdev's logical block size.
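
Roughly speaking, the check boils down to something like this
(paraphrasing from memory, not the literal fs/direct-io.c code):

#include <linux/fs.h>
#include <linux/blkdev.h>

/*
 * Paraphrased sketch of what do_blockdev_direct_IO() enforces: the
 * user buffer address, the file offset and the I/O length must all be
 * multiples of the bdev's logical block size, or the direct I/O is
 * rejected with -EINVAL.
 */
static bool dio_aligned(struct block_device *bdev, loff_t offset,
			unsigned long user_addr, size_t size)
{
	unsigned int blkbits = blksize_bits(bdev_logical_block_size(bdev));
	unsigned int blocksize_mask = (1 << blkbits) - 1;

	return !((user_addr | offset | size) & blocksize_mask);
}

So on a device with 512-byte logical sectors, 512-byte buffer alignment
is all that actually gets enforced, regardless of what we report in d_mem.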
> However, for large IOs, 512 byte alignment is not really optimal. If
> we don't align the buffer to PAGE_SIZE, then we have partial head
> and tail pages, so for a 512k IO we need to map 129 pages into a bio
> instead of 128. When you have hardware that can only handle 128
> segments in a DMA transfer, that means the 512k IO needs to be sent
> in two IOs rather than one.
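
To put numbers on that, a quick sketch (assuming 4k pages and a 512k IO):

#include <stdio.h>

#define PAGE_SIZE	4096UL

/* how many pages does a buffer at 'addr' of length 'len' touch? */
static unsigned long pages_spanned(unsigned long addr, unsigned long len)
{
	unsigned long first = addr / PAGE_SIZE;
	unsigned long last = (addr + len - 1) / PAGE_SIZE;

	return last - first + 1;
}

int main(void)
{
	printf("%lu\n", pages_spanned(0, 512 * 1024));   /* 128 */
	printf("%lu\n", pages_spanned(512, 512 * 1024)); /* 129 */
	return 0;
}

i.e. the same 512k IO needs one extra bio segment as soon as the buffer
is merely 512-byte aligned rather than page aligned.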
Ok, but I have a bit of a problem with changing what XFS_IOC_DIOINFO
reports. (I had originally thought to change the minimum IO size, but
I have talked myself out of that, too).
The xfsctl(3) manpage says that XFS_IOC_DIOINFO: "Get(s) information
required to perform direct I/O on the specified file descriptor."
and "the user’s data buffer must conform to the same type of
constraints as required for accessing a raw disk partition."
IOWs, the ioctl is documented as returning minimum, not optimal,
requirements, and it has always been implemented as such. Changing
its meaning now seems wrong. At the least, I would not like to do so
as part of this functional change; that would probably be better done
with a new ioctl which reports both minimum & optimal sizes. And at
that point we should just do a vfs interface. :)
Of course, applications don't have to use the minimum sizes reported
by the ioctl; they are free to be smarter and do larger sizes and
alignments. But if the ioctl was designed and documented to report
required *minimums*, I think we should leave it as such.
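
For example, a smarter application might do something like this
(untested sketch; struct dioattr and XFS_IOC_DIOINFO are as documented
in xfsctl(3), though the header path may vary with the xfsprogs
installation):

/*
 * Untested sketch: query the *minimum* O_DIRECT constraints via
 * XFS_IOC_DIOINFO, then choose to exceed them by rounding the buffer
 * alignment up to the page size.
 */
#define _GNU_SOURCE		/* O_DIRECT */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>		/* struct dioattr, XFS_IOC_DIOINFO */

int main(int argc, char **argv)
{
	struct dioattr da;
	size_t align, len;
	void *buf;
	int fd;

	if (argc != 2)
		return 1;

	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0 || ioctl(fd, XFS_IOC_DIOINFO, &da) < 0) {
		perror(argv[1]);
		return 1;
	}

	/* d_mem and d_miniosz are minimums; nothing stops us using more */
	align = da.d_mem;
	if (align < (size_t)sysconf(_SC_PAGESIZE))
		align = sysconf(_SC_PAGESIZE);
	len = da.d_miniosz * 8;		/* a multiple of the minimum IO size */

	if (posix_memalign(&buf, align, len)) {
		close(fd);
		return 1;
	}

	if (pread(fd, buf, len, 0) < 0)
		perror("pread");

	free(buf);
	close(fd);
	return 0;
}

That keeps the reported values as true minimums while still letting
callers round up on their own when it helps throughput.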
I'm going to resend the change, split up a bit more to separate
cleanups from functional changes, and maybe we can discuss the ioctl
change as a potential additional patch.
Thanks,
-Eric
> There's quite a bit of hardware out there that have a limit of 128
> segments to each IO....
>
> Cheers,
>
> Dave.
>
Thread overview: 16+ messages
2014-01-15 17:59 [PATCH] xfs: allow logical-sector sized O_DIRECT for any fs sector size Eric Sandeen
2014-01-15 22:38 ` Dave Chinner
2014-01-15 22:52 ` Eric Sandeen
2014-01-16 23:21 ` Dave Chinner
2014-01-17 17:35 ` Eric Sandeen [this message]
2014-01-17 20:22 ` [PATCH 0/3 V2] " Eric Sandeen
2014-01-17 20:23 ` [PATCH 1/3] xfs: clean up xfs_buftarg Eric Sandeen
2014-01-20 14:21 ` Brian Foster
2014-01-17 20:26 ` [PATCH 2/3] xfs: rename xfs_buftarg structure members Eric Sandeen
2014-01-17 21:12 ` Roger Willcocks
2014-01-17 21:13 ` Eric Sandeen
2014-01-17 21:14 ` [PATCH 2/3 V2] " Eric Sandeen
2014-01-20 14:21 ` Brian Foster
2014-01-17 20:28 ` [PATCH 3/3] xfs: allow logical-sector sized O_DIRECT IOs Eric Sandeen
2014-01-20 14:21 ` Brian Foster
2014-01-20 14:53 ` Eric Sandeen