public inbox for linux-xfs@vger.kernel.org
From: Eric Sandeen <sandeen@redhat.com>
To: Dave Chinner <david@fromorbit.com>, Eric Sandeen <sandeen@sandeen.net>
Cc: Christoph Hellwig <hch@infradead.org>, xfs-oss <xfs@oss.sgi.com>
Subject: Re: [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg
Date: Wed, 13 Nov 2013 15:32:46 -0600
Message-ID: <5283EFFE.5090700@redhat.com>
In-Reply-To: <20131113212658.GJ6188@dastard>

On 11/13/13, 3:26 PM, Dave Chinner wrote:
> On Wed, Nov 13, 2013 at 01:08:30PM -0600, Eric Sandeen wrote:
>> On 11/13/13, 12:56 PM, Christoph Hellwig wrote:
>>> On Wed, Nov 13, 2013 at 12:25:33PM -0600, Eric Sandeen wrote:
>>>> Pure RFC; this might be crazy.  Here's the problem I'm trying to solve:
>>>>
>>>> Today, mkfs.xfs will select a 4k sector size for a 4k physical / 512 logical
>>>> drive.  (that change was done by me).  The thought was that it'd be an
>>>> efficiency gain to not make the drive do the (possible) RMW cycles on
>>>> 512-byte log IO, primarily.
>>>>
>>>> However, now this restricts all DIO to 4k alignment, not the otherwise-
>>>> possible 512.
>>>>
>>>> This came up when qemu-kvm, in cache=none mode, tries to boot off an
>>>> image hosted on such a filesystem, and its bios wants to do a 512 byte
>>>> direct IO read off the disk - it fails.
>>>>
>>>> But I'm wondering - the buftarg's bt_sshift and bt_smask are only used
>>>> in a few places.  
>>>
>>> No need to mess with kernel code IFF we want to change that, just keep
>>> the sector size at 512 bytes and set a log stripe unit at mkfs time.
>>>
>>> I have to admit that I'm not really sure if that's what we really want,
>>> though.  A drive that has a larger physical block size will need
>>> read-modify-write cycles internally, which we try to avoid.
>>
>> Yeah, the problem comes up when it is 100% impossible to boot a
>> qemu-kvm guest hosted on such a filesystem/drive.  :(
> 
> No it's not. Just use cache=writethrough and the page cache will
> take care of the mismatch when it occurs.

Sorry, I meant impossible w/ cache=none.

TBH, I don't know what best practice is.

>> (of course I guess that means it fails on a hard 4k drive too)
> 
> And on any other filesystem that thinks it has sectors larger than
> 512 bytes underlying it (e.g. cdrom has a 2k sector size).
> 
>> I don't know what the guest sees for logical/physical on its
>> file-backed block device in these cases.
> 
> Seems like that's the avenue for improvement here to me. i.e. expose
> the correct values to the guest so its mkfs does the right thing.
> Or, alternatively, make qemu buffer non-aligned/sized IOs itself
> internally.

The guest never _boots_ - it's not a guest mkfs issue.

The guest bios wants to read 512 via DIO off the image on this 4k
sector FS, and fails.

> After all, it has been told to use direct IO, and when that happens
> it is the application's responsibility to ensure IO alignment
> requirements are met...

Agreed, but in talking to a qemu guy... 

"In my understanding, that's a limitation that directly comes from the BIOS interface."
"int 13h just assumes 512 bytes"

But this is above my pay grade.  I don't speak BIOS.

>> Anyway, if we took your suggestion, normal internal fs operations
>> (log IO) wouldn't RMW.  But we'd still presumably advertise and allow
>> smaller DIO sizes, which are inefficient.  We could advertise 4k, but
>> still allow 512 for less-smart apps, maybe?
> 
> I'd say such a problem is a matter of user education and making qemu
> aware of logical/physical differences - hacking weird corner cases
> into what a sector size means is only going to lead to confusion and
> bite us in unexpected ways...

Probably so; hence the "crazy" disclaimer.  ;)

But it does seem a little odd to semi-artificially reject DIOs which
the drive could actually handle.

Indeed, do_blockdev_direct_IO looks right at the logical block size,
and allows it:

        if (offset & blocksize_mask) {
                if (bdev)
                        blkbits = blksize_bits(bdev_logical_block_size(bdev));
                blocksize_mask = (1 << blkbits) - 1;
                if (offset & blocksize_mask)
                        goto out;
        }

it's our checks in XFS that fail.
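
A minimal userspace sketch of the two checks being contrasted above, for
illustration only. The helper names (size_to_bits, dio_offset_allowed,
sector_aligned) are hypothetical and not kernel functions. The first
check mirrors the quoted fallback from the filesystem block size to the
device's logical block size; the second assumes the XFS-side check is a
simple mask against the filesystem sector size (sector size minus one,
in the spirit of bt_smask mentioned earlier in the thread).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: log2 of a power-of-two size, standing in for
 * what blksize_bits() computes in the quoted kernel snippet. */
static unsigned int size_to_bits(unsigned int size)
{
	unsigned int bits = 0;

	while (size > 1) {
		size >>= 1;
		bits++;
	}
	return bits;
}

/* Models the quoted check: test the offset against the filesystem
 * block size first; if that fails, retry against the device's
 * logical block size before rejecting the IO. */
static bool dio_offset_allowed(uint64_t offset,
			       unsigned int fs_block_size,
			       unsigned int logical_block_size)
{
	uint64_t mask = ((uint64_t)1 << size_to_bits(fs_block_size)) - 1;

	if (offset & mask) {
		mask = ((uint64_t)1 << size_to_bits(logical_block_size)) - 1;
		if (offset & mask)
			return false;
	}
	return true;
}

/* Assumed shape of the stricter filesystem-side check: the offset
 * must be aligned to the filesystem sector size (a power of two). */
static bool sector_aligned(uint64_t offset, unsigned int sector_size)
{
	return (offset & ((uint64_t)sector_size - 1)) == 0;
}
```

With a 4k filesystem block size, a 4k filesystem sector size and a
512-byte logical block size, an offset of 512 passes
dio_offset_allowed() but fails sector_aligned(), which is the mismatch
described here.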

-Eric

> Cheers,
> 
> Dave.
> 
