From: Eric Sandeen <sandeen@redhat.com>
To: Dave Chinner <david@fromorbit.com>, Eric Sandeen <sandeen@sandeen.net>
Cc: Christoph Hellwig <hch@infradead.org>, xfs-oss <xfs@oss.sgi.com>
Subject: Re: [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg
Date: Wed, 13 Nov 2013 15:32:46 -0600
Message-ID: <5283EFFE.5090700@redhat.com>
In-Reply-To: <20131113212658.GJ6188@dastard>

On 11/13/13, 3:26 PM, Dave Chinner wrote:
> On Wed, Nov 13, 2013 at 01:08:30PM -0600, Eric Sandeen wrote:
>> On 11/13/13, 12:56 PM, Christoph Hellwig wrote:
>>> On Wed, Nov 13, 2013 at 12:25:33PM -0600, Eric Sandeen wrote:
>>>> Pure RFC; this might be crazy.  Here's the problem I'm trying to solve:
>>>>
>>>> Today, mkfs.xfs will select a 4k sector size for a 4k physical / 512 logical
>>>> drive.  (that change was done by me).  The thought was that it'd be an
>>>> efficiency gain to not make the drive do the (possible) RMW cycles on
>>>> 512-byte log IO, primarily.
>>>>
>>>> However, now this restricts all DIO to 4k alignment, not the otherwise-
>>>> possible 512.
>>>>
>>>> This came up when qemu-kvm, in cache=none mode, tries to boot off an
>>>> image hosted on such a filesystem, and its bios wants to do a 512 byte
>>>> direct IO read off the disk - it fails.
>>>>
>>>> But I'm wondering - the buftarg's bt_sshift and bt_smask are only used
>>>> in a few places.  
>>>
>>> No need to mess with kernel code IFF we want to change that, just keep
>>> the sector size at 512 bytes and set a log stripe unit at mkfs time.
>>>
>>> I have to admit that I'm not really sure if that's what we really want,
>>> though.  A drive that has a larger physical block size will need
>>> read-modify-write cycles internally, which we try to avoid.
>>
>> Yeah, the problem comes up when it is 100% impossible to boot a
>> qemu-kvm guest hosted on such a filesystem/drive.  :(
> 
> No it's not. Just use cache=writethrough and the page cache will
> take care of the mismatch when it occurs.

Sorry, I meant impossible w/ cache=none.

TBH, I don't know what best practice is.

>> (of course I guess that means it fails on a hard 4k drive too)
> 
> And on any other filesystem that thinks it has sectors larger than
> 512 bytes underlying it (e.g. cdrom has a 2k sector size).
> 
>> I don't know what the guest sees for logical/physical on its
>> file-backed block device in these cases.
> 
> Seems like that's the avenue for improvement here to me. i.e. expose
> the correct values to the guest so its mkfs does the right thing.
> Or, alternatively, make qemu buffer non-aligned/sized IOs itself
> internally.

The guest never _boots_ - it's not a guest mkfs issue.

The guest bios wants to read 512 via DIO off the image on this 4k
sector FS, and fails.
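
For reference, the "buffer it internally" option above would come down to
widening the 512-byte request to the filesystem's sector size before issuing
the DIO, then copying the wanted bytes out of a bounce buffer.  A rough
userspace sketch of just the arithmetic follows; this is not qemu code, and
struct aligned_span and widen_to_sector() are names made up for illustration.
It assumes the sector size is a power of two.

```c
#include <stdint.h>

/* Hypothetical helper type: the sector-aligned request that covers the
 * caller's unaligned one. */
struct aligned_span {
        uint64_t start;   /* offset rounded down to a sector boundary */
        uint64_t length;  /* bytes to read, a multiple of the sector size */
        uint64_t skip;    /* where the caller's data starts in the bounce buffer */
};

/* Widen [offset, offset + len) to sector_size boundaries.
 * Assumes sector_size is a power of two. */
static struct aligned_span widen_to_sector(uint64_t offset, uint64_t len,
                                           uint32_t sector_size)
{
        uint64_t mask = (uint64_t)sector_size - 1;
        struct aligned_span s;

        s.start = offset & ~mask;                  /* round offset down */
        s.skip = offset - s.start;                 /* bytes to discard up front */
        s.length = (s.skip + len + mask) & ~mask;  /* round the end up */
        return s;
}
```

With a 4k sector size, a 512-byte read at offset 512 becomes a 4096-byte read
at offset 0, and the caller's data sits 512 bytes into the buffer.  The real
read would also need a suitably aligned buffer (e.g. from posix_memalign()).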

> After all, it has been told to use direct IO, and when that happens
> it is the application's responsibility to ensure IO alignment
> requirements are met...

Agreed, but in talking to a qemu guy... 

"In my understanding, that's a limitation that directly comes from the BIOS interface."
"int 13h just assumes 512 bytes"

But this is above my pay grade.  I don't speak BIOS.

>> Anyway, if we took your suggestion, normal internal fs operations
>> (log IO) wouldn't RMW.  But we'd still presumably advertise and allow
>> smaller DIO sizes, which are inefficient.  We could advertise 4k, but
>> still allow 512 for less-smart apps, maybe?
> 
> I'd say such a problem is a matter of user education and making qemu
> aware of logical/physical differences - hacking weird corner cases
> into what a sector size means is only going to lead to confusion and
> bite us in unexpected ways...

Probably so; hence the "crazy" disclaimer.  ;)

But it does seem a little odd to semi-artificially reject DIOs which
the drive could actually handle.

Indeed, do_blockdev_direct_IO looks right at the logical block size,
and allows it:

        if (offset & blocksize_mask) {
                if (bdev)
                        blkbits = blksize_bits(bdev_logical_block_size(bdev));
                blocksize_mask = (1 << blkbits) - 1;
                if (offset & blocksize_mask)
                        goto out;
        }

it's our checks in XFS that fail.
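
To make the mismatch concrete, here is a standalone userspace sketch of the
same kind of mask test.  It is not the kernel or XFS code; dio_aligned() is a
made-up helper, and it assumes the sector size is a power of two.  The same
512-byte request passes against a 512-byte logical block size and fails
against a 4k sector size:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if both offset and length are multiples of
 * sector_size.  Assumes sector_size is a power of two. */
static bool dio_aligned(uint64_t offset, uint64_t len, uint32_t sector_size)
{
        uint64_t mask = (uint64_t)sector_size - 1;

        /* Any low bit set in either value means the request is misaligned. */
        return ((offset | len) & mask) == 0;
}
```

So dio_aligned(512, 512, 512) is true while dio_aligned(512, 512, 4096) is
false, which is the case the 512-byte BIOS read hits.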

-Eric

> Cheers,
> 
> Dave.
> 

