public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Eric Sandeen <sandeen@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>,
	Eric Sandeen <sandeen@sandeen.net>, xfs-oss <xfs@oss.sgi.com>
Subject: Re: [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg
Date: Thu, 14 Nov 2013 09:10:09 +1100	[thread overview]
Message-ID: <20131113221009.GK6188@dastard> (raw)
In-Reply-To: <5283EFFE.5090700@redhat.com>

On Wed, Nov 13, 2013 at 03:32:46PM -0600, Eric Sandeen wrote:
> On 11/13/13, 3:26 PM, Dave Chinner wrote:
> > On Wed, Nov 13, 2013 at 01:08:30PM -0600, Eric Sandeen wrote:
> >> On 11/13/13, 12:56 PM, Christoph Hellwig wrote:
> >>> On Wed, Nov 13, 2013 at 12:25:33PM -0600, Eric Sandeen wrote:
> >>>> Pure RFC; this might be crazy.  Here's the problem I'm trying to solve:
> >>>>
> >>>> Today, mkfs.xfs will select a 4k sector size for a 4k physical / 512 logical
> >>>> drive.  (that change was done by me).  The thought was that it'd be an
> >>>> efficiency gain to not make the drive do the (possible) RMW cycles on
> >>>> 512-byte log IO, primarily.
> >>>>
> >>>> However, now this restricts all DIO to 4k alignment, not the otherwise-
> >>>> possible 512.
> >>>>
> >>>> This came up when qemu-kvm, in cache=none mode, tries to boot off an
> >>>> image hosted on such a filesystem, and its bios wants to do a 512 byte
> >>>> direct IO read off the disk - it fails.
> >>>>
> >>>> But I'm wondering - the buftarg's bt_sshift and bt_smask are only used
> >>>> in a few places.  
> >>>
> >>> No need to mess with kernel code IFF we want to change that, just keep
> >>> the sector size at 512 bytes and set a log stripe unit at mkfs time.
> >>>
> >>> I have to admit that I'm not really sure if that's what we really want,
> >>> through.  A drive that has a larger physical block size will need
> >>> read-modify-write cycles internally, which we try to avoid.
> >>
> >> Yeah, the problem comes up when it is 100% impossible to boot a
> >> qemu-kvm guest hosted on such a filesystem/drive.  :(
> > 
> > No it's not. Just use cache=writethrough and the page cache will
> > take care of the mismatch when it occurs.
> 
> Sorry, I meant impossible w/ cache=none.
> 
> TBH, I don't know what best practice is.

No idea what best practice for the virt side is, but best practice
from a storage perspective is that everything in the IO stack should
use the sector size from the underlying layer. Given that virt adds
layers to the storage stack above the filesystem, that means it
needs to support whatever sector size the filesystem is using...

> >> (of course I guess that means it fails on a hard 4k drive too)
> > 
> > And on any other filesystem that thinks it has sectors larger than
> > 512 bytes underlying it (e.g. cdrom has a 2k sector size).
> > 
> >> I don't know what the guest sees for logical/physical on its
> >> file-backed block device in these cases.
> > 
> > Seems like that's the avenue for improvement here to me. i.e. expose
> > the correct values to the guest so it's mkfs does the right thing.
> > Or, alternatively, make qemu buffer non-aligned/sized IOs itself
> > internally.
> 
> The guest never _boots_ - it's not a guest mkfs issue.

Oh, that wasn't clear.

> The guest bios wants to read 512 via DIO off the image on this 4k
> sector FS, and fails.

So the bios has never been updated to handle 4k sector devices?

> > After all, it has been told to use direct IO, and when that happens
> > it is the application's responsibility to ensure IO alignment
> > requirements are met...
> 
> Agreed, but in talking to a qemu guy... 
> 
> "In my understanding, that's a limitation that directly comes from the BIOS interface."
> "int 13h just assumes 512 bytes"
> 
> But this is above my pay grade.  I don't speak BIOS.

Yet all modern bios implementations you find in hardware can boot 4k
sector devices just fine. So, what bios does qemu use?

$ man qemu
.....
QEMU uses the PC BIOS from the Bochs project and the Plex86/Bochs
LGPL VGA BIOS.
.....

So what we have here is an *open source bios* that doesn't handle
drives 4k sector sizes. There's the problem that needs to be fixed....

> >> Anyway, if we took your suggestion, normal internal fs operations
> >> (log IO) wouldn't RMW.  But we'd still presumably advertise and allow
> >> smaller DIO sizes, which are inefficient.  We could advertise 4k, but
> >> still allow 512 for less-smart apps, maybe?
> > 
> > I'd say such a problem is a matter of user education and making qemu
> > aware of logical/physical differences - hacking weird corner cases
> > into what a sector size means is only going to lead to confusion and
> > bite us in unexpected ways...
> 
> Probably so; hence the "crazy" disclaimer.  ;)
> 
> But it does seem a little odd to semi-artificially reject DIOs which
> the drive could actually handle.
> 
> Indeed, do_blockdev_direct_IO looks right at the logical block size,
> and allows it:
> 
>         if (offset & blocksize_mask) {
>                 if (bdev)
>                         blkbits = blksize_bits(bdev_logical_block_size(bdev));
>                 blocksize_mask = (1 << blkbits) - 1;
>                 if (offset & blocksize_mask)
>                         goto out;
>         }
> 
> it's our checks in XFS that fail.

No they don't - they are working just fine. We've told XFS that the
sector size is X, and therefore we don't allow IO in smaller units,
data or metadata.  That's the whole point of the filesystem having a
configurable sector size - we can *enforce* a larger minimum IO
requirement than the underlying hardware supports.

We've done this for years - e.g. long time ago MD devices had a
massive performance penalty for sub-page sized IOs, so mkfs set the
sector size to 4k to avoid that problem, even though we could have
done 512 byte IOs to the underlying devices.

Lets fix the problem at the source - the bios that doesn't support
4k sector devices - like we've done for all the other utilities that
need to be aware of disk sector sizes....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

  reply	other threads:[~2013-11-13 22:10 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-11-13 18:25 [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg Eric Sandeen
2013-11-13 18:56 ` Christoph Hellwig
2013-11-13 19:08   ` Eric Sandeen
2013-11-13 21:26     ` Dave Chinner
2013-11-13 21:32       ` Eric Sandeen
2013-11-13 22:10         ` Dave Chinner [this message]
2013-11-13 22:18           ` Eric Sandeen
2013-11-14  0:34             ` Dave Chinner
2013-11-14 13:37       ` Christoph Hellwig
2013-11-14 14:56         ` Eric Sandeen
2013-11-14 21:01           ` Dave Chinner
2013-11-22 14:13             ` Ric Wheeler
2013-11-22 14:20               ` Christoph Hellwig
2013-11-22 14:26                 ` Ric Wheeler
2013-11-22 14:57               ` Eric Sandeen
2013-11-14  0:35 ` Eric Sandeen
2013-11-14  6:49   ` Dave Chinner
2013-11-14 13:09     ` Ric Wheeler
2013-11-14 15:03       ` Eric Sandeen
2013-11-14 15:18     ` Eric Sandeen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20131113221009.GK6188@dastard \
    --to=david@fromorbit.com \
    --cc=hch@infradead.org \
    --cc=sandeen@redhat.com \
    --cc=sandeen@sandeen.net \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox