From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Dan Williams <dan.j.williams@intel.com>,
	linux-nvdimm <linux-nvdimm@lists.01.org>,
	Ross Zwisler <zwisler@kernel.org>,
	Vishal L Verma <vishal.l.verma@intel.com>,
	xfs <linux-xfs@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [RFC PATCH] pmem: advertise page alignment for pmem devices supporting fsdax
Date: Fri, 22 Feb 2019 16:46:53 -0800	[thread overview]
Message-ID: <20190223004653.GD21626@magnolia> (raw)
In-Reply-To: <20190222233038.GD23020@dastard>

On Sat, Feb 23, 2019 at 10:30:38AM +1100, Dave Chinner wrote:
> On Fri, Feb 22, 2019 at 10:45:25AM -0800, Darrick J. Wong wrote:
> > On Fri, Feb 22, 2019 at 10:28:15AM -0800, Dan Williams wrote:
> > > On Fri, Feb 22, 2019 at 10:21 AM Darrick J. Wong
> > > <darrick.wong@oracle.com> wrote:
> > > >
> > > > Hi all!
> > > >
> > > > Uh, we have an internal customer <cough> who's been trying out MAP_SYNC
> > > > on pmem, and they've observed that one has to do a fair amount of
> > > > legwork (in the form of mkfs.xfs parameters) to get the kernel to set up
> > > > 2M PMD mappings.  They (of course) want to mmap hundreds of GB of pmem,
> > > > so the PMD mappings are much more efficient.
> 
> Are you really saying that "mkfs.xfs -d su=2MB,sw=1 <dev>" is
> considered "too much legwork" to set up the filesystem for DAX and
> PMD alignment?

Yes.  I mean ... userspace /can/ figure out the page sizes on arm64 &
ppc64le (or extract it from sysfs), but why not just advertise it as an
I/O hint on the pmem "block" device?
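
For reference, this is roughly all userspace has to do today to dig
those numbers out.  A sketch only: it assumes a namespace named
/dev/pmem0, and mkfs.xfs pulls the same values through its own
topology probing rather than reading sysfs by hand.

#include <stdio.h>
#include <unistd.h>

/* Read one of the standard block-layer queue attributes for pmem0. */
static long read_queue_attr(const char *attr)
{
	char path[128];
	long val = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/pmem0/queue/%s", attr);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

int main(void)
{
	printf("page size: %ld\n", sysconf(_SC_PAGESIZE));
	printf("io_min:    %ld\n", read_queue_attr("minimum_io_size"));
	printf("io_opt:    %ld\n", read_queue_attr("optimal_io_size"));
	return 0;
}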

Hmm, now having watched various xfstests blow up because they don't
expect blocks to be larger than 64k, maybe I'll rethink this as a
default behavior. :)

> > > > I started poking around w.r.t. what mkfs.xfs was doing and realized that
> > > > if the fsdax pmem device advertised iomin/ioopt of 2MB, then mkfs will
> > > > set up all the parameters automatically.  Below is my ham-handed attempt
> > > > to teach the kernel to do this.
> 
> Still need extent size hints so that writes that are smaller than
> the PMD size are allocated correctly aligned and sized to map to
> PMDs...

I think we're generally planning to use the RT device where we can make
2M alignment mandatory, so for the data device the effectiveness of the
extent hint doesn't really matter.
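
For the data device case, though, setting such a hint from a program
looks roughly like the sketch below.  It is what "xfs_io -c 'extsize 2m'
<file>" does under the hood, assuming the FS_IOC_FSGETXATTR /
FS_IOC_FSSETXATTR interface from <linux/fs.h>; the hint has to be set
before the file has any data in it.

#include <sys/ioctl.h>
#include <linux/fs.h>

/* Ask for bytes-sized (e.g. 2 * 1024 * 1024) aligned extents on fd. */
int set_extsize_hint(int fd, unsigned int bytes)
{
	struct fsxattr fsx;

	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return -1;
	fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;	/* honour fsx_extsize below */
	fsx.fsx_extsize = bytes;		/* extent size hint in bytes */
	return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}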

> > > > Comments, flames, "WTF is this guy smoking?" are all welcome. :)
> > > >
> > > > --D
> > > >
> > > > ---
> > > > Configure pmem devices to advertise the default page alignment when said
> > > > block device supports fsdax.  Certain filesystems use these iomin/ioopt
> > > > hints to try to create aligned file extents, which makes it much easier
> > > > for mmaps to take advantage of huge page table entries.
> > > >
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > >  drivers/nvdimm/pmem.c |    5 ++++-
> > > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > > > index bc2f700feef8..3eeb9dd117d5 100644
> > > > --- a/drivers/nvdimm/pmem.c
> > > > +++ b/drivers/nvdimm/pmem.c
> > > > @@ -441,8 +441,11 @@ static int pmem_attach_disk(struct device *dev,
> > > >         blk_queue_logical_block_size(q, pmem_sector_size(ndns));
> > > >         blk_queue_max_hw_sectors(q, UINT_MAX);
> > > >         blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
> > > > -       if (pmem->pfn_flags & PFN_MAP)
> > > > +       if (pmem->pfn_flags & PFN_MAP) {
> > > >                 blk_queue_flag_set(QUEUE_FLAG_DAX, q);
> > > > +               blk_queue_io_min(q, PFN_DEFAULT_ALIGNMENT);
> > > > +               blk_queue_io_opt(q, PFN_DEFAULT_ALIGNMENT);
> > > 
> > > The device alignment might sometimes be bigger than this default.
> > > Would there be any detrimental effects for filesystems if io_min and
> > > io_opt were set to 1GB?
> > 
> > Hmmm, that's going to be a struggle on ext4 and the xfs data device
> > because we'd be preferentially skipping the 1023.8MB immediately after
> > each allocation group's metadata.  It already does this now with a 2MB
> > io hint, but losing 1.8MB here and there isn't so bad.
> > 
> > We'd have to study it further, though; filesystems historically have
> > interpreted the iomin/ioopt hints as RAID striping geometry, and I don't
> > think very many people set up 1GB raid stripe units.
> 
> Setting sunit=1GB is really going to cause havoc with things like
> inode chunk allocation alignment, and the first write() will either
> have to be >=1GB or use 1GB extent size hints to trigger alignment.
> And, AFAICT, it will prevent us from doing 2MB alignment on other
> files, even with 2MB extent size hints set.
> 
> IOWs, I don't think 1GB alignment is a good idea as a default.

<nods>

> > (I doubt very many people have done 2M raid stripes either, but it seems
> > to work easily where we've tried it...)
> 
> That's been pretty common with stacked hardware raid for as long as
> I've worked on XFS. e.g. a software RAID0 stripe of hardware RAID5/6
> luns was pretty common with large storage arrays in HPC environments
> (i.e. huge streaming read/write bandwidth). In these cases, XFS was
> set up with the RAID5/6 lun width as the stripe unit (commonly 2MB
> with 8+1 and 256k raid chunk size), and the RAID 0
> width as the stripe width (commonly 8-16 wide spread across 8-16 FC
> ports w/ multipath) and it wasn't uncommon to see widths in the
> 16-32MB range.
> 
> This aligned the filesystem to the underlying RAID5/6 luns, and
> allowed stripe width IO to be aligned and hit every RAID5/6 lun
> evenly. Ensuring applications could do this easily with large direct
> IO reads and writes is where the swalloc and largeio mount
> options come into their own....

<nod>
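
(To spell out that arithmetic, taking the 8+1, 256k-chunk, 16-lun
geometry above as the assumed example:)

#include <stdio.h>

int main(void)
{
	unsigned long chunk = 256 * 1024UL;	/* per-disk RAID5 chunk */
	unsigned long data_disks = 8;		/* 8+1 -> 8 data disks per lun */
	unsigned long luns = 16;		/* RAID0 members */
	unsigned long su = chunk * data_disks;	/* stripe unit = lun width, 2 MiB */
	unsigned long sw = su * luns;		/* stripe width, 32 MiB */

	printf("su=%luk sw=%luk\n", su >> 10, sw >> 10);
	return 0;
}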

> > > I'm thinking an xfs-realtime configuration might be able to support
> > > 1GB mappings in the future.
> > 
> > The xfs realtime device ought to be able to support 1g alignment pretty
> > easily though. :)
> 
> Yup, but I think that's the maximum "block" size it can support and

It is; our users with 16G page size are out of luck.

> DAX will have some serious long tail latency and CPU usage issues at
> allocation time because each new 1GB "block" that is dynamically
> allocated will have to be completely zeroed during the allocation
> inside the page fault handler.....

Agreed, 1G pages are most probably too unwieldy to be worth advertising.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
