From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Dan Williams <dan.j.williams@intel.com>,
linux-nvdimm <linux-nvdimm@lists.01.org>,
Ross Zwisler <zwisler@kernel.org>,
Vishal L Verma <vishal.l.verma@intel.com>,
xfs <linux-xfs@vger.kernel.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [RFC PATCH] pmem: advertise page alignment for pmem devices supporting fsdax
Date: Fri, 22 Feb 2019 16:46:53 -0800
Message-ID: <20190223004653.GD21626@magnolia>
In-Reply-To: <20190222233038.GD23020@dastard>
On Sat, Feb 23, 2019 at 10:30:38AM +1100, Dave Chinner wrote:
> On Fri, Feb 22, 2019 at 10:45:25AM -0800, Darrick J. Wong wrote:
> > On Fri, Feb 22, 2019 at 10:28:15AM -0800, Dan Williams wrote:
> > > On Fri, Feb 22, 2019 at 10:21 AM Darrick J. Wong
> > > <darrick.wong@oracle.com> wrote:
> > > >
> > > > Hi all!
> > > >
> > > > Uh, we have an internal customer <cough> who's been trying out MAP_SYNC
> > > > on pmem, and they've observed that one has to do a fair amount of
> > > > legwork (in the form of mkfs.xfs parameters) to get the kernel to set up
> > > > 2M PMD mappings. They (of course) want to mmap hundreds of GB of pmem,
> > > > so the PMD mappings are much more efficient.
>
> Are you really saying that "mkfs.xfs -d su=2MB,sw=1 <dev>" is
> considered "too much legwork" to set up the filesystem for DAX and
> PMD alignment?
Yes. I mean ... userspace /can/ figure out the page sizes on arm64 &
ppc64le (or extract them from sysfs), but why not just advertise it as
an io hint on the pmem "block" device?
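A rough sketch of what I have in mind, assuming a /dev/pmem0 that
advertises the 2M hint (mkfs.xfs already pulls stripe geometry from the
block device topology, so no -d su/sw options should be needed):
  $ cat /sys/block/pmem0/queue/minimum_io_size
  2097152
  $ cat /sys/block/pmem0/queue/optimal_io_size
  2097152
  $ mkfs.xfs /dev/pmem0    # sunit/swidth picked up from the io hints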
Hmm, now having watched various xfstests blow up because they don't
expect blocks to be larger than 64k, maybe I'll rethink this as a
default behavior. :)
> > > > I started poking around w.r.t. what mkfs.xfs was doing and realized that
> > > > if the fsdax pmem device advertised iomin/ioopt of 2MB, then mkfs will
> > > > set up all the parameters automatically. Below is my ham-handed attempt
> > > > to teach the kernel to do this.
>
> Still need extent size hints so that writes that are smaller than
> the PMD size are allocated correctly aligned and sized to map to
> PMDs...
I think we're generally planning to use the RT device where we can make
2M alignment mandatory, so for the data device the effectiveness of the
extent hint doesn't really matter.
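For anyone who does want the hint on the data device, a rough sketch
with xfs_io and a hypothetical mount point; files created under the
directory inherit the hint, so sub-2M writes still get 2M-aligned,
2M-sized allocations:
  $ xfs_io -c "extsize 2m" /mnt/pmem   # inheritable 2M extent size hint
  $ xfs_io -c extsize /mnt/pmem        # read the hint back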
> > > > Comments, flames, "WTF is this guy smoking?" are all welcome. :)
> > > >
> > > > --D
> > > >
> > > > ---
> > > > Configure pmem devices to advertise the default page alignment when said
> > > > block device supports fsdax. Certain filesystems use these iomin/ioopt
> > > > hints to try to create aligned file extents, which makes it much easier
> > > > for mmaps to take advantage of huge page table entries.
> > > >
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > > drivers/nvdimm/pmem.c | 5 ++++-
> > > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > > > index bc2f700feef8..3eeb9dd117d5 100644
> > > > --- a/drivers/nvdimm/pmem.c
> > > > +++ b/drivers/nvdimm/pmem.c
> > > > @@ -441,8 +441,11 @@ static int pmem_attach_disk(struct device *dev,
> > > > blk_queue_logical_block_size(q, pmem_sector_size(ndns));
> > > > blk_queue_max_hw_sectors(q, UINT_MAX);
> > > > blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
> > > > - if (pmem->pfn_flags & PFN_MAP)
> > > > + if (pmem->pfn_flags & PFN_MAP) {
> > > > blk_queue_flag_set(QUEUE_FLAG_DAX, q);
> > > > + blk_queue_io_min(q, PFN_DEFAULT_ALIGNMENT);
> > > > + blk_queue_io_opt(q, PFN_DEFAULT_ALIGNMENT);
> > >
> > > The device alignment might sometimes be bigger than this default.
> > > Would there be any detrimental effects for filesystems if io_min and
> > > io_opt were set to 1GB?
> >
> > Hmmm, that's going to be a struggle on ext4 and the xfs data device
> > because we'd be preferentially skipping the 1023.8MB immediately after
> > each allocation group's metadata. It already does this now with a 2MB
> > io hint, but losing 1.8MB here and there isn't so bad.
> >
> > We'd have to study it further, though; filesystems historically have
> > interpreted the iomin/ioopt hints as RAID striping geometry, and I don't
> > think very many people set up 1GB raid stripe units.
>
> Setting sunit=1GB is really going to cause havoc with things like
> inode chunk allocation alignment, and the first write() will either
> have to be >=1GB or use 1GB extent size hints to trigger alignment.
> And, AFAICT, it will prevent us from doing 2MB alignment on other
> files, even with 2MB extent size hints set.
>
> IOWs, I don't think 1GB alignment is a good idea as a default.
<nods>
> > (I doubt very many people have done 2M raid stripes either, but it seems
> > to work easily where we've tried it...)
>
> That's been pretty common with stacked hardware raid for as long as
> I've worked on XFS. e.g. a software RAID0 stripe of hardware RAID5/6
> luns was pretty common with large storage arrays in HPC environments
> (i.e. huge streaming read/write bandwidth). In these cases, XFS was
> set up with the RAID5/6 lun width as the stripe unit (commonly 2MB
> with 8+1 and 256k raid chunk size), and the RAID 0
> width as the stripe width (commonly 8-16 wide spread across 8-16 FC
> ports w/ multipath) and it wasn't uncommon to see widths in the
> 16-32MB range.
>
> This aligned the filesystem to the underlying RAID5/6 luns, and
> allowed stripe width IO to be aligned and hit every RAID5/6 lun
> evenly. Ensuring applications could do this easily with large direct
> IO reads and writes is where the swalloc and largeio mount
> options come into their own....
<nod>
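Concretely, that kind of stack maps to something like this at mkfs time
(8 data disks x 256k chunk = 2M per RAID5/6 lun, with, say, 16 luns in
the RAID0 stripe; the device name is made up):
  $ mkfs.xfs -d su=2m,sw=16 /dev/mapper/bigarray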
> > > I'm thinking an xfs-realtime configuration might be able to support
> > > 1GB mappings in the future.
> >
> > The xfs realtime device ought to be able to support 1g alignment pretty
> > easily though. :)
>
> Yup, but I think that's the maximum "block" size it can support and
It is; our users with 16G page size are out of luck.
> DAX will have some serious long tail latency and CPU usage issues at
> allocation time because each new 1GB "block" that is dynamically
> allocated will have to be completely zeroed during the allocation
> inside the page fault handler.....
Agreed, 1G pages are most probably too unwieldy to be worth advertising.
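For reference, pinning the rt side at that 1G ceiling would look
roughly like the sketch below (separate rt device assumed; 1g is the
largest rt extent size mkfs.xfs will accept), which also illustrates
the zeroing cost: every freshly allocated rt "block" is a full 1G.
  $ mkfs.xfs -r rtdev=/dev/pmem1,extsize=1g -d rtinherit=1 /dev/pmem0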
--D
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com