Re: [PATCH] dax, pmem: add support for msync

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Ross Zwisler <ross.zwisler@linux.intel.com>
To: Boaz Harrosh <boaz@plexistor.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>,
	linux-kernel@vger.kernel.org, Andrew Morton <akpm@osdl.org>,
	x86@kernel.org, linux-nvdimm@lists.01.org,
	Peter Zijlstra <peterz@infradead.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Hugh Dickins <hughd@google.com>,
	linux-mm@kvack.org, Ingo Molnar <mingo@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	"H. Peter Anvin" <hpa@zytor.com>,
	linux-fsdevel@vger.kernel.org, Christoph Hellwig <hch@lst.de>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Subject: Re: [PATCH] dax, pmem: add support for msync
Date: Wed, 2 Sep 2015 11:47:52 -0600	[thread overview]
Message-ID: <20150902174752.GA26189@linux.intel.com> (raw)
In-Reply-To: <55E5A44A.1050206@plexistor.com>

On Tue, Sep 01, 2015 at 04:12:42PM +0300, Boaz Harrosh wrote:
> On 08/31/2015 09:59 PM, Ross Zwisler wrote:
> > @@ -753,3 +755,18 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
> >  	return dax_zero_page_range(inode, from, length, get_block);
> >  }
> >  EXPORT_SYMBOL_GPL(dax_truncate_page);
> > +
> > +void dax_sync_range(unsigned long addr, size_t len)
> > +{
> > +	while (len) {
> > +		size_t chunk_len = min_t(size_t, SZ_1G, len);
> > +
> 
> Where does the  SZ_1G come from is it because you want to do cond_resched()
> every 1G bytes so not to get stuck for a long time?
> 
> It took me a while to catch, At first I thought it might be do to wb_cache_pmem()
> limitations. Would you put a comment in the next iteration?

Yep, the SZ_1G is just to make sure we cond_reshced() every once in a while.
Is there a documented guideline somewhere as to how long a kernel thread is
allowed to spin before calling cond_resched()?   So far I haven' been able to
find anything solid on this - it seems like each developer has their own
preferences, and that those preferences vary pretty widely.

In any case, assuming we continue to separate the msync() and fsync()
implementations for DAX (which right now I'm doubting, to be honest), I'll add
in a comment to explain this logic.

> > diff --git a/include/linux/pmem.h b/include/linux/pmem.h
> > index 85f810b3..aa29ebb 100644
> > --- a/include/linux/pmem.h
> > +++ b/include/linux/pmem.h
> > @@ -53,12 +53,18 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
> >  {
> >  	BUG();
> 
> See below
> 
> >  }
> > +
> > +static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
> > +{
> > +	BUG();
> 
> There is a clflush_cache_range() defined for generic use. On ADR systems (even without pcommit)
> this works perfectly and is persistent. why not use that in the generic case?

Nope, we really do need to use wb_cache_pmem() because clflush_cache_range()
isn't an architecture neutral API.  wb_cache_pmem() also has the advantage
that on x86 it will take advantage of the new CLWB instruction if it is
available on the platform, and it doesn't introduce any unnecessary memory
fencing.  This works on both PCOMMIT-aware systems and on ADR boxes without
PCOMMIT.

> One usage of pmem is overlooked by all this API. The use of DRAM as pmem, across a VM
> or cross reboot. you have a piece of memory exposed as pmem to the subsytem which survives
> past the boot of that system. The CPU cache still needs flushing in this case.
> (People are already using this for logs and crash dumps)

I'm confused about this "DRAM as pmem" use case - are the requirements
essentially the same as the ADR case?  You need to make sure that pre-reboot
the dirty cache lines have been flushed from the processor cache, but if they
are in platform buffers (the "safe zone" for ADR) you're fine?

If so, we're good to go, I think.  Dan's most recent patch series made it so
we correctly handle systems that have the PMEM API but not PCOMMIT:

https://lists.01.org/pipermail/linux-nvdimm/2015-August/002005.html

If the "DRAM as pmem across reboots" case isn't okay with your dirty data
being in the ADR safe zone, I think you're toast.  Without PCOMMIT the kernel
cannot guarantee that the data has ever made it durably to the DIMMs,
regardless what clflush/clflushopt/clwb magic you do.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

     prev parent reply	other threads:[~2015-09-02 17:47 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-08-31 18:59 [PATCH] dax, pmem: add support for msync Ross Zwisler
2015-08-31 19:06 ` Christoph Hellwig
2015-08-31 19:26   ` Ross Zwisler
2015-08-31 19:34     ` Christoph Hellwig
2015-08-31 23:38 ` Dave Chinner
2015-09-01  7:06   ` Christoph Hellwig
2015-09-01 12:18     ` Boaz Harrosh
2015-09-02 19:04       ` Ross Zwisler
2015-09-02 20:17         ` Kirill A. Shutemov
2015-09-03  6:32         ` Boaz Harrosh
2015-09-03 16:44           ` Ross Zwisler
2015-09-01 22:21     ` Dave Chinner
2015-09-02  3:19       ` Ross Zwisler
2015-09-02  5:17         ` Dave Chinner
2015-09-02 10:27           ` Boaz Harrosh
2015-09-02 14:23             ` Dave Hansen
2015-09-02 15:18               ` Boaz Harrosh
2015-09-02 15:39                 ` Dave Hansen
2015-09-02 16:00                   ` Boaz Harrosh
2015-09-02 16:19                     ` Dave Hansen
2015-09-03  6:41                       ` Boaz Harrosh
2015-09-02 10:04         ` Boaz Harrosh
2015-09-01 10:08   ` Kirill A. Shutemov
2015-09-01 11:27     ` Boaz Harrosh
2015-09-01 22:49     ` Dave Chinner
2015-09-02  9:13       ` Kirill A. Shutemov
2015-09-02  9:37         ` Boaz Harrosh
2015-09-02  9:41           ` Boaz Harrosh
2015-09-02  9:47             ` Kirill A. Shutemov
2015-09-02 10:28               ` Boaz Harrosh
2015-09-03  0:57         ` Dave Chinner
2015-09-01 13:12 ` Boaz Harrosh
2015-09-02 17:47   ` Ross Zwisler [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150902174752.GA26189@linux.intel.com \
    --to=ross.zwisler@linux.intel.com \
    --cc=akpm@osdl.org \
    --cc=boaz@plexistor.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=hch@lst.de \
    --cc=hpa@zytor.com \
    --cc=hughd@google.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nvdimm@lists.01.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=viro@zeniv.linux.org.uk \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).