Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
To: Jan Kara <jack@suse.cz>, "Williams, Dan J" <dan.j.williams@intel.com>
Cc: linux-nvdimm <linux-nvdimm@lists.01.org>,
	"jack@suse.de" <jack@suse.de>, "Jeff Moyer" <jmoyer@redhat.com>,
	"Michal Suchánek" <msuchanek@suse.de>,
	linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Subject: Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.
Date: Mon, 08 Jun 2020 13:12:58 +0530
Message-ID: <877dwi9f5p.fsf@linux.ibm.com>
In-Reply-To: <f8113a5b-be3b-3627-7535-ed2c9e0293f9@linux.ibm.com>



"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> On 6/3/20 1:56 PM, Jan Kara wrote:
>> On Tue 02-06-20 17:59:08, Williams, Dan J wrote:
>>> [ forgive formatting, a series of unfortunate events has me using Outlook for the moment ]
>>>
>>>> From: Jan Kara <jack@suse.cz>
>>>>>>> These flags are device properties that affect the kernel and
>>>>>>> userspace's handling of persistence.
>>>>>>>
>>>>>>
>>>>>> That will not handle the scenario with multiple applications using
>>>>>> the same fsdax mount point where one is updated to use the new
>>>>>> instruction and the other is not.
>>>>>
>>>>> Right, it needs to be a global setting / flag day to switch from one
>>>>> regime to another. Per-process control is a recipe for disaster.
>>>>
>>>> First I'd like to mention that hopefully the concern is mostly theoretical since
>>>> as Aneesh wrote above, real persistent memory never shipped for PPC and
>>>> so there are very few apps (if any) using the old way to ensure cache
>>>> flushing.
>>>>
>>>> But I'd like to understand why you think per-process control is a recipe for
>>>> disaster. Because from my POV the sysfs interface you propose is actually
>>>> difficult to use in practice. As a distributor, you have a hard time picking the
>>>> default because you have a choice between picking the safe option, which is
>>>> going to confuse users because of failing MAP_SYNC, and the unsafe option,
>>>> where everyone will be happy until someone loses data because of some
>>>> ancient application using the wrong instructions to persist data. Poor experience
>>>> for users either way. And when the distro defaults to the "safe option", then the
>>>> burden is on the sysadmin to toggle the switch but how is he supposed to
>>>> decide when that is safe? First he has to understand what the problem
>>>> actually is, then he has to audit all the applications using pmem to see whether
>>>> they use the new instruction - which is IMO a lot of effort if you have a couple of
>>>> applications and practically infeasible if you have more of them.
>>>> So IMO the burden should be *on the application* to declare that it is aware
>>>> of the new instructions to flush pmem on the platform, and only to such
>>>> applications should the kernel give the trust to use MAP_SYNC mappings.
>>>
>>> The "disaster" in my mind is this need to globally change the ABI for
>>> persistence semantics for all of Linux because one CPU wants a do over.
>>> What does a generic "MAP_SYNC_ENABLE" knob even mean to the existing
>>> deployed base of persistent memory applications? Yes, sysfs is awkward,
>>> but it's trying to provide some relief without imposing unexplainable
>>> semantics on everyone else. I think a comprehensive (overengineered)
>>> solution would involve not introducing another "I know what I'm doing"
>>> flag to the interface, but maybe requiring applications to call a pmem
>>> sync API in something like a vsyscall. Or, also overengineered, some
>>> binary translation / interpretation to actively detect and kill
>>> applications that deploy the old instructions. Something horrid like, on the
>>> first write fault to a MAP_SYNC mapping, trying to look ahead in the binary
>>> for the correct sync sequence and killing the application otherwise. That would at
>>> least provide some enforcement and safety without requiring other
>>> architectures to consider what MAP_SYNC_ENABLE means to them.
>> 
>> Thanks for the explanation. So I absolutely agree that other architectures (and
>> even older versions of POWER architecture) must not be influenced by the new
>> tunable. That's why I wrote in my reply to Aneesh that I'd be for checking
>> during mmap(2) with MAP_SYNC, whether we are in a situation where new PPC
>> flush instructions are required and *only in that case* decide based on the
>> prctl value whether MAP_SYNC should be allowed or not.
>> 
>
> v2 version of the patch series does that
>
> https://lore.kernel.org/linuxppc-dev/20200602074909.36738-1-aneesh.kumar@linux.ibm.com/
>
>> Whether this solution is overengineering or not depends on how likely you
>> think it is that there will be applications trying to use old flush
>> instructions with MAP_SYNC on POWER10 platforms...
>> 
>
> Now considering that with ppc64 we never had a real persistent memory 
> device available for the end-user to try and the new instructions are 
> only needed on newer hardware, can we assume we have enough time to get 
> the userspace to use new instructions?
>
> As a safety net, we can keep the dax device-specific sysfs control. But 
> in reality, by the time newer hardware gets released, we can get the 
> distributions updated to flip the CONFIG_ARCH_MAP_SYNC_DISABLE=n?
>
> With this:
> 1) vPMEM continues to work, since it is a volatile region that
> doesn't need any flush instructions.
>
> 2) We get pmdk and other user applications updated to use new 
> instructions and make sure updated packages are made available to all 
> distributions
>
> 3) On newer hardware, the device will appear with a new compat string. 
> Hence older distributions won't initialize pmem on newer hardware.
>
> 4) If we have a newer kernel with an older distro, we use the
> per-namespace sysfs knob that prevents the usage of MAP_SYNC.
>
> 5) After a year or so we set CONFIG_ARCH_MAP_SYNC_DISABLE=n
> on ppc64 when we are confident that everybody is using the new flush
> instruction.
>

Is this approach ok for distributions? If so I can repost the series
dropping the prctl change.

-aneesh
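
For application authors following the thread: whichever knob ends up
disabling MAP_SYNC (sysfs, Kconfig, or prctl), userspace sees it as
mmap(2) failing rather than silently downgrading the mapping. Below is a
minimal sketch, not taken from the patch series, of how an application
might cope: request MAP_SYNC and fall back to a plain shared mapping
that relies on msync(). The helper name map_prefer_sync() is
hypothetical, and the fallback #define values are the generic Linux UAPI
ones, which is an assumption for architectures with their own mman.h.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Older libc headers may lack these flags. The values are the generic
 * Linux UAPI ones (assumption: some architectures define others). */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif

/*
 * Hypothetical helper: map the first `len` bytes of `path` shared and
 * writable, preferring MAP_SYNC.
 *
 * On success returns the mapping and sets *is_sync to 1 if MAP_SYNC was
 * granted, or 0 if we fell back to plain MAP_SHARED, in which case the
 * caller must use msync()/fsync() for durability.
 * Returns MAP_FAILED on error.
 */
void *map_prefer_sync(const char *path, size_t len, int *is_sync)
{
    assert(is_sync != NULL);
    *is_sync = 0;

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return MAP_FAILED;

    /* MAP_SYNC is only honoured together with MAP_SHARED_VALIDATE. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (p != MAP_FAILED) {
        *is_sync = 1;
    } else if (errno == EOPNOTSUPP || errno == EINVAL) {
        /* MAP_SYNC refused (EOPNOTSUPP), or the kernel predates
         * MAP_SHARED_VALIDATE (EINVAL): retry without it. */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    }

    /* The mapping stays valid after the descriptor is closed. */
    int saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return p;
}
```

A caller would check is_sync after a successful return: with 1 it can
persist stores using the platform's CPU flush sequence, with 0 it should
call msync(p, len, MS_SYNC) instead. This only shows the fallback path;
it does not address the concern discussed above, namely an application
that gets MAP_SYNC but uses the wrong flush instructions.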

Thread overview: 20+ messages
2020-05-29  5:41 [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support Aneesh Kumar K.V
2020-05-29  5:41 ` [RFC PATCH 2/2] powerpc/pmem: Disable synchronous fault by default Aneesh Kumar K.V
2020-05-29  9:33 ` [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support Michal Suchánek
2020-05-29  9:37   ` Aneesh Kumar K.V
2020-05-29  9:52     ` Jan Kara
2020-05-29 10:55       ` Aneesh Kumar K.V
2020-05-29 19:22         ` Dan Williams
2020-05-30  7:18           ` Aneesh Kumar K.V
2020-05-30 16:35             ` Dan Williams
2020-06-01  9:50               ` Jan Kara
2020-06-02 17:59                 ` Williams, Dan J
2020-06-03  8:26                   ` Jan Kara
2020-06-03  9:09                     ` Aneesh Kumar K.V
2020-06-08  7:42                       ` Aneesh Kumar K.V [this message]
2020-06-01 10:09         ` Jan Kara
2020-06-01 12:01           ` Aneesh Kumar K.V
2020-06-01 12:07             ` Michal Suchánek
2020-06-01 12:20               ` Aneesh Kumar K.V
2020-06-02  7:57                 ` Aneesh Kumar K.V
2020-06-01 14:56             ` Jan Kara
