Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [RFC] implementing tape statistics single file vs multi-file in sysfs
From: James Bottomley @ 2015-02-08 17:35 UTC (permalink / raw)
  To: Greg KH
  Cc: Laurence Oberman, Bryn M. Reeves, Seymour, Shane M,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Kai.Makisara-9Aww8k/80nUxHbG02/KK1g@public.gmane.org,
	Laurence Oberman (loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org)
In-Reply-To: <20150208024506.GC15396-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>

On Sun, 2015-02-08 at 10:45 +0800, Greg KH wrote:
> On Sat, Feb 07, 2015 at 09:27:05PM -0500, Laurence Oberman wrote:
> > Hello
> > Its not going to be tens of thousands of devices. That count was an
> > aggregate based on 1000's of servers.
> > In reality its unlikely to ever be more than 100 tapes drives per
> > individual Linux kernel instance.
> > Therefore sysfs will be the valid way to do this and make the data
> > available to user space.
> 
> Even if it is only 2 tape drives, again, what's wrong with using the
> existing i/o statistic interfaces that all block devices have?

Tape is a character device.  It only uses block via SCSI (SCSI uses
block to give an issue queue for every device).  One of the problems
with this model is that the block kobj, where all the statistics hang,
is actually never exposed for these devices because they don't have a
block name.  Even granted that we could alter block to give names to the
nameless queues and expose them in /sys/block, we'd still have the
problem, the queue statistics are the property of the pluggable I/O
scheduler, so there's a disconnect between the SCSI upper layer drivers
and the block scheduler (since the latter is embedded by design).
Pulling that apart would get us into a fairly nasty layering violation
(drivers aren't supposed to care about the scheulders).

>   Don't go
> making special one-off interfaces for one type of device if at all
> possible.

I don't really see any way around this.  The statistics the block
schedulers collect are relevant to I/O load balancing; that's not at all
the same class of statistics as the users of tape are interested in.
This problem is equivalent to the fibrechannel one where we collect the
fc_host_statistics in the scsi_transport_fc.c class as an attribute
group (block doesn't want to see or know any of the information because
it's all relevant to the transport, not the block abstraction).

James

^ permalink raw reply

* Re: [PATCH 01/13] kdbus: add documentation
From: Andy Lutomirski @ 2015-02-08 16:54 UTC (permalink / raw)
  To: David Herrmann
  Cc: Arnd Bergmann, Ted Ts'o, Linux API, Michael Kerrisk,
	Daniel Mack, One Thousand Gnomes, Austin S Hemmelgarn,
	Tom Gundersen, Greg Kroah-Hartman, linux-kernel,
	Eric W. Biederman, Djalal Harouni, Johannes Stezenbach,
	Christoph Hellwig
In-Reply-To: <CANq1E4ShseFuTp0wPrHM9mFmgA-y9Kqz1m0-FmU9qALuxQ8Qvg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Feb 4, 2015 4:16 PM, "David Herrmann" <dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>
> Hi
>
> On Thu, Feb 5, 2015 at 12:03 AM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> > I see "latencies" of around 20 microseconds with lockdep and context
> > tracking off.  For example:
>
> Without metadata nor memfd transmission, I get 2.5us for kdbus, 1.5us
> for UDS (8k payload). With 8-byte payloads, I get 2.2us and 1.2us. I
> suspect you enabled metadata transmission, which I think is not a fair
> comparison.

I tried to disable metadata.  I may have failed.

Regardless, if metadata is very slow, then that's more reason not to
use it on send.  And if you shouldn't use it, then maybe the kernel
shouldn't provide it.

I assumed there was a context switch in there.  I can try to test
differently.  If UDS is twice as fast *with* a contest switch, then a
userspace solution should be faster.

Also, UDS can use memfds, too.

>
> A few notes on that:
>
> * kdbus is a bus layer. We don't intend to replace UDS, but improve
> dbus. Comparing roundtrip times with UDS is tempting, but in no way
> fair. To the very least, a bus layer has to perform peer-lookup, which
> UDS does not have to do. Imo, 2.5us vs. 1.5us is already pretty nice.
> Compare this to ~77us for dbus1 without marshaling.

This makes me wonder what dbus1 is doing wrong.

>
> * We have not optimized kdbus code-paths for speed, yet. Our main
> concerns are algorithmic challenges, and we believe they've been
> improved considerably with kdbus. I have constantly measured kdbus
> performance with 'perf' and flame-graphs, and there're a lot of
> possible optimizations (especially on locking). However, I think this
> can be done afterwards just fine. Neither API nor ioctl overhead has
> shown up in my measurements. If anyone has counter evidence, please
> let us know. But I'm a bit reluctant to change our API solely based on
> performance guesses.

But removal of send-time metadata can't be done after the fact.

--Andy

^ permalink raw reply

* Re: [RFC] implementing tape statistics single file vs multi-file in sysfs
From: "Kai Mäkisara (Kolumbus)" @ 2015-02-08  8:32 UTC (permalink / raw)
  To: Greg KH
  Cc: Laurence Oberman, Bryn M. Reeves, Seymour, Shane M,
	linux-api@vger.kernel.org, linux-scsi@vger.kernel.org,
	James E.J. Bottomley (JBottomley@parallels.com),
	Laurence Oberman (loberman@redhat.com)
In-Reply-To: <20150208024506.GC15396@kroah.com>


> On 8.2.2015, at 4.45, Greg KH <gregkh@linuxfoundation.org> wrote:
> 
> On Sat, Feb 07, 2015 at 09:27:05PM -0500, Laurence Oberman wrote:
>> Hello
>> Its not going to be tens of thousands of devices. That count was an
>> aggregate based on 1000's of servers.
>> In reality its unlikely to ever be more than 100 tapes drives per
>> individual Linux kernel instance.
>> Therefore sysfs will be the valid way to do this and make the data
>> available to user space.
> 
> Even if it is only 2 tape drives, again, what's wrong with using the
> existing i/o statistic interfaces that all block devices have?  Don't go
> making special one-off interfaces for one type of device if at all
> possible.
> 
The tape driver does not have access to the block device i/o statistics because the tapes are not block devices but character devices. They use the Linux block subsystem enough to do i/o but this does not include access to the statistics counters.

The purpose of the suggested text vector format patch is to create statistics that have the same format as the existing i/o statistics that the block devices so that the existing tools can be used with minimal modifications. One alternative is, of course, to tie the st driver more into the Linux block device system. I have looked into this several times but have never seen how to do this in a simple enough way.

Thanks,
Kai


^ permalink raw reply

* Re: [PATCH v1 1/5] IB/uverbs: ex_query_device: answer must not depend on request's comp_mask
From: Haggai Eran @ 2015-02-08  7:27 UTC (permalink / raw)
  To: Weiny, Ira, Jason Gunthorpe, Yann Droneaud
  Cc: Sagi Grimberg, Shachar Raindel, Eli Cohen, Roland Dreier,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <2807E5FD2F6FDA4886F6618EAC48510E0CC201F7-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>

On 07/02/2015 02:52, Weiny, Ira wrote:
>>
>> On 05/02/2015 04:54, Weiny, Ira wrote:
>>>>
>>>> On Thu, Jan 29, 2015 at 09:50:38PM +0100, Yann Droneaud wrote:
>>>>
>>>>> Anyway, I recognize that uverb way of abusing write() syscall is
>>>>> borderline (at best) regarding other Linux subsystems and Unix
>>>>> paradigm in general. But it's not enough to screw it more.
>>>>
>>>> Then we must return the correct output size explicitly in the struct.
>>>
>>> I was thinking this very same thing as I read through this thread.
>>>
>>> I too would like to avoid the use of comp_masks if at all possible.  The query
>> call seems to be a verb where the structure size is all you really need to know
>> the set of values returned.
>>>
>>> As Jason says, other calls may require the comp_mask where 0 is not
>> sufficient to indicate "missing".
>>
...
>  
>> However, the only other alternative I see is to add it explicitly to every uverb
>> response struct.
>>
> 
> I think this is the best solution.  There is a 32 bit reserved field in ib_uverbs_ex_query_device_resp.  Could we use all or part of that to be the size?

Yes, I think 32-bit for the response length are more than enough.

I will send patches for 3.20 to re-introduce ib_uverbs_ex_query_device
with the response size instead of the reserved field, and with Yann's
changes.

Thanks,
Haggai

^ permalink raw reply

* Re: [RFC] implementing tape statistics single file vs multi-file in sysfs
From: Greg KH @ 2015-02-08  2:45 UTC (permalink / raw)
  To: Laurence Oberman
  Cc: Bryn M. Reeves, Seymour, Shane M, linux-api@vger.kernel.org,
	linux-scsi@vger.kernel.org, Kai.Makisara@kolumbus.fi,
	James E.J. Bottomley (JBottomley@parallels.com),
	Laurence Oberman (loberman@redhat.com)
In-Reply-To: <ACE69613-157F-4516-8499-EF8242D26736@gmail.com>

On Sat, Feb 07, 2015 at 09:27:05PM -0500, Laurence Oberman wrote:
> Hello
> Its not going to be tens of thousands of devices. That count was an
> aggregate based on 1000's of servers.
> In reality its unlikely to ever be more than 100 tapes drives per
> individual Linux kernel instance.
> Therefore sysfs will be the valid way to do this and make the data
> available to user space.

Even if it is only 2 tape drives, again, what's wrong with using the
existing i/o statistic interfaces that all block devices have?  Don't go
making special one-off interfaces for one type of device if at all
possible.

thanks,

greg k-h

^ permalink raw reply

* Re: [RFC] implementing tape statistics single file vs multi-file in sysfs
From: Laurence Oberman @ 2015-02-08  2:27 UTC (permalink / raw)
  To: Greg KH
  Cc: Bryn M. Reeves, Seymour, Shane M, linux-api@vger.kernel.org,
	linux-scsi@vger.kernel.org, Kai.Makisara@kolumbus.fi,
	James E.J. Bottomley (JBottomley@parallels.com),
	Laurence Oberman (loberman@redhat.com)
In-Reply-To: <20150207040743.GB29944@kroah.com>

Hello
Its not going to be tens of thousands of devices. That count was an
aggregate based on 1000's of servers.
In reality its unlikely to ever be more than 100 tapes drives per
individual Linux kernel instance.
Therefore sysfs will be the valid way to do this and make the data
available to user space.

Thanks
Laurence


> On Feb 6, 2015, at 11:07 PM, Greg KH <gregkh@linuxfoundation.org> wrote:
> 
>> On Fri, Feb 06, 2015 at 03:41:58PM +0000, Bryn M. Reeves wrote:
>>> On Fri, Feb 06, 2015 at 04:59:16AM -0800, Greg KH wrote:
>>>> On Fri, Feb 06, 2015 at 12:20:53AM +0000, Seymour, Shane M wrote:
>>>> The current patch that implements tape statistics is here:
>>>> 
>>>> http://marc.info/?l=linux-scsi&m=142112067313723&w=2
>>> 
>>> Aside from the "do we want to do this all in a single file" issue that I
>>> will say more on below, this patch has issues.  Please don't use a
>>> kobject for _ANYTHING_ in sysfs that has a struct device as a parent.
>>> If you do that, it can't be seen by userspace tools very well, if at
>>> all.
>> 
>> I can't speak for Shane but wouldn't spend too much time looking at the
>> current v2 patch: it's the result of a pretty ugly compromise suggested
>> on linux-scsi.
> 
> Fair enough, but please feel free to cc: me on the patch that you do
> feel is correct to get a sysfs-related review.
> 
>>>> Recently there was was another discussion here about one file vs a collection of files for tape statistics:
>>>> 
>>>> http://marc.info/?l=linux-scsi&m=142316255501550&w=2
>>>> 
>>>> The result was that I should ask here what method I should use. I
>>>> would like to get feedback in relation to tape statistics and one file
>>>> vs multi-file in sysfs. I'm happy to keep the existing code or change
>>>> to a single file approach.
>>> 
>>> One of the primary reasons we created sysfs and the "one value per file"
>>> rule is that multi-value files just do not work well.  Yes, you get an
>>> atomic snapshot, and you save some open/read/close syscall roundtrips,
>>> but you do so at the expense of forcing userspace to "know" what the
>>> format of the file is.  And once you create it, you can NEVER CHANGE IT
>>> AGAIN.
>> 
>> I am not convinced this is a concern for tape statistics: they are pretty
>> much a solved problem. The commercial *nixes have had this for decades.
>> 
>> Likewise for disk stats: although fluff like maj:min/name etc. has been
>> shuffled a few times the basic fields have remained unchanged for a very
>> long time and sysfs already removes the need to include an identity
>> field.
> 
> We already handle i/o stats just fine, why create a special sysfs
> interface for just a tape device interface?  What makes them so special?
> 
>>> Yes, that's right, if you come up with some new statistic in the future,
>>> or realize that one of the ones you have now is wrong, you can't change
>>> it, you have to make a whole new file, otherwise you could break
>>> userspace tools.
>> 
>> I understand the fact that you can't change them; I just don't think it's
>> a big problem in this specific case (and much less than some of the
>> more imaginative sysfs content - 2d int arrays with column headers
>> anyone?).
> 
> What sysfs file is a 2d int array?  I'll be glad to fix it.
> 
> Also, everyone doesn't think their solution will ever need to be
> changed.  Until later when it does :)
> 
>>> And yes, open/read/close does take take a few extra cycles, but you
>>> can't really measure it for a virtual filesystem like this on any modern
>>> system.
>> 
>> I'll try to get some numbers when I get back home next week - Shane is
>> talking about use cases involving tens of thousands of tape devices. I
>> am not certain that the overhead would be unmeasurable in that case: the
>> additional context switching & TLB flushes alone seem like they would
>> add up.
> 
> If you want to measure tens of thousands of tape devices then you
> shouldn't be using sysfs in the first place as it is not designed for
> "speed" at all.  Use the existing i/o rate interfaces instead, don't try
> to cram something into sysfs that doesn't belong there.
> 
> thanks,
> 
> greg k-h
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC] implementing tape statistics single file vs multi-file in sysfs
From: Greg KH @ 2015-02-07  4:07 UTC (permalink / raw)
  To: Bryn M. Reeves
  Cc: Seymour, Shane M, linux-api@vger.kernel.org,
	linux-scsi@vger.kernel.org, Kai.Makisara@kolumbus.fi,
	James E.J. Bottomley (JBottomley@parallels.com),
	Laurence Oberman (loberman@redhat.com)
In-Reply-To: <20150206154157.GD1143@localhost.localdomain>

On Fri, Feb 06, 2015 at 03:41:58PM +0000, Bryn M. Reeves wrote:
> On Fri, Feb 06, 2015 at 04:59:16AM -0800, Greg KH wrote:
> > On Fri, Feb 06, 2015 at 12:20:53AM +0000, Seymour, Shane M wrote:
> > > The current patch that implements tape statistics is here:
> > > 
> > > http://marc.info/?l=linux-scsi&m=142112067313723&w=2
> > 
> > Aside from the "do we want to do this all in a single file" issue that I
> > will say more on below, this patch has issues.  Please don't use a
> > kobject for _ANYTHING_ in sysfs that has a struct device as a parent.
> > If you do that, it can't be seen by userspace tools very well, if at
> > all.
> 
> I can't speak for Shane but wouldn't spend too much time looking at the
> current v2 patch: it's the result of a pretty ugly compromise suggested
> on linux-scsi.

Fair enough, but please feel free to cc: me on the patch that you do
feel is correct to get a sysfs-related review.

> > > Recently there was was another discussion here about one file vs a collection of files for tape statistics:
> > > 
> > > http://marc.info/?l=linux-scsi&m=142316255501550&w=2
> > > 
> > > The result was that I should ask here what method I should use. I
> > > would like to get feedback in relation to tape statistics and one file
> > > vs multi-file in sysfs. I'm happy to keep the existing code or change
> > > to a single file approach.
> > 
> > One of the primary reasons we created sysfs and the "one value per file"
> > rule is that multi-value files just do not work well.  Yes, you get an
> > atomic snapshot, and you save some open/read/close syscall roundtrips,
> > but you do so at the expense of forcing userspace to "know" what the
> > format of the file is.  And once you create it, you can NEVER CHANGE IT
> > AGAIN.
> 
> I am not convinced this is a concern for tape statistics: they are pretty
> much a solved problem. The commercial *nixes have had this for decades.
> 
> Likewise for disk stats: although fluff like maj:min/name etc. has been
> shuffled a few times the basic fields have remained unchanged for a very
> long time and sysfs already removes the need to include an identity
> field.

We already handle i/o stats just fine, why create a special sysfs
interface for just a tape device interface?  What makes them so special?

> > Yes, that's right, if you come up with some new statistic in the future,
> > or realize that one of the ones you have now is wrong, you can't change
> > it, you have to make a whole new file, otherwise you could break
> > userspace tools.
> 
> I understand the fact that you can't change them; I just don't think it's
> a big problem in this specific case (and much less than some of the
> more imaginative sysfs content - 2d int arrays with column headers
> anyone?).

What sysfs file is a 2d int array?  I'll be glad to fix it.

Also, everyone doesn't think their solution will ever need to be
changed.  Until later when it does :)

> > And yes, open/read/close does take take a few extra cycles, but you
> > can't really measure it for a virtual filesystem like this on any modern
> > system.
> 
> I'll try to get some numbers when I get back home next week - Shane is
> talking about use cases involving tens of thousands of tape devices. I
> am not certain that the overhead would be unmeasurable in that case: the
> additional context switching & TLB flushes alone seem like they would
> add up.

If you want to measure tens of thousands of tape devices then you
shouldn't be using sysfs in the first place as it is not designed for
"speed" at all.  Use the existing i/o rate interfaces instead, don't try
to cram something into sysfs that doesn't belong there.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH v3] samsung-laptop: enable better lid handling
From: Darren Hart @ 2015-02-07  2:38 UTC (permalink / raw)
  To: Julijonas Kikutis
  Cc: Corentin Chary, open list:ABI/API, open list,
	open list:SAMSUNG LAPTOP DR...
In-Reply-To: <1422536696-1718-1-git-send-email-julijonas.kikutis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

On Thu, Jan 29, 2015 at 01:04:37PM +0000, Julijonas Kikutis wrote:
> Some Samsung laptops with SABI3 delay the sleep for 10 seconds after
> the lid is closed and do not wake up from sleep after the lid is opened.
> A SABI command is needed to enable the better behavior.
> 
> Command = 0x6e, d0 = 0x81 enables this behavior. Returns d0 = 0x01.
> Command = 0x6e, d0 = 0x80 disables this behavior. Returns d0 = 0x00.
> 
> Command = 0x6d and any d0 queries the state. This returns:
> d0 = 0x00000*01, d1 = 0x00, d2 = 0x00, d3 = 0x0* when it is enabled.
> d0 = 0x00000*00, d1 = 0x00, d2 = 0x00, d3 = 0x0* when it is disabled.
> Where * is 0 - laptop has never slept or hibernated after switch on,
>            1 - laptop has hibernated just before,
>            2 - laptop has slept just before.
> 
> Patch addresses bug https://bugzilla.kernel.org/show_bug.cgi?id=75901 .
> It adds a sysfs attribute lid_handling with a description and also an
> addition to the quirks structure to enable the mode by default.
> 
> A user with another laptop in the bug report says that "power button has
> to be pressed twice to wake the machine" when he or she enabled the mode
> manually using the SABI command. Therefore, it is enabled by default
> only for the single laptop that I have tested.
> 
> Signed-off-by: Julijonas Kikutis <julijonas.kikutis-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

Queued up, thanks Jujijonas.

-- 
Darren Hart
Intel Open Source Technology Center

^ permalink raw reply

* RE: [PATCH v1 1/5] IB/uverbs: ex_query_device: answer must not depend on request's comp_mask
From: Weiny, Ira @ 2015-02-07  0:52 UTC (permalink / raw)
  To: Haggai Eran, Jason Gunthorpe, Yann Droneaud
  Cc: Sagi Grimberg, Shachar Raindel, Eli Cohen, Roland Dreier,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <54D32933.8080307-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

> 
> On 05/02/2015 04:54, Weiny, Ira wrote:
> >>
> >> On Thu, Jan 29, 2015 at 09:50:38PM +0100, Yann Droneaud wrote:
> >>
> >>> Anyway, I recognize that uverb way of abusing write() syscall is
> >>> borderline (at best) regarding other Linux subsystems and Unix
> >>> paradigm in general. But it's not enough to screw it more.
> >>
> >> Then we must return the correct output size explicitly in the struct.
> >
> > I was thinking this very same thing as I read through this thread.
> >
> > I too would like to avoid the use of comp_masks if at all possible.  The query
> call seems to be a verb where the structure size is all you really need to know
> the set of values returned.
> >
> > As Jason says, other calls may require the comp_mask where 0 is not
> sufficient to indicate "missing".
> 
> Would it be okay to return it in the ib_uverbs_cmd_hdr.out_words? That would
> further abuse the write() syscall by writing to the input buffer.

I don't think that is such a great idea.
 
> However, the only other alternative I see is to add it explicitly to every uverb
> response struct.
> 

I think this is the best solution.  There is a 32 bit reserved field in ib_uverbs_ex_query_device_resp.  Could we use all or part of that to be the size?

For other extended commands I'm not sure what to do.

Ira

^ permalink raw reply

* Re: [PATCH 2/2] kernel/fork: handle put_user errors for CLONE_PARENT_SETTID
From: Linus Torvalds @ 2015-02-06 22:10 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Konstantin Khlebnikov, Linux API, Andrew Morton,
	Linux Kernel Mailing List, Roman Gushchin, Nikita Vetoshkin,
	Oleg Nesterov, Pavel Emelyanov
In-Reply-To: <CALYGNiMv021=WC2uXsjo5zT8JwewweZUDdk0x8FGHh9V5j6bFQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Fri, Feb 6, 2015 at 1:13 PM, Konstantin Khlebnikov <koct9i-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>
> Currently that works fine only because kernel retries 0-order allocations
> endlessly. But pagefault_out_of_memory() is never called for non-user PF.
> For kernel PF all oom-kills are triggered by buddy-allocator.

This makes no sense.

You're trying to fix what you perceive as a problem in the page fault
handling in some totally different place.

If *that* is what you are worried about, then damnit, just fix the
page fault handler for the kernel case to send a signal or something
for VM_FAULT_ENOMEM. Or, better yet, make it just trigger oom at
return to user space - we could add a _TIF_OOM flag, for example, and
make it part of the user-return logic (do_notify_resume), kind of how
_TIF_SIGPENDING triggers a signal.

Don't try to make horrible code in insane places that have nothing to
do with the fundamental problem. Why did you pick this particular
get/put user anyway? There are tons others that we don't test, why did
you happen pick these and then make it have that horrible and
senseless error handling?

Because at *NO* point was it obvious that that patch had anything at
all to do with "out of memory". Not in the code, not in your commit
messages, *nowhere*.

There is no way in hell I will take that kind of changes when you
didn't even articulate why you wanted to do them in the commit
messages, and the patches didn't look like they had anything to do
with oom either.

                    Linus

^ permalink raw reply

* Re: [PATCH 2/2] kernel/fork: handle put_user errors for CLONE_PARENT_SETTID
From: Andy Lutomirski @ 2015-02-06 21:55 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Linus Torvalds, Konstantin Khlebnikov, Linux API, Andrew Morton,
	Linux Kernel Mailing List, Roman Gushchin, Nikita Vetoshkin,
	Oleg Nesterov, Pavel Emelyanov
In-Reply-To: <CALYGNiMv021=WC2uXsjo5zT8JwewweZUDdk0x8FGHh9V5j6bFQ@mail.gmail.com>

On Fri, Feb 6, 2015 at 1:13 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> On Fri, Feb 6, 2015 at 11:49 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Fri, Feb 6, 2015 at 8:23 AM, Konstantin Khlebnikov
>> <khlebnikov@yandex-team.ru> wrote:
>>> Handling of flag CLONE_PARENT_SETTID has the same problem: error returned
>>> from put_user() is ignored. Glibc completely relies on that feature and uses
>>> value returned from syscall only for error checking.
>>
>> I'm not seeing the advantage of the error checking part of the pacth
>> patch. It generates extra code, possibly changing existing interfaces,
>> and it doesn't actually buy us anything.
>>
>> What's the upside? If somebody passes in a bad pointer, it's their
>> problem. For all we know, people used to pass in NULL, even if they
>> had the SETTID bit set. This makes it now return EFAULT.
>
> Currently that works fine only because kernel retries 0-order allocations
> endlessly. But pagefault_out_of_memory() is never called for non-user PF.
> For kernel PF all oom-kills are triggered by buddy-allocator.
> If buddy allocator gave up earlier then page-faults from kernel space
> could fail without OOM. And in CoW area user-space will see stale data.
> So, either we must handle all put_user/copy_to_user errors (which isn't
> that bad idea) or kernel must force all PF to success-or-die policy.
>
> First patch is that ugly because kernel has never checked errors
> in that place. So, I've tried to find solution which could fix problem
> without breaking backward compatibility.

If you're really worried about compatibility, it would be possible, if
really really ugly, to check whether there's a vma at all at the
requested address and to return -EFAULT only in the case where there
is a vma but put_user still failed.

A less awful approach might be to accept put_user failures if the
address is NULL.

--Andy

^ permalink raw reply

* Re: [PATCH 2/2] kernel/fork: handle put_user errors for CLONE_PARENT_SETTID
From: Konstantin Khlebnikov @ 2015-02-06 21:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Konstantin Khlebnikov, Linux API, Andrew Morton,
	Linux Kernel Mailing List, Roman Gushchin, Nikita Vetoshkin,
	Oleg Nesterov, Pavel Emelyanov
In-Reply-To: <CA+55aFxBuf-0UkoYCrwH_vNsWFnUkFOz5c9O_Mswe_w0BTkqbQ@mail.gmail.com>

On Fri, Feb 6, 2015 at 11:49 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Feb 6, 2015 at 8:23 AM, Konstantin Khlebnikov
> <khlebnikov@yandex-team.ru> wrote:
>> Handling of flag CLONE_PARENT_SETTID has the same problem: error returned
>> from put_user() is ignored. Glibc completely relies on that feature and uses
>> value returned from syscall only for error checking.
>
> I'm not seeing the advantage of the error checking part of the pacth
> patch. It generates extra code, possibly changing existing interfaces,
> and it doesn't actually buy us anything.
>
> What's the upside? If somebody passes in a bad pointer, it's their
> problem. For all we know, people used to pass in NULL, even if they
> had the SETTID bit set. This makes it now return EFAULT.

Currently that works fine only because kernel retries 0-order allocations
endlessly. But pagefault_out_of_memory() is never called for non-user PF.
For kernel PF all oom-kills are triggered by buddy-allocator.
If buddy allocator gave up earlier then page-faults from kernel space
could fail without OOM. And in CoW area user-space will see stale data.
So, either we must handle all put_user/copy_to_user errors (which isn't
that bad idea) or kernel must force all PF to success-or-die policy.

First patch is that ugly because kernel has never checked errors
in that place. So, I've tried to find solution which could fix problem
without breaking backward compatibility.

>
> So I don't mind moving things into copy_process(), but I *do* mind the
> new error return thing.
>
> It's actually better in this patch than in 1/2, because 1/2 was just
> insane with the whole "readable vs writable" thing. That I refuse to
> even look at, for fear of going blind.
>
>                       Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply

* Re: [PATCH 2/2] kernel/fork: handle put_user errors for CLONE_PARENT_SETTID
From: Oleg Nesterov @ 2015-02-06 21:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Konstantin Khlebnikov, Linux API, Andrew Morton,
	Linux Kernel Mailing List, Roman Gushchin, Nikita Vetoshkin,
	Pavel Emelyanov
In-Reply-To: <CA+55aFxBuf-0UkoYCrwH_vNsWFnUkFOz5c9O_Mswe_w0BTkqbQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

I am not sure about these changes too, but

On 02/06, Linus Torvalds wrote:
>
> What's the upside? If somebody passes in a bad pointer, it's their
> problem.

Yes. But unless I am totally confused (quite possible) this put_user()
can fail even if the pointer is valid.

So at least I think Konstantin has found a real problem. Which (I think)
shoud be fixed, and this is not connected to fork.

> insane with the whole "readable vs writable" thing.

Agreed, this part doesn't look nice in any case.

Oleg.

^ permalink raw reply

* Re: [PATCH 2/2] kernel/fork: handle put_user errors for CLONE_PARENT_SETTID
From: Linus Torvalds @ 2015-02-06 20:49 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Linux API, Andrew Morton, Linux Kernel Mailing List,
	Roman Gushchin, Nikita Vetoshkin, Oleg Nesterov, Pavel Emelyanov
In-Reply-To: <20150206162303.18031.37408.stgit@buzz>

On Fri, Feb 6, 2015 at 8:23 AM, Konstantin Khlebnikov
<khlebnikov-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org> wrote:
> Handling of flag CLONE_PARENT_SETTID has the same problem: error returned
> from put_user() is ignored. Glibc completely relies on that feature and uses
> value returned from syscall only for error checking.

I'm not seeing the advantage of the error checking part of the pacth
patch. It generates extra code, possibly changing existing interfaces,
and it doesn't actually buy us anything.

What's the upside? If somebody passes in a bad pointer, it's their
problem. For all we know, people used to pass in NULL, even if they
had the SETTID bit set. This makes it now return EFAULT.

So I don't mind moving things into copy_process(), but I *do* mind the
new error return thing.

It's actually better in this patch than in 1/2, because 1/2 was just
insane with the whole "readable vs writable" thing. That I refuse to
even look at, for fear of going blind.

                      Linus

^ permalink raw reply

* Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
From: Michal Hocko @ 2015-02-06 20:45 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Vlastimil Babka, Kirill A. Shutemov, Dave Hansen, Mel Gorman,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, Minchan Kim,
	Andrew Morton, lkml, Linux API, linux-man, Hugh Dickins
In-Reply-To: <54D4E47E.4020509-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

On Fri 06-02-15 16:57:50, Michael Kerrisk wrote:
[...]
> > Yes, this wording is better because many users are not aware of
> > MAP_ANON|MAP_SHARED being file backed in fact and mmap man page doesn't
> > mention that.
> 
> (Michal, would you have a text to propose to add to the mmap(2) page?
> Maybe it would be useful to add something there.)

I am half way on vacation, but I can cook a patch after I am back after
week.
 
> > I am just wondering whether it makes sense to mention that MADV_DONTNEED
> > for shared mappings might be surprising and not freeing the backing
> > pages thus not really freeing memory until there is a memory
> > pressure. But maybe this is too implementation specific for a man
> > page. What about the following wording on top of yours?
> > "
> > Please note that the MADV_DONTNEED hint on shared mappings might not
> > lead to immediate freeing of pages in the range. The kernel is free to
> > delay this until an appropriate moment. RSS of the calling process will
> > be reduced however.
> > "
> 
> Thanks! I added this, but dropped in the word "immediately" in the last 
> sentence, since I assume that was implied. So now we have:
> 
>               After  a  successful MADV_DONTNEED operation, the seman‐
>               tics of  memory  access  in  the  specified  region  are
>               changed:  subsequent accesses of pages in the range will
>               succeed, but will result in either repopulating the mem‐
>               ory  contents from the up-to-date contents of the under‐
>               lying mapped file  (for  shared  file  mappings,  shared
>               anonymous  mappings,  and shmem-based techniques such as
>               System V shared memory segments) or  zero-fill-on-demand
>               pages for anonymous private mappings.
> 
>               Note  that,  when applied to shared mappings, MADV_DONT‐
>               NEED might not lead to immediate freeing of the pages in
>               the  range.   The  kernel  is  free to delay freeing the
>               pages until an appropriate  moment.   The  resident  set
>               size  (RSS)  of  the calling process will be immediately
>               reduced however.

This sounds good to me and it is definitely much better than the current
state. Thanks!

> The current draft of the page can be found in a branch,
> http://git.kernel.org/cgit/docs/man-pages/man-pages.git/log/?h=draft_madvise
> 
> Thanks,
> 
> Michael
> 
> 
> 
> -- 
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* memcg && uaccess (Was: [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID)
From: Oleg Nesterov @ 2015-02-06 20:32 UTC (permalink / raw)
  To: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Linus Torvalds,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Roman Gushchin,
	Nikita Vetoshkin, Pavel Emelyanov
In-Reply-To: <20150206195529.GA15517-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Add cc's.

On 02/06, Oleg Nesterov wrote:
>
> And in fact I think that this is not set_child_tid/etc-specific. Perhaps
> I am totally confused, but I think that put_user() simply should not fail
> this way. Say, why a syscall should return -EFAULT if memory allocation
> "silently" fails? Confused.

Seriously. I must have missed something, but I can't understand 519e52473eb
"mm: memcg: enable memcg OOM killer only for user faults".

The changelog says:

	System calls and kernel faults (uaccess, gup) can handle an out of
	memory situation gracefully and just return -ENOMEM.

How can a system call know it should return -ENOMEM if put_user() can only
return -EFAULT ?

Oleg.

^ permalink raw reply

* Re: [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID
From: Konstantin Khlebnikov @ 2015-02-06 20:27 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Konstantin Khlebnikov, Linux API, Andrew Morton, Linus Torvalds,
	Linux Kernel Mailing List, Roman Gushchin, Nikita Vetoshkin,
	Pavel Emelyanov
In-Reply-To: <20150206195529.GA15517-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Fri, Feb 6, 2015 at 10:55 PM, Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On 02/06, Oleg Nesterov wrote:
>>
>> On 02/06, Konstantin Khlebnikov wrote:
>> >
>> > Whole sequence looks like: task calls fork, glibc calls syscall clone with
>> > CLONE_CHILD_SETTID and passes pointer to TLS THREAD_SELF->tid as argument.
>> > Child task gets read-only copy of VM including TLS. Child calls put_user()
>> > to handle CLONE_CHILD_SETTID from schedule_tail(). put_user() trigger page
>> > fault and it fails because do_wp_page()  hits memcg limit without invoking
>> > OOM-killer because this is page-fault from kernel-space.
>>
>> Because of !FAULT_FLAG_USER?

Yep. As I see memcg triggers OOM only on page-faults and only from user-space.

>>
>> Perhaps we should fix this? Say mem_cgroup_oom_enable/disable around put_user(),
>> I dunno.
>>
>> > Put_user returns
>> > -EFAULT, which is ignored.  Child returns into user-space and catches here
>> > assert (THREAD_GETMEM (self, tid) != ppid),
>>
>> If only I understood why else we need CLONE_CHILD_SETTID ;)

I dunno, CLONE_PARENT_SETTID should be enough for everybody but it's
broken too. Twice. See the next patch =)

>>
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>> > @@ -2312,8 +2312,20 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
>> >     post_schedule(rq);
>> >     preempt_enable();
>> >
>> > -   if (current->set_child_tid)
>> > -           put_user(task_pid_vnr(current), current->set_child_tid);
>> > +   if (current->set_child_tid &&
>> > +       unlikely(put_user(task_pid_vnr(current), current->set_child_tid))) {
>> > +           int dummy;
>> > +
>> > +           /*
>> > +            * If this address is unreadable then userspace has not set
>> > +            * proper pointer. Application either doesn't care or will
>> > +            * notice this soon. If this address is readable then task
>> > +            * will be mislead about its own tid. It's better to die.
>> > +            */
>> > +           if (!get_user(dummy, current->set_child_tid) &&
>> > +               !fatal_signal_pending(current))
>> > +                   force_sig(SIGSEGV, current);
>> > +   }
>>
>> Well, get_user() can fail the same way? The page we need to cow can be
>> swapped out.
>>
>> At first glance, to me this problem should be solved somewhere else...
>> I'll try to reread this all tomorrow.
>
> And in fact I think that this is not set_child_tid/etc-specific. Perhaps
> I am totally confused, but I think that put_user() simply should not fail
> this way. Say, why a syscall should return -EFAULT if memory allocation
> "silently" fails? Confused.

That's how memcg works. All other places are handled explicitly and
returns into user-space as -ENOMEM or -EFAULT.
Probably some strange numa policy / memoryaffinity might trigger this too.

Probably all page-faults must be forced to succeed or die mode,
in this case all errors in put_user/copy_to_user could be simply ignored.

--
Konstantin

^ permalink raw reply

* Re: [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID
From: Oleg Nesterov @ 2015-02-06 19:55 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Linus Torvalds,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Roman Gushchin,
	Nikita Vetoshkin, Pavel Emelyanov
In-Reply-To: <20150206194405.GA13960-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On 02/06, Oleg Nesterov wrote:
>
> On 02/06, Konstantin Khlebnikov wrote:
> >
> > Whole sequence looks like: task calls fork, glibc calls syscall clone with
> > CLONE_CHILD_SETTID and passes pointer to TLS THREAD_SELF->tid as argument.
> > Child task gets read-only copy of VM including TLS. Child calls put_user()
> > to handle CLONE_CHILD_SETTID from schedule_tail(). put_user() trigger page
> > fault and it fails because do_wp_page()  hits memcg limit without invoking
> > OOM-killer because this is page-fault from kernel-space.
> 
> Because of !FAULT_FLAG_USER?
> 
> Perhaps we should fix this? Say mem_cgroup_oom_enable/disable around put_user(),
> I dunno.
> 
> > Put_user returns
> > -EFAULT, which is ignored.  Child returns into user-space and catches here
> > assert (THREAD_GETMEM (self, tid) != ppid),
> 
> If only I understood why else we need CLONE_CHILD_SETTID ;)
> 
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -2312,8 +2312,20 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
> >  	post_schedule(rq);
> >  	preempt_enable();
> >
> > -	if (current->set_child_tid)
> > -		put_user(task_pid_vnr(current), current->set_child_tid);
> > +	if (current->set_child_tid &&
> > +	    unlikely(put_user(task_pid_vnr(current), current->set_child_tid))) {
> > +		int dummy;
> > +
> > +		/*
> > +		 * If this address is unreadable then userspace has not set
> > +		 * proper pointer. Application either doesn't care or will
> > +		 * notice this soon. If this address is readable then task
> > +		 * will be mislead about its own tid. It's better to die.
> > +		 */
> > +		if (!get_user(dummy, current->set_child_tid) &&
> > +		    !fatal_signal_pending(current))
> > +			force_sig(SIGSEGV, current);
> > +	}
>
> Well, get_user() can fail the same way? The page we need to cow can be
> swapped out.
>
> At first glance, to me this problem should be solved somewhere else...
> I'll try to reread this all tomorrow.

And in fact I think that this is not set_child_tid/etc-specific. Perhaps
I am totally confused, but I think that put_user() simply should not fail
this way. Say, why a syscall should return -EFAULT if memory allocation
"silently" fails? Confused.

Oleg.

^ permalink raw reply

* Re: [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID
From: Oleg Nesterov @ 2015-02-06 19:44 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Linus Torvalds,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Roman Gushchin,
	Nikita Vetoshkin, Pavel Emelyanov
In-Reply-To: <20150206162301.18031.32251.stgit@buzz>

On 02/06, Konstantin Khlebnikov wrote:
>
> Whole sequence looks like: task calls fork, glibc calls syscall clone with
> CLONE_CHILD_SETTID and passes pointer to TLS THREAD_SELF->tid as argument.
> Child task gets read-only copy of VM including TLS. Child calls put_user()
> to handle CLONE_CHILD_SETTID from schedule_tail(). put_user() trigger page
> fault and it fails because do_wp_page()  hits memcg limit without invoking
> OOM-killer because this is page-fault from kernel-space.

Because of !FAULT_FLAG_USER?

Perhaps we should fix this? Say mem_cgroup_oom_enable/disable around put_user(),
I dunno.

> Put_user returns
> -EFAULT, which is ignored.  Child returns into user-space and catches here
> assert (THREAD_GETMEM (self, tid) != ppid),

If only I understood why else we need CLONE_CHILD_SETTID ;)

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2312,8 +2312,20 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
>  	post_schedule(rq);
>  	preempt_enable();
>  
> -	if (current->set_child_tid)
> -		put_user(task_pid_vnr(current), current->set_child_tid);
> +	if (current->set_child_tid &&
> +	    unlikely(put_user(task_pid_vnr(current), current->set_child_tid))) {
> +		int dummy;
> +
> +		/*
> +		 * If this address is unreadable then userspace has not set
> +		 * proper pointer. Application either doesn't care or will
> +		 * notice this soon. If this address is readable then task
> +		 * will be mislead about its own tid. It's better to die.
> +		 */
> +		if (!get_user(dummy, current->set_child_tid) &&
> +		    !fatal_signal_pending(current))
> +			force_sig(SIGSEGV, current);
> +	}

Well, get_user() can fail the same way? The page we need to cow can be
swapped out.

At first glance, to me this problem should be solved somewhere else...
I'll try to reread this all tomorrow.

Oleg.

^ permalink raw reply

* Re: [PATCH v17 1/7] mm: support madvise(MADV_FREE)
From: Rik van Riel @ 2015-02-06 18:40 UTC (permalink / raw)
  To: Shaohua Li, Michal Hocko
  Cc: Minchan Kim, Michael Kerrisk (man-pages), Andrew Morton,
	linux-kernel, linux-mm, linux-api, Hugh Dickins, Johannes Weiner,
	KOSAKI Motohiro, Mel Gorman, Jason Evans, zhangyanfei,
	Kirill A. Shutemov, Kirill A. Shutemov
In-Reply-To: <20150206183242.GB2290@kernel.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 02/06/2015 01:32 PM, Shaohua Li wrote:
> On Fri, Feb 06, 2015 at 01:58:25PM +0100, Michal Hocko wrote:
>> On Thu 05-02-15 16:33:11, Shaohua Li wrote: [...]
>>> Did you think about move the MADV_FREE pages to the head of
>>> inactive LRU, so they can be reclaimed easily?
>> 
>> Yes this makes sense for pages living on the active LRU list. I
>> would preserve LRU ordering on the inactive list because there is
>> no good reason to make the operation more costly for inactive
>> pages. On the other hand having tons of to-be-freed pages on the
>> active list clearly sucks. Care to send a patch?
> 
> Considering anon pages are in active LRU first, it's likely
> MADV_FREE pages are in active list. I'm curious why preserves the
> order of inactive list.

Only before the first time MADV_FREE is called on those pages.

If a program repeatedly allocates and frees the same memory
region, not moving the MADV_FREE pages around in the LRU
several times can save some overhead.

- -- 
All rights reversed
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJU1QqHAAoJEM553pKExN6DTXoH/3bS+VhdIm1EpOc8OOFtBHvd
T63DHObtOY1FOog48CtgvUCfo7Q+g1aG/9hz7lJNP1G26B3+LNszM9OtE/9QrYUH
uzmuWvFL7l0W0qen/WsyO0RcyqN+0mEXvNVqynTmJJu8qAG0p5WsjA6L5Penzj//
tnBmn5xb1h3COjDZkHsxBfkpfCpNq5dm88K6B3nApHz4QhfcviKefczsrWdZ/bBc
2uMnlIebKY1Oq9MDHsg8p/b3lIHzwAf0xGSvGLN0YfzDPzlqBMbxSbVubYEA9EaU
OiS1XqRp8okeGgrxsRAb/F8wPgClpce+h0E5xpyUuew2rlD1OmciX6iIDcE5Zrk=
=KUng
-----END PGP SIGNATURE-----

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v17 1/7] mm: support madvise(MADV_FREE)
From: Shaohua Li @ 2015-02-06 18:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Michael Kerrisk (man-pages), Andrew Morton,
	linux-kernel, linux-mm, linux-api, Hugh Dickins, Johannes Weiner,
	Rik van Riel, KOSAKI Motohiro, Mel Gorman, Jason Evans,
	zhangyanfei, Kirill A. Shutemov, Kirill A. Shutemov
In-Reply-To: <20150206125825.GA4498@dhcp22.suse.cz>

On Fri, Feb 06, 2015 at 01:58:25PM +0100, Michal Hocko wrote:
> On Thu 05-02-15 16:33:11, Shaohua Li wrote:
> [...]
> > Did you think about move the MADV_FREE pages to the head of inactive LRU, so
> > they can be reclaimed easily?
> 
> Yes this makes sense for pages living on the active LRU list. I would
> preserve LRU ordering on the inactive list because there is no good
> reason to make the operation more costly for inactive pages. On the
> other hand having tons of to-be-freed pages on the active list clearly
> sucks. Care to send a patch?

Considering anon pages are in active LRU first, it's likely MADV_FREE pages are
in active list. I'm curious why preserves the order of inactive list. App knows
which pages are cold, why don't take the advantages? I'll play the patch more
to see what I can do for it.

Thanks,
Shaohua

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v17 1/7] mm: support madvise(MADV_FREE)
From: Shaohua Li @ 2015-02-06 18:29 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michael Kerrisk (man-pages), Michal Hocko, Andrew Morton,
	linux-kernel, linux-mm, linux-api, Hugh Dickins, Johannes Weiner,
	Rik van Riel, KOSAKI Motohiro, Mel Gorman, Jason Evans,
	zhangyanfei, Kirill A. Shutemov, Kirill A. Shutemov
In-Reply-To: <20150206055103.GA13244@blaptop>

On Fri, Feb 06, 2015 at 02:51:03PM +0900, Minchan Kim wrote:
> Hi Shaohua,
> 
> On Thu, Feb 05, 2015 at 04:33:11PM -0800, Shaohua Li wrote:
> > 
> > Hi Minchan,
> > 
> > Sorry to jump in this thread so later, and if some issues are discussed before.
> > I'm interesting in this patch, so tried it here. I use a simple test with
> 
> No problem at all. Interest is always win over ignorance.
> 
> > jemalloc. Obviously this can improve performance when there is no memory
> > pressure. Did you try setup with memory pressure?
> 
> Sure but it was not a huge memory system like yours.

Yes, I'd like to check the symptom in memory pressure, so choose such test.

> > In my test, jemalloc will map 61G vma, and use about 32G memory without
> > MADV_FREE. If MADV_FREE is enabled, jemalloc will use whole 61G memory because
> > madvise doesn't reclaim the unused memory. If I disable swap (tweak your patch
> 
> Yes, IIUC, jemalloc replaces MADV_DONTNEED with MADV_FREE completely.

right.
> > slightly to make it work without swap), I got oom. If swap is enabled, my
> 
> You mean you modified anon aging logic so it works although there is no swap?
> If so, I have no idea why OOM happens. I guess it should free all of freeable
> pages during the aging so although system stall happens more, I don't expect
> OOM. Anyway, with MADV_FREE with no swap, we should consider more things
> about anonymous aging.

In the patch, MADV_FREE will be disabled and fallback to DONTNEED if no swap is
enabled. Our production environment doesn't enable swap, so I tried to delete
the 'no swap' check and make MADV_FREE always enabled regardless if swap is
enabled. I didn't change anything else. With such change, I saw oom
immediately. So definitely we have aging issue, the pages aren't reclaimed
fast.

> > system is totally stalled because of swap activity. Without the MADV_FREE,
> > everything is ok. Considering we definitely don't want to waste too much
> > memory, a system with memory pressure is normal, so sounds MADV_FREE will
> > introduce big trouble here.
> > 
> > Did you think about move the MADV_FREE pages to the head of inactive LRU, so
> > they can be reclaimed easily?
> 
> I think it's desirable if the page lived in active LRU.
> The reason I didn't that was caused by volatile ranges system call which
> was motivaion for MADV_FREE in my mind.
> In last LSF/MM, there was concern about data's hotness.
> Some of users want to keep that as it is in LRU position, others want to
> handle that as cold(tail of inactive list)/warm(head of inactive list)/
> hot(head of active list), for example.
> The vrange syscall was just about volatiltiy, not depends on page hotness
> so the decision on my head was not to change LRU order and let's make new
> hotness advise if we need it later.
> 
> However, MADV_FREE's main customer is allocators and afaik, they want
> to replace MADV_DONTNEED with MADV_FREE so I think it is really cold,
> but we couldn't make sure so head of inactive is good compromise.
> Another concern about tail of inactive list is that there could be
> plenty of pages in there, which was asynchromos write-backed in
> previous reclaim path, not-yet reclaimed because of not being able
> to free the in softirq context of writeback. It means we ends up
> freeing more potential pages to become workingset in advance
> than pages VM already decided to evict.

Yes, they are definitely cold pages. I thought We should make sure the
MADV_FREE pages are reclaimed first before other pages, at least in the anon
LRU list, though there might be difficult to determine if we should reclaim
writeback pages first or MADV_FREE pages first.

Thanks,
Shaohua

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH 2/2] kernel/fork: handle put_user errors for CLONE_PARENT_SETTID
From: Konstantin Khlebnikov @ 2015-02-06 16:23 UTC (permalink / raw)
  To: linux-api-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Linus Torvalds,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Roman Gushchin, Nikita Vetoshkin, Oleg Nesterov, Pavel Emelyanov
In-Reply-To: <20150206162301.18031.32251.stgit@buzz>

Handling of flag CLONE_PARENT_SETTID has the same problem: error returned
from put_user() is ignored. Glibc completely relies on that feature and uses
value returned from syscall only for error checking.

Kernels older than v2.6.24 handled that correctly but check has been removed
in commit 30e49c263e36 ("pid namespaces: allow cloning of new namespace").
Commit message tells nothing about reason. I guess that was fix for commit
425fb2b4bf5d ("pid namespaces: move alloc_pid() lower in copy_process()")
which breaks this logic: after it kernel writes parent pid as child pid.

This patch moves related put_user() from do_fork() back into copy_process()
where it was before and where error can be handled.

Another problem is that before v2.6.24 CLONE_PARENT_SETTID stored child pid
both in parent and child memory. Documentation in man clone(2) also tells so.
In v2.6.24 put_user() was placed after copy_mm(), now only parent sees it.
LTP test which should check that is buggy too: it clones child with CLONE_VM.
It seems nobody have noticed this for seven years, probably we should leave
it as is and document existing behavior.

Signed-off-by: Konstantin Khlebnikov <khlebnikov-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org>
---
 kernel/fork.c |   14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index f71305d..1ea2eae 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1194,6 +1194,7 @@ init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
 static struct task_struct *copy_process(unsigned long clone_flags,
 					unsigned long stack_start,
 					unsigned long stack_size,
+					int __user *parent_tidptr,
 					int __user *child_tidptr,
 					struct pid *pid,
 					int trace)
@@ -1416,6 +1417,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 			goto bad_fork_cleanup_io;
 	}
 
+	if (clone_flags & CLONE_PARENT_SETTID) {
+		retval = put_user(pid_vnr(pid), parent_tidptr);
+		if (retval)
+			goto bad_fork_free_pid;
+	}
+
 	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
 	/*
 	 * Clear TID on mm_release()?
@@ -1617,7 +1624,7 @@ static inline void init_idle_pids(struct pid_link *links)
 struct task_struct *fork_idle(int cpu)
 {
 	struct task_struct *task;
-	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0);
+	task = copy_process(CLONE_VM, 0, 0, NULL, NULL, &init_struct_pid, 0);
 	if (!IS_ERR(task)) {
 		init_idle_pids(task->pids);
 		init_idle(task, cpu);
@@ -1661,7 +1668,7 @@ long do_fork(unsigned long clone_flags,
 	}
 
 	p = copy_process(clone_flags, stack_start, stack_size,
-			 child_tidptr, NULL, trace);
+			 parent_tidptr, child_tidptr, NULL, trace);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.
@@ -1675,9 +1682,6 @@ long do_fork(unsigned long clone_flags,
 		pid = get_task_pid(p, PIDTYPE_PID);
 		nr = pid_vnr(pid);
 
-		if (clone_flags & CLONE_PARENT_SETTID)
-			put_user(nr, parent_tidptr);
-
 		if (clone_flags & CLONE_VFORK) {
 			p->vfork_done = &vfork;
 			init_completion(&vfork);

^ permalink raw reply related

* [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID
From: Konstantin Khlebnikov @ 2015-02-06 16:23 UTC (permalink / raw)
  To: linux-api, Andrew Morton, Linus Torvalds, linux-kernel
  Cc: Roman Gushchin, Nikita Vetoshkin, Oleg Nesterov, Pavel Emelyanov

Currently kernel ignores put_user() errors when it writes tid for syscall
clone flags CLONE_CHILD_SETTID and CLONE_CHILD_CLEARTID.

Unfortunately this code always worked in this way. We cannot abort syscall
if client tid pointer is invalid.

This patch adds get_user() after failed put_user(): if this address is not
even readable then we leave it alone: user-space probably don't care about
pid or error will be noticed soon. If address is readable then application
might be mislead about it's own pid. In this case CLONE_CHILD_SETTID kills
child task. In same condition failed CLONE_CHILD_CLEARTID prints warning.

Nikita found script which sometimes hangs inside glibc in function fork()
in child task right after returning from syscall some glibc assert fails:

#0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:93
#1  0x00007fabcc699087 in _L_lock_11477 at malloc.c:5249
#2  0x00007fabcc6973ed in __GI___libc_realloc at malloc.c:3054
#3  0x00007fabcc686dbd in _IO_vasprintf at vasprintf.c:86
#4  0x00007fabcc667747 in ___asprintf at asprintf.c:37
#5  0x00007fabcc642cfb in __assert_fail_base at assert.c:59
#6  0x00007fabcc642e42 in __GI___assert_fail at assert.c:103
#7  0x00007fabcc6d40b6 in __libc_fork at ../nptl/sysdeps/unix/sysv/linux/x86_64/../fork.c:142
#8  0x00007fabccad23cd in Perl_pp_system at pp_sys.c:4143c
#9  0x00007fabcca7fcd6 in Perl_runops_standard at run.c:41
#10 0x00007fabcca2139a in S_run_body at perl.c:2350
#11 perl_run at perl.c:2268
#12 0x0000000000400db9 in main at perlmain.c:120

Assert checks (THREAD_GETMEM (self, tid) != ppid). In child task pid still
equal to the parent pid! Write for CLONE_CHILD_SETTID had failed silently.
Unfortunately this assert in glibc is flawed, some internal locks are held
at this point and task hangs when tries to print out error message.

Usually memory allocations at page-faults are either completely successful
or task is killed by out of memory killer: buddy allocator handles 0-order
allocations as GFP_NOFAIL and retries them endlessly. OOM-killer in memory
cgroup works only for faulting from user-space. If kernel hits memcg limit
inside page-fault that will be handled as ordinary "Bad address": -EFAULT.

Whole sequence looks like: task calls fork, glibc calls syscall clone with
CLONE_CHILD_SETTID and passes pointer to TLS THREAD_SELF->tid as argument.
Child task gets read-only copy of VM including TLS. Child calls put_user()
to handle CLONE_CHILD_SETTID from schedule_tail(). put_user() trigger page
fault and it fails because do_wp_page()  hits memcg limit without invoking
OOM-killer because this is page-fault from kernel-space.  Put_user returns
-EFAULT, which is ignored.  Child returns into user-space and catches here
assert (THREAD_GETMEM (self, tid) != ppid), glibc tries to print something
but hangs on deadlock on internal locks. Halt and catch fire.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reported-by: Nikita Vetoshkin <nekto0n@yandex-team.ru>
---
 kernel/fork.c       |   15 ++++++++++++---
 kernel/sched/core.c |   16 ++++++++++++++--
 2 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 4dc2dda..f71305d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -822,11 +822,20 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
 	if (tsk->clear_child_tid) {
 		if (!(tsk->flags & PF_SIGNALED) &&
 		    atomic_read(&mm->mm_users) > 1) {
+			int dummy;
+
 			/*
-			 * We don't check the error code - if userspace has
-			 * not set up a proper pointer then tough luck.
+			 * We cannot report put_user fails - if userspace has
+			 * not set up a proper pointer then tough luck. It's
+			 * much worse if it's failed for some other reason:
+			 * for example memory shortage in CoW area, somebody
+			 * will wait us forever. Let's at least print warning.
 			 */
-			put_user(0, tsk->clear_child_tid);
+			if (unlikely(put_user(0, tsk->clear_child_tid)) &&
+			    !get_user(dummy, tsk->clear_child_tid) &&
+			    !fatal_signal_pending(current))
+				WARN_ON_ONCE(dummy);
+
 			sys_futex(tsk->clear_child_tid, FUTEX_WAKE,
 					1, NULL, NULL, 0);
 		}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e628cb1..74b33ff 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2312,8 +2312,20 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
 	post_schedule(rq);
 	preempt_enable();
 
-	if (current->set_child_tid)
-		put_user(task_pid_vnr(current), current->set_child_tid);
+	if (current->set_child_tid &&
+	    unlikely(put_user(task_pid_vnr(current), current->set_child_tid))) {
+		int dummy;
+
+		/*
+		 * If this address is unreadable then userspace has not set
+		 * proper pointer. Application either doesn't care or will
+		 * notice this soon. If this address is readable then task
+		 * will be mislead about its own tid. It's better to die.
+		 */
+		if (!get_user(dummy, current->set_child_tid) &&
+		    !fatal_signal_pending(current))
+			force_sig(SIGSEGV, current);
+	}
 }
 
 /*

^ permalink raw reply related

* Re: MADV_DONTNEED semantics? Was: [RFC PATCH] mm: madvise: Ignore repeated MADV_DONTNEED hints
From: Michael Kerrisk (man-pages) @ 2015-02-06 15:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: mtk.manpages, Vlastimil Babka, Kirill A. Shutemov, Dave Hansen,
	Mel Gorman, linux-mm@kvack.org, Minchan Kim, Andrew Morton, lkml,
	Linux API, linux-man, Hugh Dickins
In-Reply-To: <20150205154102.GA20607@dhcp22.suse.cz>

Hi Michael

On 02/05/2015 04:41 PM, Michal Hocko wrote:
> On Wed 04-02-15 20:24:27, Michael Kerrisk wrote:
> [...]
>> So, how about this text:
>>
>>               After a successful MADV_DONTNEED operation, the seman‐
>>               tics  of  memory  access  in  the specified region are
>>               changed: subsequent accesses of  pages  in  the  range
>>               will  succeed,  but will result in either reloading of
>>               the memory contents from the  underlying  mapped  file
> 
> "
> result in either providing the up-to-date contents of the underlying
> mapped file
> "

Thanks! I did something like that. See below.

> Would be more precise IMO because reload might be interpreted as a major
> fault which is not necessarily the case (see below).
> 
>>               (for  shared file mappings, shared anonymous mappings,
>>               and shmem-based techniques such  as  System  V  shared
>>               memory  segments)  or  zero-fill-on-demand  pages  for
>>               anonymous private mappings.
> 
> Yes, this wording is better because many users are not aware of
> MAP_ANON|MAP_SHARED being file backed in fact and mmap man page doesn't
> mention that.

(Michal, would you have a text to propose to add to the mmap(2) page?
Maybe it would be useful to add something there.)

> 
> I am just wondering whether it makes sense to mention that MADV_DONTNEED
> for shared mappings might be surprising and not freeing the backing
> pages thus not really freeing memory until there is a memory
> pressure. But maybe this is too implementation specific for a man
> page. What about the following wording on top of yours?
> "
> Please note that the MADV_DONTNEED hint on shared mappings might not
> lead to immediate freeing of pages in the range. The kernel is free to
> delay this until an appropriate moment. RSS of the calling process will
> be reduced however.
> "

Thanks! I added this, but dropped in the word "immediately" in the last 
sentence, since I assume that was implied. So now we have:

              After  a  successful MADV_DONTNEED operation, the seman‐
              tics of  memory  access  in  the  specified  region  are
              changed:  subsequent accesses of pages in the range will
              succeed, but will result in either repopulating the mem‐
              ory  contents from the up-to-date contents of the under‐
              lying mapped file  (for  shared  file  mappings,  shared
              anonymous  mappings,  and shmem-based techniques such as
              System V shared memory segments) or  zero-fill-on-demand
              pages for anonymous private mappings.

              Note  that,  when applied to shared mappings, MADV_DONT‐
              NEED might not lead to immediate freeing of the pages in
              the  range.   The  kernel  is  free to delay freeing the
              pages until an appropriate  moment.   The  resident  set
              size  (RSS)  of  the calling process will be immediately
              reduced however.

The current draft of the page can be found in a branch,
http://git.kernel.org/cgit/docs/man-pages/man-pages.git/log/?h=draft_madvise

Thanks,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox