From: Ingo Molnar <mingo@elte.hu>
To: Andi Kleen <andi@firstfloor.org>
Cc: linux-kernel@vger.kernel.org, libc-alpha@sourceware.org,
Andi Kleen <ak@linux.intel.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [PATCH 4/5] Add a sysconf syscall
Date: Tue, 17 May 2011 13:25:19 +0200 [thread overview]
Message-ID: <20110517112519.GA31877@elte.hu> (raw)
In-Reply-To: <20110516133655.GG7128@elte.hu>
Andi,
AFAICS you have not replied to my detailed analysis below yet (you only replied
to a sub-reply, not addressing most of the points i raised) - should i expect a
reply in the days to come, or will you follow your past pattern of
unpredictably ignoring review concerns?
Firstly i think the obvious canonical fix is to expose the number of CPUs in a
separate file in /proc and not in /proc/stat which is expensive to construct
and parse. Also arguably sleepycat should not need to read the number of CPUs
all the time - why does it do that?
Secondly, even assuming that fast access to all these system globals is
desirable i question the choice to include rlimits and nsems_max: they are not
really system globals (so why are they in sysconf() to begin with?) and they
can already be queried cheaply using getrlimits()/semctl().
These are the values that make the sysconf() information per task and
complicate the data structure.
If we look at all non-rlimit values:
_SC_CLK_TCK: return HZ;
_SC_UIO_MAXIOV: return UIO_MAXIOV;
_SC_PAGESIZE: return PAGE_SIZE;
_SC_SYMLOOP_MAX: return SYMLOOP_MAX;
_SC_PHYS_PAGES: return totalram_pages;
_SC_AVPHYS_PAGES: return nr_free_pages();
_SC_NPROCESSORS_CONF: return num_possible_cpus();
_SC_NPROCESSORS_ONLN: return num_online_cpus();
_SC_NGROUPS_MAX: return NGROUPS_MAX;
These are all global, read-mostly values.
Were these rlimits not included we could do a sane global vdso page with the
NPROCESSORS number exposed (which is what the sleepycat example you cited
really needs it appears) and expose it to glibc using a new auxiliary entry (or
via a vsyscall that returns the pointer).
It would become very simple and no COW trickery is needed in that case either.
See my arguments in more detail below.
Thanks,
Ingo
* Ingo Molnar <mingo@elte.hu> wrote:
> * Andi Kleen <andi@firstfloor.org> wrote:
>
> > > What glibc does (opening /proc/stat) is rather stupid, but i think your
> > > syscall
> >
> > I don't think it has any other choice today. [...]
>
> Sure it has a choice: most of the sysconf values are global so creating a
> permanent mmap()-able data in /tmp or elsewhere and mapping it unless it's
> inaccessible isnt that particularly hard to cache most of the system-wide
> constants, now is it?
>
> The CPU count could be updated from CPU hotplug events.
>
> rlimits can be queried using getrlimit().
>
> If at that point glibc said to us: "hey, lets work in making it even faster"
> then there would be no reason for us to criticise glibc. Right now gibc does
> not even *try* to be smart about it.
>
> > [...] So if anything is "stupid" it is the kernel for not providing efficient
> > interfaces for this.
> >
> > > Note that these are mostly constant or semi-constant values that are
> > > updated very rarely:
> >
> > That's not true. Most of them are dynamic. Take a look at the patch.
> > Also several of those have changed recently.
>
> As i said they are mostly constant or semi-constant values that are updated
> very rarely.
>
> If you think that i am wrong then do me the basic courtesy of mentioning the
> examples that you think disprove my claim, instead of broadly pointing me to
> your patch ...
>
> > > If glibc is stupid and reads /proc/stat to receive something it could cache
> > > or mmap() itself then hey, did you consider fixing glibc or creating a sane
> > > libc?
> >
> > Caching doesn't help when you have a workload that exec()s a lot. Also some
> > of these values can change at runtime.
>
> Here you modify your claim. Now it's not 'dynamic' anymore but 'can change'?
>
> Which one is it now, "mostly constant or semi-constant values that are updated
> very rarely" as i claim, "dynamic" as you first claimed or "can change" as you
> claim here (which is also pretty ambiguous)?
>
> > > If we *really* want to offer kernel help for these values even then your
> > > solution is still wrong: then the proper solution would be to define a
> > > standard *data* structure and map it as a vsyscall *data* page -
> > > essentially a kernel-guaranteed data mmap(), with no extra syscall needed!
> >
> > That's quite complicted because several of those are dynamically computed
> > based on other values. Sometimes they are also not tied to the mm_struct --
> > like the vsyscall is. For example some of the rlimits are per task, not VM.
> > Basically your proposal doesn't interact well with clone().
>
> Threads with different rlimits but shared VM are extreme rarities.
>
> Could we please concentrate on the common case? A '-1' in the data page can let
> the code fall back to some slow path.
>
> Also note that rlimit itself already has an interface to query them:
> getrlimit(). So if you do not want the complexity of caching rlimits in the
> data page you do not have to start with that complexity.
>
> [ But it can be done: modifying the rlimit (which predominantly only happens in
> the login process) is rare and happens in a parent task. ]
>
> > Even if we ignored that semantic problem it would need another writable page
> > per task because the values cannot be shared.
>
> Sure most of the values can be shared.
>
> Most of them are exactly one of a low number of variants for all tasks in the
> system, for typical Linux bootups. I suspect if the data page was COW-ed but
> inherited across exec() it would behave exactly the way it should be: inherited
> by all tasks but privatized if a task modifies it for itself and all children.
>
> Also, the first step could be simplified by not exposing rlimits - as rlimits
> are already exposed via other channels.
>
> > Also I never liked the idea of having more writable pages per task, [...]
>
> If you limit it to your own faulty implementation then i'm not surprised that
> you do not like it.
>
> > [...] It increases the memory footprint of a single process more. Given a 4K
> > page is not a lot, but lots of 4K pages add up. Some workloads like to have
> > lots of small processes and I think that's a valuable use case Linux should
> > stay lean and efficient at.
> >
> > [OK in theory one could do COW for the page and share it but that would get
> > really complicated]
>
> Why would it get complicated? It's not writable to user-space, that's all that
> is special about it.
>
> > I also don't think it's THAT performance critical to justify the vsyscall.
>
> You apparently did not understand the gist of my point: it's the CONCEPT of
> adding a crappy catch-all sysconf() interface that sucks IMHO. It's simply bad
> taste.
>
> If you want to expose data then expose the data intelligently, not some poor
> system call interface that is also slower.
>
> > The simple syscall is already orders of magnitude faster than /proc, and
> > seems to solve the performance problems we've seen completely.
>
> A handful of horse manure is less stinky than a big pile of it, still i wouldnt
> want to eat either.
>
> > It's also simple and straight forward and simple to userstand and maintain. I
> > doubt any of that would apply to a vsyscall solution.
>
> Note: i did not suggest a vsyscall, but a vsyscall *data area*. There's a big
> difference between the two!
>
> It could be passed down to user-space using a new auxiliary entry (see
> fs/binfmt_elf.c), as it's really part of a task's environment conceptually.
>
> > I don't think the additional effort for a vsyscall would be worth it at this
> > point, unless there's some concrete example that would justify it. Even then
> > it wouldn't work for some of the values.
> >
> > Also a vsyscall doesn't help on non x86 anyways.
>
> There's nothing x86 about aux entries.
>
> > As for efficiency: I thought about doing a batched interface where
> > the user could pass in an array of values to fill in. But at least for the
> > workloads I looked at the application usually calls sysconf() where
> > the array size would be always 1. And that's the standard interface.
> > This might be still worth doing, but I would like to see a concrete
> > use case first.
> >
> > > That could have other uses as well in the future.
> >
> > Hmm for what?
>
> *Iff* the concensus is that we are fine with a new page per task/thread then we
> could use it for all sorts of nifty things like the current CPU id for example.
>
> > Note we already have a fast mechanism to pass some thing to glibc in the aux
> > vector.
>
> So when you got so far in your reply why did you not delete your above (false)
> outrage about the vsyscall, which i never suggested and which you thus forced
> me to reply to?
>
> > > That way it would much faster than your current code btw.
> > >
> > > So unless there are some compelling arguments in favor of sys_sysconf()
> > > that i missed, i have to NAK this:
> >
> > Well see above for lots of reasons you missed. They are understandable
> > mistakes for someone who first looks at the problem though. I'll attempt to
> > improve the documentation next time.
>
> I don't think your condescending tone towards me is warranted or fair, i
> offered a fair technical criticism of your patch series. Why are you
> attacking me like this?
>
> Thanks,
>
> Ingo
--
Thanks,
Ingo
next prev parent reply other threads:[~2011-05-17 11:25 UTC|newest]
Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-05-13 23:24 Add a sysconf syscall Andi Kleen
2011-05-13 23:24 ` [PATCH 1/5] VFS: Make symlink nesting limit a define Andi Kleen
2011-05-13 23:24 ` [PATCH 2/5] Move max_threads variable declaration into include file Andi Kleen
2011-05-13 23:24 ` [PATCH 3/5] EXEC: Use define for stack to argument size limit Andi Kleen
2011-05-13 23:24 ` [PATCH 4/5] Add a sysconf syscall Andi Kleen
2011-05-14 6:57 ` Ingo Molnar
2011-05-14 16:34 ` Andi Kleen
2011-05-16 13:36 ` Ingo Molnar
2011-05-17 11:25 ` Ingo Molnar [this message]
2011-05-16 15:51 ` Andy Lutomirski
2011-05-16 16:08 ` Andi Kleen
2011-05-16 17:06 ` Andrew Lutomirski
[not found] ` <OFCC4C610A.F152D00D-ON86257892.005E11F4-86257892.005E22BA@us.ibm.com>
[not found] ` <4DD15E9B.2090809@linux.intel.com>
2011-05-17 10:59 ` Ingo Molnar
2011-05-16 15:42 ` Denys Vlasenko
2011-05-16 16:01 ` Andi Kleen
[not found] ` <OF30360F87.5C6D6DCF-ON86257892.005D7E68-86257892.005E0059@us.ibm.com>
2011-05-16 17:39 ` Andi Kleen
[not found] ` <OFD2EE69FB.301A458A-ON86257892.00631BE8-86257892.006A93AF@us.ibm.com>
2011-05-16 20:51 ` Andi Kleen
2011-05-17 12:33 ` Denys Vlasenko
2011-05-13 23:24 ` [PATCH 5/5] Hook up sysconf syscall for all architectures Andi Kleen
2011-05-14 1:21 ` David Miller
2011-05-14 2:51 ` Andi Kleen
2011-05-14 2:23 ` Mike Frysinger
2011-05-24 1:46 ` Mike Frysinger
2011-05-26 18:04 ` Mike Frysinger
2011-05-26 18:45 ` Andi Kleen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20110517112519.GA31877@elte.hu \
--to=mingo@elte.hu \
--cc=ak@linux.intel.com \
--cc=akpm@linux-foundation.org \
--cc=andi@firstfloor.org \
--cc=libc-alpha@sourceware.org \
--cc=linux-kernel@vger.kernel.org \
--cc=tglx@linutronix.de \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.