* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
[not found] ` <1474564068.23058.144.camel@edumazet-glaptop3.roam.corp.google.com>
@ 2016-09-22 17:55 ` Vlastimil Babka
2016-09-23 9:42 ` David Laight
0 siblings, 1 reply; 6+ messages in thread
From: Vlastimil Babka @ 2016-09-22 17:55 UTC (permalink / raw)
To: Eric Dumazet
Cc: Alexander Viro, Andrew Morton, linux-fsdevel, linux-kernel,
linux-mm, Michal Hocko, netdev, Linux API, linux-man
On 09/22/2016 07:07 PM, Eric Dumazet wrote:
> On Thu, 2016-09-22 at 18:56 +0200, Vlastimil Babka wrote:
>> On 09/22/2016 06:49 PM, Eric Dumazet wrote:
>> > On Thu, 2016-09-22 at 18:43 +0200, Vlastimil Babka wrote:
>> >> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
>> >> with the number of fds passed. We had a customer report page allocation
>> >> failures of order-4 for this allocation. This is a costly order, so it might
>> >> easily fail, as the VM expects such allocation to have a lower-order fallback.
>> >>
>> >> Such trivial fallback is vmalloc(), as the memory doesn't have to be
>> >> physically contiguous. Also the allocation is temporary for the duration of the
>> >> syscall, so it's unlikely to stress vmalloc too much.
>> >
>> > vmalloc() uses a vmap_area_lock spinlock, and TLB flushes.
>> >
>> > So I guess allowing vmalloc() being called from an innocent application
>> > doing a select() might be dangerous, especially if this select() happens
>> > thousands of time per second.
>>
>> Isn't seq_buf_alloc() similarly exposed? And ipc_alloc()?
>
> Possibly.
>
> We don't have a library function (attempting kmalloc(), fallback to
> vmalloc() presumably to avoid abuses, but I guess some patches were
> accepted without thinking about this.
So in the case of select() it seems like the memory we need 6 bits per file
descriptor, multiplied by the highest possible file descriptor (nfds) as passed
to the syscall. According to the man page of select:
EINVAL nfds is negative or exceeds the RLIMIT_NOFILE resource limit (see
getrlimit(2)).
The code actually seems to silently cap the value instead of returning EINVAL
though? (IIUC):
/* max_fds can increase, so grab it once to avoid race */
rcu_read_lock();
fdt = files_fdtable(current->files);
max_fds = fdt->max_fds;
rcu_read_unlock();
if (n > max_fds)
n = max_fds;
The default for this cap seems to be 1024 where I checked (again, IIUC, it's
what ulimit -n returns?). I wasn't able to change it to more than 2048, which
makes the bitmaps still below PAGE_SIZE.
So if I get that right, the system admin would have to allow really large
RLIMIT_NOFILE to even make vmalloc() possible here. So I don't see it as a large
concern?
Vlastimil
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 6+ messages in thread
* RE: [PATCH v2] fs/select: add vmalloc fallback for select(2)
2016-09-22 17:55 ` [PATCH v2] fs/select: add vmalloc fallback for select(2) Vlastimil Babka
@ 2016-09-23 9:42 ` David Laight
[not found] ` <063D6719AE5E284EB5DD2968C1650D6DB0107DC8-VkEWCZq2GCInGFn1LkZF6NBPR1lH4CV8@public.gmane.org>
0 siblings, 1 reply; 6+ messages in thread
From: David Laight @ 2016-09-23 9:42 UTC (permalink / raw)
To: 'Vlastimil Babka', Eric Dumazet
Cc: Alexander Viro, Andrew Morton, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michal Hocko,
netdev@vger.kernel.org, Linux API, linux-man@vger.kernel.org
From: Vlastimil Babka
> Sent: 22 September 2016 18:55
...
> So in the case of select() it seems like the memory we need 6 bits per file
> descriptor, multiplied by the highest possible file descriptor (nfds) as passed
> to the syscall. According to the man page of select:
>
> EINVAL nfds is negative or exceeds the RLIMIT_NOFILE resource limit (see
> getrlimit(2)).
That second clause is relatively recent.
> The code actually seems to silently cap the value instead of returning EINVAL
> though? (IIUC):
>
> /* max_fds can increase, so grab it once to avoid race */
> rcu_read_lock();
> fdt = files_fdtable(current->files);
> max_fds = fdt->max_fds;
> rcu_read_unlock();
> if (n > max_fds)
> n = max_fds;
>
> The default for this cap seems to be 1024 where I checked (again, IIUC, it's
> what ulimit -n returns?). I wasn't able to change it to more than 2048, which
> makes the bitmaps still below PAGE_SIZE.
>
> So if I get that right, the system admin would have to allow really large
> RLIMIT_NOFILE to even make vmalloc() possible here. So I don't see it as a large
> concern?
4k open files isn't that many.
Especially for programs that are using pipes to emulate windows events.
I suspect that fdt->max_fds is an upper bound for the highest fd the
process has open - not the RLIMIT_NOFILE value.
select() shouldn't be silently ignoring large values of 'n' unless
the fd_set bits are zero.
Of course, select does scale well for high numbered fds
and neither poll nor select scale well for large numbers of fds.
David
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
[not found] ` <063D6719AE5E284EB5DD2968C1650D6DB0107DC8-VkEWCZq2GCInGFn1LkZF6NBPR1lH4CV8@public.gmane.org>
@ 2016-09-23 9:58 ` Vlastimil Babka
2016-09-23 13:35 ` David Laight
0 siblings, 1 reply; 6+ messages in thread
From: Vlastimil Babka @ 2016-09-23 9:58 UTC (permalink / raw)
To: David Laight, Eric Dumazet
Cc: Alexander Viro, Andrew Morton,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, Michal Hocko,
netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
linux-man-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
On 09/23/2016 11:42 AM, David Laight wrote:
> From: Vlastimil Babka
>> Sent: 22 September 2016 18:55
> ...
>> So in the case of select() it seems like the memory we need 6 bits per file
>> descriptor, multiplied by the highest possible file descriptor (nfds) as passed
>> to the syscall. According to the man page of select:
>>
>> EINVAL nfds is negative or exceeds the RLIMIT_NOFILE resource limit (see
>> getrlimit(2)).
>
> That second clause is relatively recent.
Interesting... so it was added without actually being true in the kernel
code?
>> The code actually seems to silently cap the value instead of returning EINVAL
>> though? (IIUC):
>>
>> /* max_fds can increase, so grab it once to avoid race */
>> rcu_read_lock();
>> fdt = files_fdtable(current->files);
>> max_fds = fdt->max_fds;
>> rcu_read_unlock();
>> if (n > max_fds)
>> n = max_fds;
>>
>> The default for this cap seems to be 1024 where I checked (again, IIUC, it's
>> what ulimit -n returns?). I wasn't able to change it to more than 2048, which
>> makes the bitmaps still below PAGE_SIZE.
>>
>> So if I get that right, the system admin would have to allow really large
>> RLIMIT_NOFILE to even make vmalloc() possible here. So I don't see it as a large
>> concern?
>
> 4k open files isn't that many.
> Especially for programs that are using pipes to emulate windows events.
Sure but IIUC we need 6 bits per file. That means up to almost 42k
files, we should fit into order-3 allocation, which effectively cannot
fail right now.
> I suspect that fdt->max_fds is an upper bound for the highest fd the
> process has open - not the RLIMIT_NOFILE value.
I gathered that the highest fd effectively limits the number of files,
so it's the same. I might be wrong.
> select() shouldn't be silently ignoring large values of 'n' unless
> the fd_set bits are zero.
Yeah that doesn't seem to conform to the manpage.
> Of course, select does scale well for high numbered fds
> and neither poll nor select scale well for large numbers of fds.
True.
> David
>
^ permalink raw reply [flat|nested] 6+ messages in thread
* RE: [PATCH v2] fs/select: add vmalloc fallback for select(2)
2016-09-23 9:58 ` Vlastimil Babka
@ 2016-09-23 13:35 ` David Laight
2016-09-26 10:01 ` Vlastimil Babka
0 siblings, 1 reply; 6+ messages in thread
From: David Laight @ 2016-09-23 13:35 UTC (permalink / raw)
To: 'Vlastimil Babka', Eric Dumazet
Cc: Alexander Viro, Andrew Morton, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michal Hocko,
netdev@vger.kernel.org, Linux API, linux-man@vger.kernel.org
From: Vlastimil Babka
> Sent: 23 September 2016 10:59
...
> > I suspect that fdt->max_fds is an upper bound for the highest fd the
> > process has open - not the RLIMIT_NOFILE value.
>
> I gathered that the highest fd effectively limits the number of files,
> so it's the same. I might be wrong.
An application can reduce RLIMIT_NOFILE below that of an open file.
David
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v2] fs/select: add vmalloc fallback for select(2)
2016-09-23 13:35 ` David Laight
@ 2016-09-26 10:01 ` Vlastimil Babka
2016-09-26 15:02 ` David Laight
0 siblings, 1 reply; 6+ messages in thread
From: Vlastimil Babka @ 2016-09-26 10:01 UTC (permalink / raw)
To: David Laight, Eric Dumazet
Cc: Alexander Viro, Andrew Morton, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michal Hocko,
netdev@vger.kernel.org, Linux API, linux-man@vger.kernel.org
On 09/23/2016 03:35 PM, David Laight wrote:
> From: Vlastimil Babka
>> Sent: 23 September 2016 10:59
> ...
>> > I suspect that fdt->max_fds is an upper bound for the highest fd the
>> > process has open - not the RLIMIT_NOFILE value.
>>
>> I gathered that the highest fd effectively limits the number of files,
>> so it's the same. I might be wrong.
>
> An application can reduce RLIMIT_NOFILE below that of an open file.
OK, I did some more digging in the code, and my understanding is that:
- fdt->max_fds is the current size of the fdtable, which isn't allocated upfront
to match the limit, but grows as needed. This means it's OK for
core_sys_select() to silently cap nfds, as it knows there are no fd's with
higher number in the fdtable, so it's a performance optimization. However, to
match what the manpage says, there should be another check against RLIMIT_NOFILE
to return -EINVAL, which there isn't, AFAICS.
- fdtable is expanded (and fdt->max_fds bumped) by
expand_files()->expand_fdtable() which checks against fs.nr_open sysctl, which
seems to be 1048576 where I checked.
- callers of expand_files(), such as dup(), check the rlimit(RLIMIT_NOFILE) to
limit the expansion.
So yeah, application can reduce RLIMIT_NOFILE, but it has no effect on fdtable
and fdt->max_fds that is already above the limit. Select syscall would have to
check the rlimit to conform to the manpage. Or (rather?) we should fix the manpage.
As for the original vmalloc() flood concern, I still think we're safe, as
ordinary users are limited by RLIMIT_NOFILE way below sizes that would need
vmalloc(), and root has many other options to DOS the system (or worse).
Vlastimil
> David
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 6+ messages in thread
* RE: [PATCH v2] fs/select: add vmalloc fallback for select(2)
2016-09-26 10:01 ` Vlastimil Babka
@ 2016-09-26 15:02 ` David Laight
0 siblings, 0 replies; 6+ messages in thread
From: David Laight @ 2016-09-26 15:02 UTC (permalink / raw)
To: 'Vlastimil Babka', Eric Dumazet
Cc: Alexander Viro, Andrew Morton, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michal Hocko,
netdev@vger.kernel.org, Linux API, linux-man@vger.kernel.org
From: Vlastimil Babka
> Sent: 26 September 2016 11:02
> On 09/23/2016 03:35 PM, David Laight wrote:
> > From: Vlastimil Babka
> >> Sent: 23 September 2016 10:59
> > ...
> >> > I suspect that fdt->max_fds is an upper bound for the highest fd the
> >> > process has open - not the RLIMIT_NOFILE value.
> >>
> >> I gathered that the highest fd effectively limits the number of files,
> >> so it's the same. I might be wrong.
> >
> > An application can reduce RLIMIT_NOFILE below that of an open file.
>
> OK, I did some more digging in the code, and my understanding is that:
>
> - fdt->max_fds is the current size of the fdtable, which isn't allocated upfront
> to match the limit, but grows as needed. This means it's OK for
> core_sys_select() to silently cap nfds, as it knows there are no fd's with
> higher number in the fdtable, so it's a performance optimization.
Not entirely, if any bits are set for fd above fdt->max_fds then select()
call should fail - fd not open.
> However, to
> match what the manpage says, there should be another check against RLIMIT_NOFILE
> to return -EINVAL, which there isn't, AFAICS.
>
> - fdtable is expanded (and fdt->max_fds bumped) by
> expand_files()->expand_fdtable() which checks against fs.nr_open sysctl, which
> seems to be 1048576 where I checked.
>
> - callers of expand_files(), such as dup(), check the rlimit(RLIMIT_NOFILE) to
> limit the expansion.
>
> So yeah, application can reduce RLIMIT_NOFILE, but it has no effect on fdtable
> and fdt->max_fds that is already above the limit. Select syscall would have to
> check the rlimit to conform to the manpage. Or (rather?) we should fix the manpage.
I think the manpage should be fixed (delete that clause).
Then add code to the system call to scan the high bit sets (above fdt->max_fds)
for any non-zero bytes. This can be done into a small buffer.
> As for the original vmalloc() flood concern, I still think we're safe, as
> ordinary users are limited by RLIMIT_NOFILE way below sizes that would need
> vmalloc(), and root has many other options to DOS the system (or worse).
Some processes need very high numbers of fd.
Likely they don't use select() on them, but trashing performance if they
do is a bit silly.
Trying to slit the 3 masks first seems sensible.
David
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2016-09-26 15:02 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20160922164359.9035-1-vbabka@suse.cz>
[not found] ` <1474562982.23058.140.camel@edumazet-glaptop3.roam.corp.google.com>
[not found] ` <12efc491-a0e7-1012-5a8b-6d3533c720db@suse.cz>
[not found] ` <1474564068.23058.144.camel@edumazet-glaptop3.roam.corp.google.com>
2016-09-22 17:55 ` [PATCH v2] fs/select: add vmalloc fallback for select(2) Vlastimil Babka
2016-09-23 9:42 ` David Laight
[not found] ` <063D6719AE5E284EB5DD2968C1650D6DB0107DC8-VkEWCZq2GCInGFn1LkZF6NBPR1lH4CV8@public.gmane.org>
2016-09-23 9:58 ` Vlastimil Babka
2016-09-23 13:35 ` David Laight
2016-09-26 10:01 ` Vlastimil Babka
2016-09-26 15:02 ` David Laight
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).