From mboxrd@z Thu Jan  1 00:00:00 1970
From: Alan Jenkins <alan.christopher.jenkins@gmail.com>
Subject: Re: [PATCH 13/18] io_uring: add file set registration
Date: Fri, 8 Feb 2019 14:02:28 +0000
Message-ID: <02e71636-5b63-41e6-0ffd-646f305011c9@gmail.com>
References: <20190207195552.22770-1-axboe@kernel.dk>
 <20190207195552.22770-14-axboe@kernel.dk>
 <bfd383aa-24fb-35ae-eef5-545ab0f493c7@gmail.com>
 <2ac73020-6ab0-e351-3a1a-180d0f1f801b@kernel.dk>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Return-path: <owner-linux-aio@kvack.org>
In-Reply-To: <2ac73020-6ab0-e351-3a1a-180d0f1f801b@kernel.dk>
Content-Language: en-GB
Sender: owner-linux-aio@kvack.org
To: Jens Axboe <axboe@kernel.dk>, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-api@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, jannh@google.com, viro@ZenIV.linux.org.uk
List-Id: linux-api@vger.kernel.org

On 08/02/2019 12:57, Jens Axboe wrote:
> On 2/8/19 5:17 AM, Alan Jenkins wrote:
>>> +static int io_sqe_files_scm(struct io_ring_ctx *ctx)
>>> +{
>>> +#if defined(CONFIG_NET)
>>> +	struct scm_fp_list *fpl = ctx->user_files;
>>> +	struct sk_buff *skb;
>>> +	int i;
>>> +
>>> +	skb =  __alloc_skb(0, GFP_KERNEL, 0, NUMA_NO_NODE);
>>> +	if (!skb)
>>> +		return -ENOMEM;
>>> +
>>> +	skb->sk = ctx->ring_sock->sk;
>>> +	skb->destructor = unix_destruct_scm;
>>> +
>>> +	fpl->user = get_uid(ctx->user);
>>> +	for (i = 0; i < fpl->count; i++) {
>>> +		get_file(fpl->fp[i]);
>>> +		unix_inflight(fpl->user, fpl->fp[i]);
>>> +		fput(fpl->fp[i]);
>>> +	}
>>> +
>>> +	UNIXCB(skb).fp = fpl;
>>> +	skb_queue_head(&ctx->ring_sock->sk->sk_receive_queue, skb);
>> This code sounds elegant if you know about the existence of unix_gc(),
>> but quite mysterious if you don't.  (E.g. why "inflight"?)  Could we
>> have a brief comment, to comfort mortal readers on their journey?
>>
>> /* A message on a unix socket can hold a reference to a file. This can
>> cause a reference cycle. So there is a garbage collector for unix
>> sockets, which we hook into here. */
> Yes that's a good idea, I've added a comment as to why we go through the
> trouble of doing this socket + skb dance.

Great, thanks.

>> I think this is bypassing too_many_unix_fds() though?  I understood that
>> was intended to bound kernel memory allocation, at least in principle.
> As the code stands above, it'll cap it at 253. I'm just now reworking it
> to NOT be limited to the SCM max fd count, but still impose a limit of
> 1024 on the number of registered files. This is important to cap the
> memory allocation attempt as well.

I saw you were limiting to SCM_MAX_FD per io_uring.  On the other hand, 
there's no specific limit on the number of io_urings you can open (only 
the standard limits on fds).  So this would let you allocate hundreds of 
times more files than the previous limit RLIMIT_NOFILE...

static inline bool too_many_unix_fds(struct task_struct *p)
{
	struct user_struct *user = current_user();

	if (unlikely(user->unix_inflight > task_rlimit(p, RLIMIT_NOFILE)))
		return !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN);
	return false;
}

RLIMIT_NOFILE is technically per-task, but here it is capping 
unix_inflight per-user.  So the way I look at this, the number of file 
descriptors per user is bounded by NOFILE * NPROC.  Then 
user->unix_inflight can have one additional process' worth (NOFILE) of 
"inflight" files.  (Plus SCM_MAX_FD slop, because too_many_fds() is only 
called once per SCM_RIGHTS).

Because io_uring doesn't check too_many_unix_fds(), I think it will let 
you have about 253 (or 1024) more process' worth of open files. That 
could be big proportionally when RLIMIT_NPROC is low.

I don't know if it matters.  It maybe reads like an oversight though.

(If it does matter, it might be cleanest to change too_many_unix_fds() 
to get rid of the "slop".  Since that may be different between af_unix 
and io_uring; 253 v.s. 1024 or whatever. E.g. add a parameter for the 
number of inflight files we want to add.)

>> Also, this code relies on CONFIG_NET.  To handle the case where
>> CONFIG_NET is not enabled, don't you still need to forbid registering an
>> io_uring fd ?
> Good point, we do still need to reject the io_uring fd itself if
> CONFIG_UNIX is not enabled. Done.

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>