From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Michael S. Tsirkin" <mst@redhat.com>
Subject: Re: [net-next PATCH] net: ptr_ring: otherwise safe empty checks can
 overrun array bounds
Date: Wed, 3 Jan 2018 20:34:18 +0200
Message-ID: <20180103195231-mutt-send-email-mst@kernel.org>
References: <20171228035024.14699.69453.stgit@john-Precision-Tower-5810>
 <20180102.115219.1101472320429215260.davem@davemloft.net>
 <20180102190107-mutt-send-email-mst@kernel.org>
 <20180102191329-mutt-send-email-mst@kernel.org>
 <149cba1c-e51a-e3fa-8105-34b95ad2e582@gmail.com>
 <20180103004619-mutt-send-email-mst@kernel.org>
 <457d384f-5cd1-d52d-f6e9-4c9ebb1f8d09@gmail.com>
 <20180103165249-mutt-send-email-mst@kernel.org>
 <127b09e0-2acb-f220-d525-e61ad34da794@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: David Miller <davem@davemloft.net>, jakub.kicinski@netronome.com,
        xiyou.wangcong@gmail.com, jiri@resnulli.us, netdev@vger.kernel.org
To: John Fastabend <john.fastabend@gmail.com>
Return-path: <netdev-owner@vger.kernel.org>
Received: from mx1.redhat.com ([209.132.183.28]:37268 "EHLO mx1.redhat.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1751134AbeACSeU (ORCPT <rfc822;netdev@vger.kernel.org>);
        Wed, 3 Jan 2018 13:34:20 -0500
Content-Disposition: inline
In-Reply-To: <127b09e0-2acb-f220-d525-e61ad34da794@gmail.com>
Sender: netdev-owner@vger.kernel.org
List-ID: <netdev.vger.kernel.org>

On Wed, Jan 03, 2018 at 09:46:15AM -0800, John Fastabend wrote:
> On 01/03/2018 07:50 AM, Michael S. Tsirkin wrote:
> > On Tue, Jan 02, 2018 at 04:25:03PM -0800, John Fastabend wrote:
> >>>
> >>> More generally, what makes this usage safe?
> >>> Is there a way to formalize it at the API level?
> >>>
> >>
> >> Right I think these are good questions. I think the ptr_ring API should
> >> allow a peek operation to be used without a lock. The user has to ensure
> >> they only use it as a hint and if its dereferenced the user needs to
> >> ensure the object is not free'd from some other codepath while it is
> >> being dereferenced. The existing API seems to match this.
> >>
> >> This is how I used it in pfifo_fast expecting the above to be true. The
> >> API allows for false negatives which _should_ be OK if the user is expecting
> >> this. Alternatively, we could make it false positives if you want and
> >> that would also work for me considering this case is hit very rarely.
> > 
> > By now I'm confused by which are positives and which are negatives :)
> > OK so the guarantees we want would be:
> > 
> > - empty can return false if ring is empty.
> >   caller must re-check with consumer lock taken
> 
> Correct.
> 
> > - if multiple threads call it, only one thread
> 
> What is "it" here, __skb_array_empty() presumably.
> 
> >   is guaranteed that empty will not return true
> >   if ring is non empty.
> >   after detecting that ring is not empty,
> >   this thread shall ....
> > 
> The guarantee is even weaker.
> 
>   - if __skb_array_empty() returns true, either the
>     ring is empty OR a consumer operation is in
>     progress.

Well I'm not sure that's guaranteed actually, in that
in progress isn't all that well defined on multi-core
systems.

E.g. write of NULL could bypass index write and
empty will return true on another CPU even though
both completed on our CPU.

I still think our use is ok, I'm just still having
trouble formulating the exact rules.

> 
> For pfifo_fast this is OK because some thread is making
> progress at this point.
>
>
> Note, even if multiple threads
> are calling __skb_array_empty() there is no guarantee
> any of them will return false even if the ring has
> elements.
>
> > can you help me fill in the blank please?
> > 
> 
> Did that help?
> 
> > 
> > 
> > 
> > 
> >>>>> John, others, could you pls confirm it's not too bad performance-wise?
> >>>>> I'll then split it up properly and re-post.
> >>>>>
> >>>>
> >>>> I haven't benchmarked it but in the dequeue case taking a lock for
> >>>> every priority even when its empty seems unneeded.
> >>>
> >>> Well it does seem to make the code more comprehensible and more robust.
> >>>
> >>
> >> Its a trade-off between performance and robustness.
> >>
> >>> But OK -  I was looking at fixing the unlocked empty API to make sure it
> >>> actually does what it's supposed to. I posted a draft earlier in this
> >>> thread, it needs to be looked at in depth to figure out whether it can
> >>> ever give false negatives or positives, and document the results.
> >>>
> >>>
> >>
> >> I'll look at it. But I still think keeping a lockless version makes sense
> >> for many use cases.
> > 
> > Fine. Just let's try to document what are the guarantees.
> > 
> 
> Yep. Guarantees should be (on ptr_ring implementation and similar
> on skb_array implementation)
> 
>   - if __ptr_ring_empty returns true, the ring may be empty
>     user must use operation with consumer lock to guarantee the
>     ring is empty. Note, when run with concurrent consumer operations
>     it is possible to return true when ring has elements. This
>     occurs when ring consumer head rolls over ring size. Also,
>     producer operations may put object in the ring concurrently
>     making empty check incorrect. To avoid this use locked
>     ptr_ring_empty() API.
> 
>   - if __ptr_ring_empty returns false, the ring may have elements.
>     Possibly, scenario is consumer operation and __ptr_ring_empty
>     operation run concurrently causing false to be returned when
>     ring is empty. To avoid this use locked ptr_ring_empty() API.
> 
> >>>
> >>>>> -->
> >>>>>
> >>>>> net: don't misuse ptr_ring_peek
> >>>>>
> >>>>> ptr_ring_peek only claims to be safe if the result is never
> >>>>> dereferenced, which isn't the case for its use in sch_generic.
> >>>>> Add locked API variants and use the bh one here.
> >>>>>
> >>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >>>>>
> >>>>> ---
> >>>>>
> >>>>
> >>>> [...]
> >>>>
> >>>>> --- a/net/sched/sch_generic.c
> >>>>> +++ b/net/sched/sch_generic.c
> >>>>> @@ -659,7 +659,7 @@ static struct sk_buff *pfifo_fast_peek(struct Qdisc *qdisc)
> >>>>>  	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
> >>>>>  		struct skb_array *q = band2list(priv, band);
> >>>>>  
> >>>>> -		skb = __skb_array_peek(q);
> >>>>> +		skb = skb_array_peek_bh(q);
> >>>>
> >>>> Ah I should have added a comment here. For now peek() is only used from
> >>>> locking qdiscs. So peek and consume/produce operations will never happen
> >>>> in parallel. In this case we should never hit the false negative case with
> >>>> my patch or the out of bounds reference without my patch.
> > 
> > OK so this is the part I missed. Can you add a comment please?
> > 
> > 
> 
> Yes, working on a documentation patch set to address misleading and missing
> comments/documentation in qdisc layer now.
> 
> >>>> Doing a peek() op without qdisc lock is a bit problematic anyways. With
> >>>> current code another cpu could consume the skb and free it. Either we
> >>>> can ensure a single consumer runs at a time on an array (not the same as
> >>>> qdisc maybe) or just avoid peek operations in this case. My current plan
> >>>> was to just avoid peek() ops altogether, they seem unnecessary with the
> >>>> types of qdiscs I want to be build.
> >>>>
> >>>> Thanks,
> >>>> John
> >>>
> >>> For the lockless qdisc, for net-next, do we need the patch above until you fix that though?
> >>>
> >>
> >> No, I think after this patch (net: ptr_ring: otherwise safe empty checks...) is
> >> applied we do not need any additional fixes in net-next. Future work will
> >> require the above patch (the one you provided) though so its useful work.
> > 
> > The one that avoids allocating extra memory?
> > 
> 
> Let me try summarize again.
> 
> We need the extra memory slot patch if we expose the
> __ptr_ring_empty/__skb_array_empty API. Otherwise we hit the out
> of bounds issues. And for qdisc layer the __skb_array_empty() API
> call is useful so we should not remove it.
> 
> In addition to the qdisc layer using the __skb_array_empty() API if
> we need a peek() API that works without the qdisc lock then your patch,
> the one adding the locked peek API, will be needed. It is not needed
> though until we have a qdisc that would use it without the lock.
> 
> In summary we keep the extra allocation to make the __skb_array_empty() API
> usable. And hold off exposing the skb_array_peek()/ptr_ring_peek() until we
> have a user for it. We can roll your proposed patch into any series we come
> up with for a new/updated qdisc.
> 
> > 
> >> I'll do another review of the false positive case though to be sure the
> >> current code is OK wrt handling false positives and any potential stalls.
> > 
> > 
> > Thanks!
> > 
> >>>>>  	}
> >>>>>  
> >>>>>  	return skb;
> >>>>>