From: "Jon Rosen (jrosen)" <jrosen@cisco.com>
To: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>,
	Willem de Bruijn <willemb@google.com>,
	Eric Dumazet <edumazet@google.com>,
	Kees Cook <keescook@chromium.org>,
	David Windsor <dwindsor@gmail.com>,
	"Rosen, Rami" <rami.rosen@intel.com>,
	"Reshetova, Elena" <elena.reshetova@intel.com>,
	"Mike Maloney" <maloney@google.com>,
	Benjamin Poirier <bpoirier@suse.com>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	"open list:NETWORKING [GENERAL]" <netdev@vger.kernel.org>,
	open list <linux-kernel@vger.kernel.org>
Subject: RE: [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun
Date: Mon, 21 May 2018 18:16:42 +0000	[thread overview]
Message-ID: <64c91d04479348cabbcdf1df0ff7d40f@XCH-RTP-016.cisco.com> (raw)
In-Reply-To: <CAF=yD-Kfto4jJYMFzn=PV8OYdhEmdNfW+aakDhRMzRBWhWY0UQ@mail.gmail.com>
On Monday, May 21, 2018 1:07 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>On Mon, May 21, 2018 at 8:57 AM, Jon Rosen (jrosen) <jrosen@cisco.com> wrote:
>> On Sunday, May 20, 2018 7:22 PM, Willem de Bruijn
>> <willemdebruijn.kernel@gmail.com> wrote:
>>> On Sun, May 20, 2018 at 6:51 PM, Willem de Bruijn
>>> <willemdebruijn.kernel@gmail.com> wrote:
>>>> On Sat, May 19, 2018 at 8:07 AM, Jon Rosen <jrosen@cisco.com> wrote:
>>>>> Fix PACKET_RX_RING bug for versions TPACKET_V1 and TPACKET_V2 which
>>>>> causes the ring to get corrupted by allowing multiple kernel threads
>>>>> to claim ownership of the same ring entry. Track ownership in a shadow
>>>>> ring structure to prevent other kernel threads from reusing the same
>>>>> entry before it's fully filled in, passed to user space, and then
>>>>> eventually passed back to the kernel for use with a new packet.
>>>>>
>>>>> Signed-off-by: Jon Rosen <jrosen@cisco.com>
>>>>> ---
>>>>>
>>>>> There is a bug in net/packet/af_packet.c:tpacket_rcv in how it manages
>>>>> the PACKET_RX_RING for versions TPACKET_V1 and TPACKET_V2.  This bug makes
>>>>> it possible for multiple kernel threads to claim ownership of the same
>>>>> ring entry, corrupting the ring and the corresponding packet(s).
>>>>>
>>>>> These diffs are the second proposed solution; the previous proposal was
>>>>> described in https://www.mail-archive.com/netdev@vger.kernel.org/msg227468.html
>>>>> under the subject [RFC PATCH] packet: mark ring entry as in-use inside
>>>>> spin_lock to prevent RX ring overrun
>>>>>
>>>>> Those diffs would have changed the binary interface and have broken certain
>>>>> applications. Consensus was that such a change would be inappropriate.
>>>>>
>>>>> These new diffs use a shadow ring in kernel space for tracking intermediate
>>>>> state of an entry and prevent more than one kernel thread from simultaneously
>>>>> allocating a ring entry. This avoids any impact to the binary interface
>>>>> between kernel and userspace but comes at the additional cost of requiring a
>>>>> second spin_lock when passing ownership of a ring entry to userspace.
>>>>>
>>>>> Jon Rosen (1):
>>>>>   packet: track ring entry use using a shadow ring to prevent RX ring
>>>>>     overrun
>>>>>
>>>>>  net/packet/af_packet.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>  net/packet/internal.h  | 14 +++++++++++
>>>>>  2 files changed, 78 insertions(+)
>>>>>
>>>>
>>>>> @@ -2383,7 +2412,11 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
>>>>>  #endif
>>>>>
>>>>>         if (po->tp_version <= TPACKET_V2) {
>>>>> +               spin_lock(&sk->sk_receive_queue.lock);
>>>>>                 __packet_set_status(po, h.raw, status);
>>>>> +               packet_rx_shadow_release(rx_shadow_ring_entry);
>>>>> +               spin_unlock(&sk->sk_receive_queue.lock);
>>>>> +
>>>>>                 sk->sk_data_ready(sk);
>>>>
>>>> Thanks for continuing to look at this. I spent some time on it last time
>>>> around but got stuck, too.
>>>>
>>>> This version takes an extra spinlock in the hot path. That will be very
>>>> expensive. Once we need to accept that, we could opt for a simpler
>>>> implementation akin to the one discussed in the previous thread:
>>>>
>>>> stash a value in tp_padding or similar while tp_status remains
>>>> TP_STATUS_KERNEL to signal ownership to concurrent kernel
>>>> threads. The issue previously was that that field could not atomically
>>>> be cleared together with __packet_set_status. This is no longer
>>>> an issue when holding the queue lock.
>>>>
>>>> With a field like tp_padding, unlike tp_len, it is arguably also safe to
>>>> clear it after flipping status (userspace should treat it as undefined).
>>>>
>>>> With v1 tpacket_hdr, no explicit padding field is defined but due to
>>>> TPACKET_HDRLEN alignment it exists on both 32 and 64 bit
>>>> platforms.
>>>>
>>>> The danger with using padding is that a process may write to it
>>>> and cause deadlock, of course. There is no logical reason for doing
>>>> so.
>>>
>>> For the ring, there is no requirement to allocate exactly the amount
>>> specified by the user request. Safer than relying on shared memory
>>> and simpler than the extra allocation in this patch would be to allocate
>>> extra shadow memory at the end of the ring (and not mmap that).
>>>
>>> That still leaves an extra cold cacheline vs using tp_padding.
>>
>> Given my lack of experience and knowledge in writing kernel code
>> it was easier for me to allocate the shadow ring as a separate
>> structure.  Of course it's not about me and my skills so if it's
>> more appropriate to allocate at the tail of the existing ring
>> then certainly I can look at doing that.
>>
>> I think the bigger issues, as you've pointed out, are the cost of
>> the additional spin lock and whether the additional state should be
>> stored in-band (fewer cache lines) or out-of-band (less risk of
>> breaking due to unpredictable application behavior).
>
> We don't need the spinlock if clearing the shadow byte after
> setting the status to user.
>
> Worst case, user will set it back to kernel while the shadow
> byte is not cleared yet and the next producer will drop a packet.
> But next producers will make progress, so there is no deadlock
> or corruption.
I thought so too for a while but after spending more time than I
care to admit I realized the following sequence was occurring:
   Core A                       Core B
   ------                       ------
   - Enter spin_lock
   -   Get tp_status of head (X)
       tp_status == 0
   -   Check inuse
       inuse == 0
   -   Allocate entry X
       advance head (X+1)
       set inuse=1
   - Exit spin_lock
     <very long delay>
                                <allocate N-1 entries
                                where N = size of ring>
                                - Enter spin_lock
                                -   get tp_status of head (X+N)
                                    tp_status == 0 (but slot
                                    in use for X on core A)
   - write tp_status of         <--- trouble!
     X = TP_STATUS_USER         <--- trouble!
   - write inuse=0              <--- trouble!
                                -   Check inuse
                                    inuse == 0
                                -   Allocate entry X+N
                                    advance head (X+N+1)
                                    set inuse=1
                                - Exit spin_lock
At this point Core A has just passed slot X to userspace with a
packet and Core B has just been assigned slot X+N (the same slot as
X) for its new packet. Both cores A and B end up filling in that
slot.  Tracking this down was one of the reasons it took me a
while to produce these updated diffs.
>
> It probably does require a shadow structure as opposed to a
> padding byte to work with the long tail of (possibly broken)
> applications, sadly.
I agree.
>
> A setsockopt for userspace to signal a stricter interpretation of
> tp_status to elide the shadow hack could then be considered.
> It's not pretty. Either way, no full new version is required.
>
>> As much as I would like to find a solution that doesn't require
>> the spin lock I have yet to do so. Maybe the answer is that
>> existing applications will need to suffer the performance impact
>> but a new version or option for TPACKET_V1/V2 could be added to
>> indicate strict adherence of the TP_STATUS_USER bit and then the
>> original diffs could be used.
>>
>> There is another option I was considering but have yet to try
>> which would avoid needing a shadow ring by using counter(s) to
>> track maximum sequence number queued to userspace vs. the next
>> sequence number to be allocated in the ring.  If the difference
>> is greater than the size of the ring then the ring can be
>> considered full and the allocation would fail. Of course this may
>> create an additional hotspot between cores, not sure if that
>> would be significant or not.
>
> Please do have a look, but I don't think that this will work in this
> case in practice. It requires tracking the producer tail. Updating
> the slowest writer requires probing each subsequent slot's status
> byte to find the new tail, which is a lot of (by then cold) cacheline
> reads.
I've thought about it a little more and am not convinced it's
workable, but I'll spend a little more time on it before giving
up.
Thread overview: 14+ messages
2018-05-19 12:07 [PATCH v2] packet: track ring entry use using a shadow ring to prevent RX ring overrun Jon Rosen
2018-05-20 22:51 ` Willem de Bruijn
2018-05-20 23:22   ` Willem de Bruijn
2018-05-21 12:57     ` Jon Rosen (jrosen)
2018-05-21 17:07       ` Willem de Bruijn
2018-05-21 18:16         ` Jon Rosen (jrosen) [this message]
2018-05-22 15:41           ` Willem de Bruijn
2018-05-23 11:08             ` Jon Rosen (jrosen)
2018-05-22 14:12         ` Jon Rosen (jrosen)
2018-05-22 15:46           ` Willem de Bruijn
2018-05-23 11:54     ` Jon Rosen (jrosen)
2018-05-23 13:37       ` Willem de Bruijn
2018-05-23 15:29         ` Jon Rosen (jrosen)
2018-05-23 15:53           ` Willem de Bruijn