* Re: [PATCH bpf 1/6] netkit: Add tstats per-CPU traffic counters
From: Jakub Kicinski @ 2023-11-06 21:28 UTC (permalink / raw)
To: Daniel Borkmann; +Cc: martin.lau, netdev, bpf, Nikolay Aleksandrov
In-Reply-To: <20231103222748.12551-2-daniel@iogearbox.net>
On Fri, 3 Nov 2023 23:27:43 +0100 Daniel Borkmann wrote:
> Add dev->tstats traffic accounting to netkit. The latter contains per-CPU
> RX and TX counters.
>
> The dev's TX counters are bumped upon pass/unspec as well as redirect
> verdicts, in other words, on everything except for drops.
>
> The dev's RX counters are bumped upon successful __netif_rx(), as well
> as from skb_do_redirect() (not part of this commit here).
>
> Using dev->lstats with having just a single packets/bytes counter and
> inferring one another's RX counters from the peer dev's lstats is not
> possible given skb_do_redirect() can also bump the device's stats.
sorry for the delay in replying, I'll comment here instead of on:
https://lore.kernel.org/all/6d5cb0ef-fabc-7ca3-94b2-5b1925a6805f@iogearbox.net/
What I had in mind was to have the driver just set the type of stats.
That way it doesn't have to bother with error handling either
(allocation failure checking, making sure free happens in the right
spot etc. all happen in the core). Here's a completely untested diff:
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 9980517ed8b0..c23cb7dc0122 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1506,25 +1506,12 @@ static void veth_free_queues(struct net_device *dev)
static int veth_dev_init(struct net_device *dev)
{
- int err;
-
- dev->lstats = netdev_alloc_pcpu_stats(struct pcpu_lstats);
- if (!dev->lstats)
- return -ENOMEM;
-
- err = veth_alloc_queues(dev);
- if (err) {
- free_percpu(dev->lstats);
- return err;
- }
-
- return 0;
+ return veth_alloc_queues(dev);
}
static void veth_dev_free(struct net_device *dev)
{
veth_free_queues(dev);
- free_percpu(dev->lstats);
}
#ifdef CONFIG_NET_POLL_CONTROLLER
@@ -1802,6 +1789,8 @@ static void veth_setup(struct net_device *dev)
dev->hw_enc_features = VETH_FEATURES;
dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
netif_set_tso_max_size(dev, GSO_MAX_SIZE);
+
+ dev->pcpu_stat_type = NETDEV_PCPU_STAT_LSTAT;
}
/*
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 208c63f177f4..25e71480ca58 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1797,6 +1797,13 @@ enum netdev_ml_priv_type {
ML_PRIV_CAN,
};
+enum netdev_stat_type {
+ NETDEV_PCPU_STAT_NONE,
+ NETDEV_PCPU_STAT_LSTAT, /* struct pcpu_lstats */
+ NETDEV_PCPU_STAT_TSTAT, /* struct pcpu_sw_netstats */
+ NETDEV_PCPU_STAT_DSTAT, /* struct pcpu_dstats */
+};
+
/**
* struct net_device - The DEVICE structure.
*
@@ -2354,6 +2361,8 @@ struct net_device {
void *ml_priv;
enum netdev_ml_priv_type ml_priv_type;
+ /** @pcpu_stat_type: type of per-CPU stats in use */
+ enum netdev_stat_type pcpu_stat_type:8;
union {
struct pcpu_lstats __percpu *lstats;
struct pcpu_sw_netstats __percpu *tstats;
diff --git a/net/core/dev.c b/net/core/dev.c
index 0d548431f3fa..15fec94c7d24 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10049,6 +10049,45 @@ void netif_tx_stop_all_queues(struct net_device *dev)
}
EXPORT_SYMBOL(netif_tx_stop_all_queues);
+static int netdev_do_alloc_pcpu_stats(struct net_device *dev)
+{
+ void __percpu *v;
+
+ switch (dev->pcpu_stat_type) {
+ case NETDEV_PCPU_STAT_NONE:
+ return 0;
+ case NETDEV_PCPU_STAT_LSTAT:
+ v = dev->lstats = netdev_alloc_pcpu_stats(struct pcpu_lstats);
+ break;
+ case NETDEV_PCPU_STAT_TSTAT:
+ v = dev->tstats =
+ netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
+ break;
+ case NETDEV_PCPU_STAT_DSTAT:
+ v = dev->dstats = netdev_alloc_pcpu_stats(struct pcpu_dstats);
+ break;
+ }
+
+ return v ? 0 : -ENOMEM;
+}
+
+static void netdev_do_free_pcpu_stats(struct net_device *dev)
+{
+ switch (dev->pcpu_stat_type) {
+ case NETDEV_PCPU_STAT_NONE:
+ return;
+ case NETDEV_PCPU_STAT_LSTAT:
+ free_percpu(dev->lstats);
+ break;
+ case NETDEV_PCPU_STAT_TSTAT:
+ free_percpu(dev->tstats);
+ break;
+ case NETDEV_PCPU_STAT_DSTAT:
+ free_percpu(dev->dstats);
+ break;
+ }
+}
+
/**
* register_netdevice() - register a network device
* @dev: device to register
@@ -10109,9 +10148,13 @@ int register_netdevice(struct net_device *dev)
goto err_uninit;
}
+ ret = netdev_do_alloc_pcpu_stats(dev);
+ if (ret)
+ goto err_uninit;
+
ret = dev_index_reserve(net, dev->ifindex);
if (ret < 0)
- goto err_uninit;
+ goto err_free_pcpu;
dev->ifindex = ret;
/* Transfer changeable features to wanted_features and enable
@@ -10217,6 +10260,8 @@ int register_netdevice(struct net_device *dev)
call_netdevice_notifiers(NETDEV_PRE_UNINIT, dev);
err_ifindex_release:
dev_index_release(net, dev->ifindex);
+err_free_pcpu:
+ netdev_do_free_pcpu_stats(dev);
err_uninit:
if (dev->netdev_ops->ndo_uninit)
dev->netdev_ops->ndo_uninit(dev);
@@ -10469,6 +10514,7 @@ void netdev_run_todo(void)
WARN_ON(rcu_access_pointer(dev->ip_ptr));
WARN_ON(rcu_access_pointer(dev->ip6_ptr));
+ netdev_do_free_pcpu_stats(dev);
if (dev->priv_destructor)
dev->priv_destructor(dev);
if (dev->needs_free_netdev)
^ permalink raw reply related
* Re: [PATCH bpf 4/6] bpf, netkit: Add indirect call wrapper for fetching peer dev
From: Jakub Kicinski @ 2023-11-06 21:32 UTC (permalink / raw)
To: Daniel Borkmann; +Cc: martin.lau, netdev, bpf, Nikolay Aleksandrov
In-Reply-To: <20231103222748.12551-5-daniel@iogearbox.net>
On Fri, 3 Nov 2023 23:27:46 +0100 Daniel Borkmann wrote:
> ndo_get_peer_dev is used in tcx BPF fast path, therefore make use of
> indirect call wrapper and therefore optimize the bpf_redirect_peer()
> internal handling a bit. Add a small skb_get_peer_dev() wrapper which
> utilizes the INDIRECT_CALL_1() macro instead of open coding.
Why don't we kill the ndo and put the pointer in struct net_device?
^ permalink raw reply
* Re: [PATCH net v3] tcp: Fix -Wc23-extensions in tcp_options_write()
From: Nathan Chancellor @ 2023-11-06 21:34 UTC (permalink / raw)
To: Dmitry Safonov
Cc: ndesaulniers, trix, noureddine, hch, netdev, linux-kernel, llvm,
patches, edumazet, davem, dsahern, kuba, pabeni
In-Reply-To: <a8cc305d-0ab8-4ff7-b11a-94f51f33ec92@arista.com>
On Mon, Nov 06, 2023 at 09:26:48PM +0000, Dmitry Safonov wrote:
> Seems like exactly the fix that my git testing tree had, with an
> exception to naming the helper tcp_ao_options_write().
Heh, not sure why I never considered that as an option... I am guessing
it does not matter enough for a v4 at this point but I could send a
net-next change later to update it if you so desire!
> But then I found* your patch-v1 and decided not to send an alternative
> patch.
>
> Thanks for fixing this,
Thanks for taking a look!
> Reviewed-by: Dmitry Safonov <dima@arista.com>
>
> *had to fix my Gmail lkml filter to label not only emails with cc/to my
> name, but also the raw email address (usually, I got them to/cc "Dmitry
> Safonov", but this one didn't have the name and got lost in the lkml pile).
Sorry about that, b4 used to have some interesting behavior around names
at one point (don't remember the details at the moment) and just using
emails avoided those issues. Maybe I should go back to names and emails
to see if I notice any problems again.
Cheers,
Nathan
^ permalink raw reply
* Re: [PATCH net v3] tcp: Fix -Wc23-extensions in tcp_options_write()
From: Dmitry Safonov @ 2023-11-06 21:48 UTC (permalink / raw)
To: Nathan Chancellor
Cc: ndesaulniers, trix, noureddine, hch, netdev, linux-kernel, llvm,
patches, edumazet, davem, dsahern, kuba, pabeni
In-Reply-To: <20231106213408.GA1841794@dev-arch.thelio-3990X>
On 11/6/23 21:34, Nathan Chancellor wrote:
> On Mon, Nov 06, 2023 at 09:26:48PM +0000, Dmitry Safonov wrote:
>> Seems like exactly the fix that my git testing tree had, with an
>> exception to naming the helper tcp_ao_options_write().
>
> Heh, not sure why I never considered that as an option... I am guessing
> it does not matter enough for a v4 at this point but I could send a
> net-next change later to update it if you so desire!
It doesn't matter really, not worth another patch :-)
>> But then I found* your patch-v1 and decided not to send an alternative
>> patch.
>>
>> Thanks for fixing this,
>
> Thanks for taking a look!
>
>> Reviewed-by: Dmitry Safonov <dima@arista.com>
>>
>> *had to fix my Gmail lkml filter to label not only emails with cc/to my
>> name, but also the raw email address (usually, I got them to/cc "Dmitry
>> Safonov", but this one didn't have the name and got lost in the lkml pile).
>
> Sorry about that, b4 used to have some interesting behavior around names
> at one point (don't remember the details at the moment) and just using
> emails avoided those issues. Maybe I should go back to names and emails
> to see if I notice any problems again.
No worries, should be fixed now. I preferred previously filtering by
full name, rather than email address as I send patches from both work
and home emails, but also sometimes people send patches/questions to
emails from the companies I previously worked for, regardless .mailmap
entries.
Probably, they look for author in lkml/mail archive, rather than use git
to get the current/proper email address.
Thanks,
Dmitry
^ permalink raw reply
* Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags
From: Stanislav Fomichev @ 2023-11-06 21:59 UTC (permalink / raw)
To: Mina Almasry
Cc: David Ahern, netdev, linux-kernel, linux-arch, linux-kselftest,
linux-media, dri-devel, linaro-mm-sig, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
Ilias Apalodimas, Arnd Bergmann, Willem de Bruijn, Shuah Khan,
Sumit Semwal, Christian König, Shakeel Butt, Jeroen de Borst,
Praveen Kaligineedi, Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <CAHS8izMrnVUfbbS=OcJ6JT9SZRRfZ2MC7UnggthpZT=zf2BGLA@mail.gmail.com>
On 11/06, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 11:34 AM David Ahern <dsahern@kernel.org> wrote:
> >
> > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > On 11/05, Mina Almasry wrote:
> > >> For device memory TCP, we expect the skb headers to be available in host
> > >> memory for access, and we expect the skb frags to be in device memory
> > >> and unaccessible to the host. We expect there to be no mixing and
> > >> matching of device memory frags (unaccessible) with host memory frags
> > >> (accessible) in the same skb.
> > >>
> > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > >> are device memory frags or not.
> > >>
> > >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> > >> and marks the skb as skb->devmem accordingly.
> > >>
> > >> Add checks through the network stack to avoid accessing the frags of
> > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > >>
> > >> Signed-off-by: Willem de Bruijn <willemb@google.com>
> > >> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> > >> Signed-off-by: Mina Almasry <almasrymina@google.com>
> > >>
> > >> ---
> > >> include/linux/skbuff.h | 14 +++++++-
> > >> include/net/tcp.h | 5 +--
> > >> net/core/datagram.c | 6 ++++
> > >> net/core/gro.c | 5 ++-
> > >> net/core/skbuff.c | 77 ++++++++++++++++++++++++++++++++++++------
> > >> net/ipv4/tcp.c | 6 ++++
> > >> net/ipv4/tcp_input.c | 13 +++++--
> > >> net/ipv4/tcp_output.c | 5 ++-
> > >> net/packet/af_packet.c | 4 +--
> > >> 9 files changed, 115 insertions(+), 20 deletions(-)
> > >>
> > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > >> index 1fae276c1353..8fb468ff8115 100644
> > >> --- a/include/linux/skbuff.h
> > >> +++ b/include/linux/skbuff.h
> > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > >> * @csum_level: indicates the number of consecutive checksums found in
> > >> * the packet minus one that have been verified as
> > >> * CHECKSUM_UNNECESSARY (max 3)
> > >> + * @devmem: indicates that all the fragments in this skb are backed by
> > >> + * device memory.
> > >> * @dst_pending_confirm: need to confirm neighbour
> > >> * @decrypted: Decrypted SKB
> > >> * @slow_gro: state present at GRO time, slower prepare step required
> > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > >> #if IS_ENABLED(CONFIG_IP_SCTP)
> > >> __u8 csum_not_inet:1;
> > >> #endif
> > >> -
> > >> + __u8 devmem:1;
> > >> #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > >> __u16 tc_index; /* traffic control index */
> > >> #endif
> > >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > >> __skb_zcopy_downgrade_managed(skb);
> > >> }
> > >>
> > >> +/* Return true if frags in this skb are not readable by the host. */
> > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > >> +{
> > >> + return skb->devmem;
> > >
> > > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > > It better communicates the fact that the stack shouldn't dereference the
> > > frags (because it has 'devmem' fragments or for some other potential
> > > future reason).
> >
> > +1.
> >
> > Also, the flag on the skb is an optimization - a high level signal that
> > one or more frags is in unreadable memory. There is no requirement that
> > all of the frags are in the same memory type.
David: maybe there should be such a requirement (that they all are
unreadable)? Might be easier to support initially; we can relax later
on.
> The flag indicates that the skb contains all devmem dma-buf memory
> specifically, not generic 'not_readable' frags as the comment says:
>
> + * @devmem: indicates that all the fragments in this skb are backed by
> + * device memory.
>
> The reason it's not a generic 'not_readable' flag is because handing
> off a generic not_readable skb to the userspace is semantically not
> what we're doing. recvmsg() is augmented in this patch series to
> return a devmem skb to the user via a cmsg_devmem struct which refers
> specifically to the memory in the dma-buf. recvmsg() in this patch
> series is not augmented to give any 'not_readable' skb to the
> userspace.
>
> IMHO skb->devmem + an skb_frags_not_readable() as implemented is
> correct. If a new type of unreadable skbs are introduced to the stack,
> I imagine the stack would implement:
>
> 1. new header flag: skb->newmem
> 2.
>
> static inline bool skb_frags_not_readable(const struct skb_buff *skb)
> {
> return skb->devmem || skb->newmem;
> }
>
> 3. tcp_recvmsg_devmem() would handle skb->devmem skbs is in this patch
> series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.
You copy it to the userspace in a special way because your frags
are page_is_page_pool_iov(). I agree with David, the skb bit is
just and optimization.
For most of the core stack, it doesn't matter why your skb is not
readable. For a few places where it matters (recvmsg?), you can
double-check your frags (all or some) with page_is_page_pool_iov.
Unrelated: we probably need socket to dmabuf association as well (via
netlink or something).
We are fundamentally receiving into and sending from a dmabuf (devmem ==
dmabuf).
And once you have this association, recvmsg shouldn't need any new
special flags.
^ permalink raw reply
* Re: [PATCH net] macsec: Abort MACSec Rx offload datapath when skb is not marked with MACSec metadata
From: Rahul Rameshbabu @ 2023-11-06 22:15 UTC (permalink / raw)
To: Sabrina Dubroca
Cc: netdev, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
David S. Miller, Saeed Mahameed
In-Reply-To: <ZULRxX9eIbFiVi7v@hog>
On Wed, 01 Nov, 2023 23:31:33 +0100 Sabrina Dubroca <sd@queasysnail.net> wrote:
> 2023-11-01, 13:02:17 -0700, Rahul Rameshbabu wrote:
>> When MACSec is configured on an outer netdev, traffic received directly
>> through the underlying netdev should not be processed by the MACSec Rx
>> datapath. When using MACSec offload on an outer netdev, traffic with no
>> metadata indicator in the skb is mistakenly considered as MACSec traffic
>> and incorrectly handled in the handle_not_macsec function. Treat skbs with
>> no metadata type as non-MACSec packets rather than assuming they are MACSec
>> packets.
>
> What about the other drivers? mlx5 is the only driver that sets md_dst
> on its macsec skbs, so their offloaded packets just get dropped now?
After taking a deeper look throughout the tree, I realize there are
macsec offloading drivers that do not set md_dst. In this event, we fail
to correctly handle the skb and deliver to the port as you mentioned
previously. Sorry about this miss on my end.
However, I believe that all macsec offload supporting devices run into
the following problem today (including mlx5 devices).
When I configure macsec offload on a device and then vlan on top of the
macsec interface, I become unable to send traffic through the underlying
device.
(can replace mlx5_1 with any ifname of a device that supports macsec offload)
Side 1
ip link del macsec0
ip address flush mlx5_1
ip address add 1.1.1.1/24 dev mlx5_1
ip link set dev mlx5_1 up
ip link add link mlx5_1 macsec0 type macsec sci 1 encrypt on
ip link set dev macsec0 address 00:11:22:33:44:66
ip macsec offload macsec0 mac
ip macsec add macsec0 tx sa 0 pn 1 on key 00 dffafc8d7b9a43d5b9a3dfbbf6a30c16
ip macsec add macsec0 rx sci 2 on
ip macsec add macsec0 rx sci 2 sa 0 pn 1 on key 00 ead3664f508eb06c40ac7104cdae4ce5
ip address flush macsec0
ip address add 2.2.2.1/24 dev macsec0
ip link set dev macsec0 up
ip link add link macsec0 name macsec_vlan type vlan id 1
ip link set dev macsec_vlan address 00:11:22:33:44:88
ip address flush macsec_vlan
ip address add 3.3.3.1/24 dev macsec_vlan
ip link set dev macsec_vlan up
Side 2
ip link del macsec0
ip address flush mlx5_1
ip address add 1.1.1.2/24 dev mlx5_1
ip link set dev mlx5_1 up
ip link add link mlx5_1 macsec0 type macsec sci 2 encrypt on
ip link set dev macsec0 address 00:11:22:33:44:77
ip macsec offload macsec0 mac
ip macsec add macsec0 tx sa 0 pn 1 on key 00 ead3664f508eb06c40ac7104cdae4ce5
ip macsec add macsec0 rx sci 1 on
ip macsec add macsec0 rx sci 1 sa 0 pn 1 on key 00 dffafc8d7b9a43d5b9a3dfbbf6a30c16
ip address flush macsec0
ip address add 2.2.2.2/24 dev macsec0
ip link set dev macsec0 up
ip link add link macsec0 name macsec_vlan type vlan id 1
ip link set dev macsec_vlan address 00:11:22:33:44:99
ip address flush macsec_vlan
ip address add 3.3.3.2/24 dev macsec_vlan
ip link set dev macsec_vlan up
Side 1
ping -I mlx5_1 1.1.1.2
PING 1.1.1.2 (1.1.1.2) from 1.1.1.1 mlx5_1: 56(84) bytes of data.
From 1.1.1.1 icmp_seq=1 Destination Host Unreachable
ping: sendmsg: No route to host
From 1.1.1.1 icmp_seq=2 Destination Host Unreachable
From 1.1.1.1 icmp_seq=3 Destination Host Unreachable
I am thinking the solution is a combination of annotating which macsec
devices support md_dst and this patch. However, I am not sure this fix
would be helpful for devices that support macsec offload without
utilizing md_dst information (would still be problematic).
Would be glad to hear any suggestions you may have before sending out a
v2.
Sorry for the late reply. Was away from a trip and wanted to validate
some things before posting back.
--
Thanks,
Rahul Rameshbabu
^ permalink raw reply
* Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags
From: Mina Almasry @ 2023-11-06 22:18 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: David Ahern, netdev, linux-kernel, linux-arch, linux-kselftest,
linux-media, dri-devel, linaro-mm-sig, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
Ilias Apalodimas, Arnd Bergmann, Willem de Bruijn, Shuah Khan,
Sumit Semwal, Christian König, Shakeel Butt, Jeroen de Borst,
Praveen Kaligineedi, Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <ZUlhu4hlTaqR3CTh@google.com>
On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On 11/06, Mina Almasry wrote:
> > On Mon, Nov 6, 2023 at 11:34 AM David Ahern <dsahern@kernel.org> wrote:
> > >
> > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > On 11/05, Mina Almasry wrote:
> > > >> For device memory TCP, we expect the skb headers to be available in host
> > > >> memory for access, and we expect the skb frags to be in device memory
> > > >> and unaccessible to the host. We expect there to be no mixing and
> > > >> matching of device memory frags (unaccessible) with host memory frags
> > > >> (accessible) in the same skb.
> > > >>
> > > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > > >> are device memory frags or not.
> > > >>
> > > >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> > > >> and marks the skb as skb->devmem accordingly.
> > > >>
> > > >> Add checks through the network stack to avoid accessing the frags of
> > > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > > >>
> > > >> Signed-off-by: Willem de Bruijn <willemb@google.com>
> > > >> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> > > >> Signed-off-by: Mina Almasry <almasrymina@google.com>
> > > >>
> > > >> ---
> > > >> include/linux/skbuff.h | 14 +++++++-
> > > >> include/net/tcp.h | 5 +--
> > > >> net/core/datagram.c | 6 ++++
> > > >> net/core/gro.c | 5 ++-
> > > >> net/core/skbuff.c | 77 ++++++++++++++++++++++++++++++++++++------
> > > >> net/ipv4/tcp.c | 6 ++++
> > > >> net/ipv4/tcp_input.c | 13 +++++--
> > > >> net/ipv4/tcp_output.c | 5 ++-
> > > >> net/packet/af_packet.c | 4 +--
> > > >> 9 files changed, 115 insertions(+), 20 deletions(-)
> > > >>
> > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > >> index 1fae276c1353..8fb468ff8115 100644
> > > >> --- a/include/linux/skbuff.h
> > > >> +++ b/include/linux/skbuff.h
> > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > >> * @csum_level: indicates the number of consecutive checksums found in
> > > >> * the packet minus one that have been verified as
> > > >> * CHECKSUM_UNNECESSARY (max 3)
> > > >> + * @devmem: indicates that all the fragments in this skb are backed by
> > > >> + * device memory.
> > > >> * @dst_pending_confirm: need to confirm neighbour
> > > >> * @decrypted: Decrypted SKB
> > > >> * @slow_gro: state present at GRO time, slower prepare step required
> > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > >> #if IS_ENABLED(CONFIG_IP_SCTP)
> > > >> __u8 csum_not_inet:1;
> > > >> #endif
> > > >> -
> > > >> + __u8 devmem:1;
> > > >> #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > >> __u16 tc_index; /* traffic control index */
> > > >> #endif
> > > >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > >> __skb_zcopy_downgrade_managed(skb);
> > > >> }
> > > >>
> > > >> +/* Return true if frags in this skb are not readable by the host. */
> > > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > > >> +{
> > > >> + return skb->devmem;
> > > >
> > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > > > It better communicates the fact that the stack shouldn't dereference the
> > > > frags (because it has 'devmem' fragments or for some other potential
> > > > future reason).
> > >
> > > +1.
> > >
> > > Also, the flag on the skb is an optimization - a high level signal that
> > > one or more frags is in unreadable memory. There is no requirement that
> > > all of the frags are in the same memory type.
>
> David: maybe there should be such a requirement (that they all are
> unreadable)? Might be easier to support initially; we can relax later
> on.
>
Currently devmem == not_readable, and the restriction is that all the
frags in the same skb must be either all readable or all unreadable
(all devmem or all non-devmem).
> > The flag indicates that the skb contains all devmem dma-buf memory
> > specifically, not generic 'not_readable' frags as the comment says:
> >
> > + * @devmem: indicates that all the fragments in this skb are backed by
> > + * device memory.
> >
> > The reason it's not a generic 'not_readable' flag is because handing
> > off a generic not_readable skb to the userspace is semantically not
> > what we're doing. recvmsg() is augmented in this patch series to
> > return a devmem skb to the user via a cmsg_devmem struct which refers
> > specifically to the memory in the dma-buf. recvmsg() in this patch
> > series is not augmented to give any 'not_readable' skb to the
> > userspace.
> >
> > IMHO skb->devmem + an skb_frags_not_readable() as implemented is
> > correct. If a new type of unreadable skbs are introduced to the stack,
> > I imagine the stack would implement:
> >
> > 1. new header flag: skb->newmem
> > 2.
> >
> > static inline bool skb_frags_not_readable(const struct skb_buff *skb)
> > {
> > return skb->devmem || skb->newmem;
> > }
> >
> > 3. tcp_recvmsg_devmem() would handle skb->devmem skbs is in this patch
> > series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.
>
> You copy it to the userspace in a special way because your frags
> are page_is_page_pool_iov(). I agree with David, the skb bit is
> just and optimization.
>
> For most of the core stack, it doesn't matter why your skb is not
> readable. For a few places where it matters (recvmsg?), you can
> double-check your frags (all or some) with page_is_page_pool_iov.
>
I see, we can do that then. I.e. make the header flag 'not_readable'
and check the frags to decide to delegate to tcp_recvmsg_devmem() or
something else. We can even assume not_readable == devmem because
currently devmem is the only type of unreadable frag currently.
> Unrelated: we probably need socket to dmabuf association as well (via
> netlink or something).
Not sure this is possible. The dma-buf is bound to the rx-queue, and
any packets that land on that rx-queue are bound to that dma-buf,
regardless of which socket that packet belongs to. So the association
IMO must be rx-queue to dma-buf, not socket to dma-buf.
> We are fundamentally receiving into and sending from a dmabuf (devmem ==
> dmabuf).
> And once you have this association, recvmsg shouldn't need any new
> special flags.
--
Thanks,
Mina
^ permalink raw reply
* [PATCH] s390/qeth: Fix typo 'weed' in comment
From: Kuan-Wei Chiu @ 2023-11-06 22:20 UTC (permalink / raw)
To: wintera, wenjia, hca, gor, agordeev
Cc: borntraeger, svens, linux-s390, netdev, linux-kernel,
Kuan-Wei Chiu
Replace 'weed' with 'we' in the comment.
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
---
drivers/s390/net/qeth_core_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/s390/net/qeth_core_main.c b/drivers/s390/net/qeth_core_main.c
index 6af2511e070c..cf8506d0f185 100644
--- a/drivers/s390/net/qeth_core_main.c
+++ b/drivers/s390/net/qeth_core_main.c
@@ -3675,7 +3675,7 @@ static void qeth_flush_queue(struct qeth_qdio_out_q *queue)
static void qeth_check_outbound_queue(struct qeth_qdio_out_q *queue)
{
/*
- * check if weed have to switch to non-packing mode or if
+ * check if we have to switch to non-packing mode or if
* we have to get a pci flag out on the queue
*/
if ((atomic_read(&queue->used_buffers) <= QETH_LOW_WATERMARK_PACK) ||
--
2.25.1
^ permalink raw reply related
* Re: [PATCH v9 9/17] bpf,lsm: refactor bpf_prog_alloc/bpf_prog_free LSM hooks
From: Paul Moore @ 2023-11-06 22:29 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Andrii Nakryiko, bpf, netdev, brauner, linux-fsdevel,
linux-security-module, keescook, kernel-team, sargun
In-Reply-To: <CAEf4Bzaxv4uHK=+_vwZuvgBgq8L6d4JwxTSGxZgU44LwWYhDug@mail.gmail.com>
On Mon, Nov 6, 2023 at 2:03 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
> On Sun, Nov 5, 2023 at 9:01 PM Paul Moore <paul@paul-moore.com> wrote:
> > On Nov 3, 2023 Andrii Nakryiko <andrii@kernel.org> wrote:
> > >
> > > Based on upstream discussion ([0]), rework existing
> > > bpf_prog_alloc_security LSM hook. Rename it to bpf_prog_load and instead
> > > of passing bpf_prog_aux, pass proper bpf_prog pointer for a full BPF
> > > program struct. Also, we pass bpf_attr union with all the user-provided
> > > arguments for BPF_PROG_LOAD command. This will give LSMs as much
> > > information as we can basically provide.
> > >
> > > The hook is also BPF token-aware now, and optional bpf_token struct is
> > > passed as a third argument. bpf_prog_load LSM hook is called after
> > > a bunch of sanity checks were performed, bpf_prog and bpf_prog_aux were
> > > allocated and filled out, but right before performing full-fledged BPF
> > > verification step.
> > >
> > > bpf_prog_free LSM hook is now accepting struct bpf_prog argument, for
> > > consistency. SELinux code is adjusted to all new names, types, and
> > > signatures.
> > >
> > > Note, given that bpf_prog_load (previously bpf_prog_alloc) hook can be
> > > used by some LSMs to allocate extra security blob, but also by other
> > > LSMs to reject BPF program loading, we need to make sure that
> > > bpf_prog_free LSM hook is called after bpf_prog_load/bpf_prog_alloc one
> > > *even* if the hook itself returned error. If we don't do that, we run
> > > the risk of leaking memory. This seems to be possible today when
> > > combining SELinux and BPF LSM, as one example, depending on their
> > > relative ordering.
> > >
> > > Also, for BPF LSM setup, add bpf_prog_load and bpf_prog_free to
> > > sleepable LSM hooks list, as they are both executed in sleepable
> > > context. Also drop bpf_prog_load hook from untrusted, as there is no
> > > issue with refcount or anything else anymore, that originally forced us
> > > to add it to untrusted list in c0c852dd1876 ("bpf: Do not mark certain LSM
> > > hook arguments as trusted"). We now trigger this hook much later and it
> > > should not be an issue anymore.
> >
> > See my comment below, but it isn't clear to me if this means it is okay
> > to have `BTF_ID(func, bpf_lsm_bpf_prog_free)` called twice. It probably
> > would be a good idea to get KP, BPF LSM maintainer, to review this change
> > as well to make sure this looks good to him.
> >
> > > [0] https://lore.kernel.org/bpf/9fe88aef7deabbe87d3fc38c4aea3c69.paul@paul-moore.com/
> > >
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > ---
> > > include/linux/lsm_hook_defs.h | 5 +++--
> > > include/linux/security.h | 12 +++++++-----
> > > kernel/bpf/bpf_lsm.c | 5 +++--
> > > kernel/bpf/syscall.c | 25 +++++++++++++------------
> > > security/security.c | 25 +++++++++++++++----------
> > > security/selinux/hooks.c | 15 ++++++++-------
> > > 6 files changed, 49 insertions(+), 38 deletions(-)
> >
> > ...
> >
> > > diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c
> > > index e14c822f8911..3e956f6302f3 100644
> > > --- a/kernel/bpf/bpf_lsm.c
> > > +++ b/kernel/bpf/bpf_lsm.c
> > > @@ -263,6 +263,8 @@ BTF_ID(func, bpf_lsm_bpf_map)
> > > BTF_ID(func, bpf_lsm_bpf_map_alloc_security)
> > > BTF_ID(func, bpf_lsm_bpf_map_free_security)
> > > BTF_ID(func, bpf_lsm_bpf_prog)
> > > +BTF_ID(func, bpf_lsm_bpf_prog_load)
> > > +BTF_ID(func, bpf_lsm_bpf_prog_free)
> > > BTF_ID(func, bpf_lsm_bprm_check_security)
> > > BTF_ID(func, bpf_lsm_bprm_committed_creds)
> > > BTF_ID(func, bpf_lsm_bprm_committing_creds)
> > > @@ -346,8 +348,7 @@ BTF_SET_END(sleepable_lsm_hooks)
> > >
> > > BTF_SET_START(untrusted_lsm_hooks)
> > > BTF_ID(func, bpf_lsm_bpf_map_free_security)
> > > -BTF_ID(func, bpf_lsm_bpf_prog_alloc_security)
> > > -BTF_ID(func, bpf_lsm_bpf_prog_free_security)
> > > +BTF_ID(func, bpf_lsm_bpf_prog_free)
> > > BTF_ID(func, bpf_lsm_file_alloc_security)
> > > BTF_ID(func, bpf_lsm_file_free_security)
> > > #ifdef CONFIG_SECURITY_NETWORK
> >
> > It looks like you're calling the BTF_ID() macro on bpf_lsm_bpf_prog_free
> > twice? I would have expected a only one macro call for each bpf_prog_load
> > and bpf_prog_free, is that a bad assuption?
> >
>
> Yeah, there is no problem having multiple BTF_ID() invocations for the
> same function. BTF_ID() macro (conceptually) emits a relocation that
> will instruct resolve_btfids tool to put an actual BTF ID for the
> specified function in a designated 4-byte slot.
>
> In this case, we have two separate lists: sleepable_lsm_hooks and
> untrusted_lsm_hooks, so we need two separate BTF_ID() entries for the
> same function. It's expected to be duplicated.
Okay, thanks for the explanation. It jumped out as a deviation from
the current code when I was looking at the changes and I was worried
that it was a typo, but it sounds like it's expected and the proper
thing to do.
> > > diff --git a/security/security.c b/security/security.c
> > > index dcb3e7014f9b..5773d446210e 100644
> > > --- a/security/security.c
> > > +++ b/security/security.c
> > > @@ -5180,16 +5180,21 @@ int security_bpf_map_alloc(struct bpf_map *map)
> > > }
> > >
> > > /**
> > > - * security_bpf_prog_alloc() - Allocate a bpf program LSM blob
> > > - * @aux: bpf program aux info struct
> > > + * security_bpf_prog_load() - Check if loading of BPF program is allowed
> > > + * @prog: BPF program object
> > > + * @attr: BPF syscall attributes used to create BPF program
> > > + * @token: BPF token used to grant user access to BPF subsystem
> > > *
> > > - * Initialize the security field inside bpf program.
> > > + * Do a check when the kernel allocates BPF program object and is about to
> > > + * pass it to BPF verifier for additional correctness checks. This is also the
> > > + * point where LSM blob is allocated for LSMs that need them.
> >
> > This is pretty nitpicky, but I'm guessing you may need to make another
> > revision to this patchset, if you do please drop the BPF verifier remark
> > from the comment above.
> >
> > Example: "Perform an access control check when the kernel loads a BPF
> > program and allocates the associated BPF program object. This hook is
> > also responsibile for allocating any required LSM state for the BPF
> > program."
>
> Done, no problem.
With the comment change above, and the clarification of the BTF_ID()
calls, feel free to add my ACK.
Acked-by: Paul Moore <paul@paul-moore.com>
--
paul-moore.com
^ permalink raw reply
* Re: [RFC PATCH v3 10/12] tcp: RX path for devmem TCP
From: Stanislav Fomichev @ 2023-11-06 22:34 UTC (permalink / raw)
To: Willem de Bruijn
Cc: Mina Almasry, netdev, linux-kernel, linux-arch, linux-kselftest,
linux-media, dri-devel, linaro-mm-sig, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
Ilias Apalodimas, Arnd Bergmann, David Ahern, Shuah Khan,
Sumit Semwal, Christian König, Shakeel Butt, Jeroen de Borst,
Praveen Kaligineedi, Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <CAF=yD-+MFpO5Hdqn+Q9X54SBpgcBeJvKTRD53X2oM4s8uVqnAQ@mail.gmail.com>
On 11/06, Willem de Bruijn wrote:
> > > IMHO, we need a better UAPI to receive the tokens and give them back to
> > > the kernel. CMSG + setsockopt(SO_DEVMEM_DONTNEED) get the job done,
> > > but look dated and hacky :-(
> > >
> > > We should either do some kind of user/kernel shared memory queue to
> > > receive/return the tokens (similar to what Jonathan was doing in his
> > > proposal?)
> >
> > I'll take a look at Jonathan's proposal, sorry, I'm not immediately
> > familiar but I wanted to respond :-) But is the suggestion here to
> > build a new kernel-user communication channel primitive for the
> > purpose of passing the information in the devmem cmsg? IMHO that seems
> > like an overkill. Why add 100-200 lines of code to the kernel to add
> > something that can already be done with existing primitives? I don't
> > see anything concretely wrong with cmsg & setsockopt approach, and if
> > we switch to something I'd prefer to switch to an existing primitive
> > for simplicity?
> >
> > The only other existing primitive to pass data outside of the linear
> > buffer is the MSG_ERRQUEUE that is used for zerocopy. Is that
> > preferred? Any other suggestions or existing primitives I'm not aware
> > of?
> >
> > > or bite the bullet and switch to io_uring.
> > >
> >
> > IMO io_uring & socket support are orthogonal, and one doesn't preclude
> > the other. As you know we like to use sockets and I believe there are
> > issues with io_uring adoption at Google that I'm not familiar with
> > (and could be wrong). I'm interested in exploring io_uring support as
> > a follow up but I think David Wei will be interested in io_uring
> > support as well anyway.
>
> I also disagree that we need to replace a standard socket interface
> with something "faster", in quotes.
>
> This interface is not the bottleneck to the target workload.
>
> Replacing the synchronous sockets interface with something more
> performant for workloads where it is, is an orthogonal challenge.
> However we do that, I think that traditional sockets should continue
> to be supported.
>
> The feature may already even work with io_uring, as both recvmsg with
> cmsg and setsockopt have io_uring support now.
I'm not really concerned with faster. I would prefer something cleaner :-)
Or maybe we should just have it documented. With some kind of path
towards beautiful world where we can create dynamic queues..
^ permalink raw reply
* Re: Missing a write memory barrier in tls_init()
From: Jakub Kicinski @ 2023-11-06 22:36 UTC (permalink / raw)
To: Dae R. Jeong
Cc: borisp, john.fastabend, davem, edumazet, pabeni, netdev,
linux-kernel, ywchoi
In-Reply-To: <ZUNLocdNkny6QPn8@dragonet>
On Thu, 2 Nov 2023 16:11:29 +0900 Dae R. Jeong wrote:
> In addition, I believe the {tls_setsockopt, tls_getsockopt}
> implementation is fine because of the address dependency. I think
> load-load reordering is prohibited in this case so we don't need a
> read barrier.
Sounds plausible, could you send a patch?
The smb_wmb() would be better placed in tls_init(), IMHO.
^ permalink raw reply
* Re: [PATCH net-next v2 3/3] net: dsa: realtek: support reset controller
From: Luiz Angelo Daros de Luca @ 2023-11-06 22:37 UTC (permalink / raw)
To: Vladimir Oltean
Cc: Linus Walleij, netdev, alsi, andrew, vivien.didelot, f.fainelli,
davem, kuba, pabeni, robh+dt, krzk+dt, arinc.unal
In-Reply-To: <20231102155521.2yo5qpugdhkjy22x@skbuf>
> On Thu, Nov 02, 2023 at 03:59:48PM +0100, Linus Walleij wrote:
> > I don't know if this is an answer to your question, but look at what I did in
> >
> > drivers/usb/fotg210/Makefile:
> >
> > # This setup links the different object files into one single
> > # module so we don't have to EXPORT() a lot of internal symbols
> > # or create unnecessary submodules.
> > fotg210-objs-y += fotg210-core.o
> > fotg210-objs-$(CONFIG_USB_FOTG210_HCD) += fotg210-hcd.o
> > fotg210-objs-$(CONFIG_USB_FOTG210_UDC) += fotg210-udc.o
> > fotg210-objs := $(fotg210-objs-y)
> > obj-$(CONFIG_USB_FOTG210) += fotg210.o
> >
> > Everything starting with CONFIG_* is a Kconfig option obviously.
> >
> > The final module is just one file named fotg210.ko no matter whether
> > HCD (host controller), UDC (device controller) or both parts were
> > compiled into it. Often you just need one of them, sometimes you may
> > need both.
> >
> > It's a pretty clean example of how you do this "one module from
> > several optional parts" using Kbuild.
>
> To be clear, something like this is what you mean, right?
>
> diff --git a/drivers/net/dsa/realtek/Kconfig b/drivers/net/dsa/realtek/Kconfig
> index 060165a85fb7..857a039fb0f1 100644
> --- a/drivers/net/dsa/realtek/Kconfig
> +++ b/drivers/net/dsa/realtek/Kconfig
> @@ -15,39 +15,37 @@ menuconfig NET_DSA_REALTEK
>
> if NET_DSA_REALTEK
>
> +config NET_DSA_REALTEK_INTERFACE
> + tristate
> + help
> + Common interface driver for accessing Realtek switches, either
> + through MDIO or SMI.
> +
> config NET_DSA_REALTEK_MDIO
> - tristate "Realtek MDIO interface driver"
> - depends on OF
> - depends on NET_DSA_REALTEK_RTL8365MB || NET_DSA_REALTEK_RTL8366RB
> - depends on NET_DSA_REALTEK_RTL8365MB || !NET_DSA_REALTEK_RTL8365MB
> - depends on NET_DSA_REALTEK_RTL8366RB || !NET_DSA_REALTEK_RTL8366RB
> + tristate "Realtek MDIO interface support"
> help
> Select to enable support for registering switches configured
> through MDIO.
>
> config NET_DSA_REALTEK_SMI
> - tristate "Realtek SMI interface driver"
> - depends on OF
> - depends on NET_DSA_REALTEK_RTL8365MB || NET_DSA_REALTEK_RTL8366RB
> - depends on NET_DSA_REALTEK_RTL8365MB || !NET_DSA_REALTEK_RTL8365MB
> - depends on NET_DSA_REALTEK_RTL8366RB || !NET_DSA_REALTEK_RTL8366RB
> + bool "Realtek SMI interface support"
> help
> Select to enable support for registering switches connected
> through SMI.
>
> config NET_DSA_REALTEK_RTL8365MB
> tristate "Realtek RTL8365MB switch subdriver"
> - imply NET_DSA_REALTEK_SMI
> - imply NET_DSA_REALTEK_MDIO
> + select NET_DSA_REALTEK_INTERFACE
> select NET_DSA_TAG_RTL8_4
> + depends on OF
> help
> Select to enable support for Realtek RTL8365MB-VC and RTL8367S.
>
> config NET_DSA_REALTEK_RTL8366RB
> tristate "Realtek RTL8366RB switch subdriver"
> - imply NET_DSA_REALTEK_SMI
> - imply NET_DSA_REALTEK_MDIO
> + select NET_DSA_REALTEK_INTERFACE
> select NET_DSA_TAG_RTL4_A
> + depends on OF
> help
> Select to enable support for Realtek RTL8366RB.
>
> diff --git a/drivers/net/dsa/realtek/Makefile b/drivers/net/dsa/realtek/Makefile
> index 0aab57252a7c..35b7734c0ad0 100644
> --- a/drivers/net/dsa/realtek/Makefile
> +++ b/drivers/net/dsa/realtek/Makefile
> @@ -1,6 +1,15 @@
> # SPDX-License-Identifier: GPL-2.0
> -obj-$(CONFIG_NET_DSA_REALTEK_MDIO) += realtek-mdio.o
> -obj-$(CONFIG_NET_DSA_REALTEK_SMI) += realtek-smi.o
> +
> +obj-$(CONFIG_NET_DSA_REALTEK_INTERFACE) := realtek-interface.o
> +
> +realtek-interface-objs := realtek-interface-common.o
> +ifdef CONFIG_NET_DSA_REALTEK_MDIO
> +realtek-interface-objs += realtek-mdio.o
> +endif
> +ifdef CONFIG_NET_DSA_REALTEK_SMI
> +realtek-interface-objs += realtek-smi.o
> +endif
> +
> obj-$(CONFIG_NET_DSA_REALTEK_RTL8366RB) += rtl8366.o
> rtl8366-objs := rtl8366-core.o rtl8366rb.o
> obj-$(CONFIG_NET_DSA_REALTEK_RTL8365MB) += rtl8365mb.o
> diff --git a/drivers/net/dsa/realtek/realtek-interface-common.c b/drivers/net/dsa/realtek/realtek-interface-common.c
> new file mode 100644
> index 000000000000..bb7c77cdb9e2
> --- /dev/null
> +++ b/drivers/net/dsa/realtek/realtek-interface-common.c
> @@ -0,0 +1,36 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +
> +#include <linux/module.h>
> +
> +#include "realtek-mdio.h"
> +#include "realtek-smi.h"
> +
> +static int __init realtek_interface_init(void)
> +{
> + int err;
> +
> + err = realtek_mdio_init();
> + if (err)
> + return err;
> +
> + err = realtek_smi_init();
> + if (err) {
> + realtek_smi_exit();
> + return err;
> + }
> +
> + return 0;
> +}
> +module_init(realtek_interface_init);
> +
> +static void __exit realtek_interface_exit(void)
> +{
> + realtek_smi_exit();
> + realtek_mdio_exit();
> +}
> +module_exit(realtek_interface_exit);
> +
> +MODULE_AUTHOR("Luiz Angelo Daros de Luca <luizluca@gmail.com>");
> +MODULE_AUTHOR("Linus Walleij <linus.walleij@linaro.org>");
> +MODULE_DESCRIPTION("Driver for interfacing with Realtek switches via MDIO or SMI");
> +MODULE_LICENSE("GPL");
> diff --git a/drivers/net/dsa/realtek/realtek-mdio.c b/drivers/net/dsa/realtek/realtek-mdio.c
> index 292e6d087e8b..6997dec14de2 100644
> --- a/drivers/net/dsa/realtek/realtek-mdio.c
> +++ b/drivers/net/dsa/realtek/realtek-mdio.c
> @@ -1,5 +1,5 @@
> // SPDX-License-Identifier: GPL-2.0+
> -/* Realtek MDIO interface driver
> +/* Realtek MDIO interface support
> *
> * ASICs we intend to support with this driver:
> *
> @@ -19,12 +19,12 @@
> * Copyright (C) 2009-2010 Gabor Juhos <juhosg@openwrt.org>
> */
>
> -#include <linux/module.h>
> #include <linux/of.h>
> #include <linux/overflow.h>
> #include <linux/regmap.h>
>
> #include "realtek.h"
> +#include "realtek-mdio.h"
>
> /* Read/write via mdiobus */
> #define REALTEK_MDIO_CTRL0_REG 31
> @@ -283,8 +283,12 @@ static struct mdio_driver realtek_mdio_driver = {
> .shutdown = realtek_mdio_shutdown,
> };
>
> -mdio_module_driver(realtek_mdio_driver);
> +int realtek_mdio_init(void)
> +{
> + return mdio_driver_register(&realtek_mdio_driver);
> +}
>
> -MODULE_AUTHOR("Luiz Angelo Daros de Luca <luizluca@gmail.com>");
> -MODULE_DESCRIPTION("Driver for Realtek ethernet switch connected via MDIO interface");
> -MODULE_LICENSE("GPL");
> +void realtek_mdio_exit(void)
> +{
> + mdio_driver_unregister(&realtek_mdio_driver);
> +}
> diff --git a/drivers/net/dsa/realtek/realtek-mdio.h b/drivers/net/dsa/realtek/realtek-mdio.h
> new file mode 100644
> index 000000000000..941b4ef9d531
> --- /dev/null
> +++ b/drivers/net/dsa/realtek/realtek-mdio.h
> @@ -0,0 +1,24 @@
> +/* SPDX-License-Identifier: GPL-2.0+ */
> +
> +#ifndef _REALTEK_MDIO_H
> +#define _REALTEK_MDIO_H
> +
> +#if IS_ENABLED(CONFIG_NET_DSA_REALTEK_MDIO)
> +
> +int realtek_mdio_init(void);
> +void realtek_mdio_exit(void);
> +
> +#else
> +
> +static inline int realtek_mdio_init(void)
> +{
> + return 0;
> +}
> +
> +static inline void realtek_mdio_exit(void)
> +{
> +}
> +
> +#endif
> +
> +#endif
> diff --git a/drivers/net/dsa/realtek/realtek-smi.c b/drivers/net/dsa/realtek/realtek-smi.c
> index 755546ed8db6..4c282bfc884d 100644
> --- a/drivers/net/dsa/realtek/realtek-smi.c
> +++ b/drivers/net/dsa/realtek/realtek-smi.c
> @@ -1,5 +1,5 @@
> // SPDX-License-Identifier: GPL-2.0+
> -/* Realtek Simple Management Interface (SMI) driver
> +/* Realtek Simple Management Interface (SMI) interface
> * It can be discussed how "simple" this interface is.
> *
> * The SMI protocol piggy-backs the MDIO MDC and MDIO signals levels
> @@ -26,7 +26,6 @@
> */
>
> #include <linux/kernel.h>
> -#include <linux/module.h>
> #include <linux/device.h>
> #include <linux/spinlock.h>
> #include <linux/skbuff.h>
> @@ -40,6 +39,7 @@
> #include <linux/if_bridge.h>
>
> #include "realtek.h"
> +#include "realtek-smi.h"
>
> #define REALTEK_SMI_ACK_RETRY_COUNT 5
>
> @@ -560,8 +560,13 @@ static struct platform_driver realtek_smi_driver = {
> .remove_new = realtek_smi_remove,
> .shutdown = realtek_smi_shutdown,
> };
> -module_platform_driver(realtek_smi_driver);
>
> -MODULE_AUTHOR("Linus Walleij <linus.walleij@linaro.org>");
> -MODULE_DESCRIPTION("Driver for Realtek ethernet switch connected via SMI interface");
> -MODULE_LICENSE("GPL");
> +int realtek_smi_init(void)
> +{
> + return platform_driver_register(&realtek_smi_driver);
> +}
> +
> +void realtek_smi_exit(void)
> +{
> + platform_driver_unregister(&realtek_smi_driver);
> +}
> diff --git a/drivers/net/dsa/realtek/realtek-smi.h b/drivers/net/dsa/realtek/realtek-smi.h
> new file mode 100644
> index 000000000000..9a4838321f94
> --- /dev/null
> +++ b/drivers/net/dsa/realtek/realtek-smi.h
> @@ -0,0 +1,24 @@
> +/* SPDX-License-Identifier: GPL-2.0+ */
> +
> +#ifndef _REALTEK_SMI_H
> +#define _REALTEK_SMI_H
> +
> +#if IS_ENABLED(CONFIG_NET_DSA_REALTEK_SMI)
> +
> +int realtek_smi_init(void);
> +void realtek_smi_exit(void);
> +
> +#else
> +
> +static inline int realtek_smi_init(void)
> +{
> + return 0;
> +}
> +
> +static inline void realtek_smi_exit(void)
> +{
> +}
> +
> +#endif
> +
> +#endif
> --
> 2.34.1
>
>
> It looks pretty reasonable to me. More stuff could go into
> realtek-interface-common.c, that could be called directly from
> realtek-smi.c and realtek-mdio.c without exporting anything.
>
> I've eliminated the possibility for the SMI and MDIO options to be
> anything other than y or n, because only a single interface module
> (the common one) exists, and the y/n/m quality of that is
> implied/selected by the drivers which depend on it. I hope I wasn't too
> trigger-happy with this.
Vladimir, you did all the work ;-) Thanks!
I implemented this code (with the fixes), and this is the result:
30272 drivers/net/dsa/realtek/realtek-mdio.ko
42176 drivers/net/dsa/realtek/realtek-smi.ko
114544 drivers/net/dsa/realtek/rtl8365mb.ko
115080 drivers/net/dsa/realtek/rtl8366.ko
87264 drivers/net/dsa/realtek/realtek-interface.ko
114544 drivers/net/dsa/realtek/rtl8365mb.ko
115080 drivers/net/dsa/realtek/rtl8366.ko
It is still strange why merging both modules into a single
realtek-interface.ko results in a slightly larger file. Anyway, the
difference is not that significant. I still think that some systems
will miss that extra 30kb or 40kb for the interface code they don't
need.
Your proposed Kconfig does not attempt to avoid a realtek-interface
without both interfaces or without support for both switch families.
Is it possible in Kconfig to force it to, at least, select one of the
interfaces and one of the switches? Is it okay to leave it
unconstrained?
If merging the modules is the accepted solution, it makes me wonder if
rtl8365mb.ko and rtl8366.ko should get merged as well into a single
realtek-switch.ko. They are a hard dependency for realtek-interface.ko
(previously on each interface module). If the kernel is custom-built,
it would still be possible to exclude one switch family at build time.
I'll use these modules in OpenWrt, which builds a single kernel for a
bunch of devices. Is there a way to weakly depend on a module,
allowing the system to load only a single subdriver? Is it worth it?
Regards,
Luiz
^ permalink raw reply
* Re: [PATCH net v2 4/4] net: ethernet: cortina: Handle large frames
From: Linus Walleij @ 2023-11-06 22:44 UTC (permalink / raw)
To: Vladimir Oltean
Cc: Hans Ulli Kroll, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Michał Mirosław, Andrew Lunn,
linux-arm-kernel, netdev, linux-kernel
In-Reply-To: <20231106132626.orn5r57cc7n5ditj@skbuf>
On Mon, Nov 6, 2023 at 2:26 PM Vladimir Oltean <olteanv@gmail.com> wrote:
> On Sun, Nov 05, 2023 at 09:57:26PM +0100, Linus Walleij wrote:
> > If we start sending bigger frames (after first bumping up the MTU
> > on both interfaces sending and receiveing the frames), truncated
> > packets start to appear on the target such as in this tcpdump
> > resulting from ping -s 1474:
>
> A bit related: what is gmac_fix_features() supposed to do? I see it
> unsets GMAC_OFFLOAD_FEATURES when the MTU goes over a certain limit,
> and that also includes NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM. Is that
> limit correct, or is it supposed to kick in sooner, to allow
> validate_xmit_skb() -> skb_csum_hwoffload_help() do the software
> checksuum for you? I'm not sure whether that was the intention.
That indeed seems like the intention. But it's a bit suboptimal, because it
disables hardware checksum just because the MTU goes over a
certain level, and stops using the hardware checksum also for all
packets smaller than the MTU :(
I'll delete this and make the driver slot in the SW fallback per-packet
instead, I think it is fair to assume that most packets will be < MTU
and it is really just a question of where the fallback gets called.
> > Suggested-by: Vladimir Oltean <olteanv@gmail.com>
>
> To be clear, I didn't suggest any of this. I just pointed towards the gemini.c
> driver as being the problem. Please remove my Suggested-by tag.
OK sorry, it was just my way of trying to provide credit where
credit is due, because you helped so much with this bug.
> > - if (skb->ip_summed != CHECKSUM_NONE) {
> > + if (skb->len >= ETH_FRAME_LEN) {
> > + /* Hardware offloaded checksumming isn't working on frames
> > + * bigger than 1514 bytes. Perhaps the buffer is only 1518
> > + * bytes fitting a normal frame and a checksum?
> > + * Just use software checksumming and bypass on bigger frames.
> > + */
> > + if (skb->ip_summed == CHECKSUM_PARTIAL) {
> > + ret = skb_checksum_help(skb);
> > + if (ret)
> > + return ret;
> > + }
> > + word1 |= TSS_BYPASS_BIT;
> > + } else if (skb->ip_summed == CHECKSUM_PARTIAL) {
> > int tcp = 0;
> >
> > if (skb->protocol == htons(ETH_P_IP)) {
>
> [ context: tag_rtl4_a.c is a "category 2", aka "Ethertype", tagger ]
>
> We say this in Documentation/networking/dsa/dsa.rst:
>
> Checksum offload should work with category 1 and 2 taggers when the DSA conduit
> driver declares NETIF_F_HW_CSUM in vlan_features
> and looks at csum_start and csum_offset.
> For those cases, DSA will shift the checksum start and offset by
> the tag size. If the DSA conduit driver still uses the legacy NETIF_F_IP_CSUM
> or NETIF_F_IPV6_CSUM in vlan_features, the offload might only work if the
> offload hardware already expects that specific tag (perhaps due to matching
> vendors).
Since things work smoothly I can only assume that the Gemini
checksum engine actually knows about the Realtek ethertype (0x8899)
and the protocol (0xa) and takes action on that, since the switch works.
OR: it has some heuristic on for how to handle it. (Such as looking for
a valid TCP or UDP header to figure out where to put the checksum.)
But I have no idea how it does it. It doesn't have a firmware AFAIK.
Examples listed were ICMP so just IP checksums but I tried for example
SSH, and HTTP and packets look like this:
22:51:35.457191 9a:ec:30:5a:46:96 (oui Unknown) > bc:ae:c5:6b:a8:3d
(oui Unknown),
ethertype IPv4 (0x0800), length 434: (tos 0x48, ttl 64, id 8221,
offset 0, flags [DF], proto TCP (6), length 420)
_gateway.48102 > fedora.ssh: Flags [P.], cksum 0xcf1b (correct),
seq 811:1179, ack 2310,
win 2054, options [nop,nop,TS val 74858741 ecr 1981407207], length 368
Checksum correct. So...
> DSA user ports inherit those flags from the conduit, and it is up to
> the driver to correctly fall back to software checksum when the IP header is not
> where the hardware expects. If that check is ineffective, the packets might go
> to the network without a proper checksum (the checksum field will have the
> pseudo IP header sum).
It definately does not contain the pseudo IP header sum because
it would be the same all the time but tcpdump is happy:
cksum 0xcf1b (correct)
cksum 0x0655 (correct)
cksum 0xd247 (correct)
cksum 0x06b1 (correct)
> Shouldn't "word1 |= TSS_BYPASS_BIT;" be done depending on skb->protocol,
> rather than depending on skb->len?!
>
> if (skb->protocol == htons(ETH_P_IP)) {
> word1 |= TSS_IP_CHKSUM_BIT;
> tcp = ip_hdr(skb)->protocol == IPPROTO_TCP;
> } else { /* IPv6 */
> word1 |= TSS_IPV6_ENABLE_BIT;
> tcp = ipv6_hdr(skb)->nexthdr == IPPROTO_TCP;
> } // here
> word1 |= TSS_BYPASS_BIT;
Oddly it assumes everything is either TCP or UDP on
IPv4 or IPv6. And yet things such as ICMP work just fine.
I think the checksum engine can contain some various
heuristics, such as if it cannot recognize what is coming
in as the selected TCP or UDP, it will pass right through.
> Gemini should never attempt to provide checksums for DSA-tagged packets
> unless it is able to take skb->csum_start into consideration, otherwise
> it will get it wrong.
>
> This is somewhat independent of the other problem you've found, which
> seems to be that large non-DSA packets get truncated anyway. But not
> bypassing TX checksum offload truncates a packet? Hmm, strange.
I have a theory about that in the comment, which is that when they
engineered the hardware they only put in a hardware buffer for
1518 bytes in the checksum engine. If you try to put in any more
it gets truncated. It's a reasonable guess.
If you do not set the checkumming engine to "bypass" it will try
to fit the incoming paket into the checksumming buffer, and then
it will look to see if it can find the right TCP or UDP headers.
If it can't it will pass the packet out from the buffer without doing
any changes. But the buffer is just 1518 bytes, which means that
no matter what kind of package it is, it will get truncated if it
does not fit into the checksumming buffer.
This would give exactly the behaviour we're seeing.
Yours,
Linus Walleij
^ permalink raw reply
* Re: [PATCH v9 11/17] bpf,lsm: add BPF token LSM hooks
From: Paul Moore @ 2023-11-06 22:46 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Andrii Nakryiko, bpf, netdev, brauner, linux-fsdevel,
linux-security-module, keescook, kernel-team, sargun
In-Reply-To: <CAEf4BzbGjLQV0CsTwawiqHaGf4eObMQBJT-bpDpWOoQ8hNNcVQ@mail.gmail.com>
On Mon, Nov 6, 2023 at 2:17 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
> On Sun, Nov 5, 2023 at 9:01 PM Paul Moore <paul@paul-moore.com> wrote:
> > On Nov 3, 2023 Andrii Nakryiko <andrii@kernel.org> wrote:
> > >
> > > Wire up bpf_token_create and bpf_token_free LSM hooks, which allow to
> > > allocate LSM security blob (we add `void *security` field to struct
> > > bpf_token for that), but also control who can instantiate BPF token.
> > > This follows existing pattern for BPF map and BPF prog.
> > >
> > > Also add security_bpf_token_allow_cmd() and security_bpf_token_capable()
> > > LSM hooks that allow LSM implementation to control and negate (if
> > > necessary) BPF token's delegation of a specific bpf_cmd and capability,
> > > respectively.
> > >
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > ---
> > > include/linux/bpf.h | 3 ++
> > > include/linux/lsm_hook_defs.h | 5 +++
> > > include/linux/security.h | 25 +++++++++++++++
> > > kernel/bpf/bpf_lsm.c | 4 +++
> > > kernel/bpf/token.c | 13 ++++++--
> > > security/security.c | 60 +++++++++++++++++++++++++++++++++++
> > > 6 files changed, 107 insertions(+), 3 deletions(-)
> >
> > ...
> >
> > > diff --git a/include/linux/security.h b/include/linux/security.h
> > > index 08fd777cbe94..1d6edbf45d1c 100644
> > > --- a/include/linux/security.h
> > > +++ b/include/linux/security.h
> > > @@ -60,6 +60,7 @@ struct fs_parameter;
> > > enum fs_value_type;
> > > struct watch;
> > > struct watch_notification;
> > > +enum bpf_cmd;
> >
> > Yes, I think it's fine to include bpf.h in security.h instead of the
> > forward declaration.
> >
> > > /* Default (no) options for the capable function */
> > > #define CAP_OPT_NONE 0x0
> > > @@ -2031,6 +2032,11 @@ extern void security_bpf_map_free(struct bpf_map *map);
> > > extern int security_bpf_prog_load(struct bpf_prog *prog, union bpf_attr *attr,
> > > struct bpf_token *token);
> > > extern void security_bpf_prog_free(struct bpf_prog *prog);
> > > +extern int security_bpf_token_create(struct bpf_token *token, union bpf_attr *attr,
> > > + struct path *path);
> > > +extern void security_bpf_token_free(struct bpf_token *token);
> > > +extern int security_bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd);
> > > +extern int security_bpf_token_capable(const struct bpf_token *token, int cap);
> > > #else
> > > static inline int security_bpf(int cmd, union bpf_attr *attr,
> > > unsigned int size)
> > > @@ -2065,6 +2071,25 @@ static inline int security_bpf_prog_load(struct bpf_prog *prog, union bpf_attr *
> > >
> > > static inline void security_bpf_prog_free(struct bpf_prog *prog)
> > > { }
> > > +
> > > +static inline int security_bpf_token_create(struct bpf_token *token, union bpf_attr *attr,
> > > + struct path *path)
> > > +{
> > > + return 0;
> > > +}
> > > +
> > > +static inline void security_bpf_token_free(struct bpf_token *token)
> > > +{ }
> > > +
> > > +static inline int security_bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd)
> > > +{
> > > + return 0;
> > > +}
> > > +
> > > +static inline int security_bpf_token_capable(const struct bpf_token *token, int cap)
> > > +{
> > > + return 0;
> > > +}
> >
> > Another nitpick, but I would prefer to shorten
> > security_bpf_token_allow_cmd() renamed to security_bpf_token_cmd() both
> > to shorten the name and to better fit convention. I realize the caller
> > is named bpf_token_allow_cmd() but I'd still rather see the LSM hook
> > with the shorter name.
>
> Makes sense, renamed to security_bpf_token_cmd() and updated hook name as well
Thanks.
> > > diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c
> > > index 35e6f55c2a41..5d04da54faea 100644
> > > --- a/kernel/bpf/token.c
> > > +++ b/kernel/bpf/token.c
> > > @@ -7,11 +7,12 @@
> > > #include <linux/idr.h>
> > > #include <linux/namei.h>
> > > #include <linux/user_namespace.h>
> > > +#include <linux/security.h>
> > >
> > > bool bpf_token_capable(const struct bpf_token *token, int cap)
> > > {
> > > /* BPF token allows ns_capable() level of capabilities */
> > > - if (token) {
> > > + if (token && security_bpf_token_capable(token, cap) == 0) {
> > > if (ns_capable(token->userns, cap))
> > > return true;
> > > if (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN))
> >
> > We typically perform the capability based access controls prior to the
> > LSM controls, meaning if we want to the token controls to work in a
> > similar way we should do something like this:
> >
> > bool bpf_token_capable(...)
> > {
> > if (token) {
> > if (ns_capable(token, cap) ||
> > (cap != ADMIN && ns_capable(token, ADMIN)))
> > return security_bpf_token_capable(token, cap);
> > }
> > return capable(cap) || (cap != ADMIN && capable(...))
> > }
>
> yep, makes sense, I changed it as you suggested above
Thanks again.
> > > @@ -28,6 +29,7 @@ void bpf_token_inc(struct bpf_token *token)
> > >
> > > static void bpf_token_free(struct bpf_token *token)
> > > {
> > > + security_bpf_token_free(token);
> > > put_user_ns(token->userns);
> > > kvfree(token);
> > > }
> > > @@ -172,6 +174,10 @@ int bpf_token_create(union bpf_attr *attr)
> > > token->allowed_progs = mnt_opts->delegate_progs;
> > > token->allowed_attachs = mnt_opts->delegate_attachs;
> > >
> > > + err = security_bpf_token_create(token, attr, &path);
> > > + if (err)
> > > + goto out_token;
> > > +
> > > fd = get_unused_fd_flags(O_CLOEXEC);
> > > if (fd < 0) {
> > > err = fd;
> > > @@ -216,8 +222,9 @@ bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd)
> > > {
> > > if (!token)
> > > return false;
> > > -
> > > - return token->allowed_cmds & (1ULL << cmd);
> > > + if (!(token->allowed_cmds & (1ULL << cmd)))
> > > + return false;
> > > + return security_bpf_token_allow_cmd(token, cmd) == 0;
> >
> > I'm not sure how much it really matters, but someone might prefer
> > the '!!' approach/style over '== 0'.
>
> it would have to be !security_bpf_token_cmd(), right?
Yeah :P
In most, although definitely not all, kernel functions when something
returns 0 we consider that the positive/success case, with non-zero
values being some sort of failure. I must have defaulted to that
logic here, but you are correct that just a single negation would be
needed here.
> And that single
> negation is just very confusing when dealing with int-returning
> function. I find it much easier to make sure the logic is correct when
> we have explicit `== 0`.
That's fine, it's something I've seen mentioned over the years and
thought I might offer it as a comment. I can read either approach
just fine :)
Anyway, with the other changes mentioned above, e.g. naming and
permission ordering, feel free to add my ACK.
Acked-by: Paul Moore <paul@paul-moore.com>
--
paul-moore.com
^ permalink raw reply
* Re: [PATCH] Documentation: Document the Netlink spec
From: Jakub Kicinski @ 2023-11-06 22:51 UTC (permalink / raw)
To: Breno Leitao; +Cc: corbet, linux-doc, netdev, pabeni, edumazet
In-Reply-To: <20231103135622.250314-1-leitao@debian.org>
On Fri, 3 Nov 2023 06:56:22 -0700 Breno Leitao wrote:
> This is a Sphinx extension that parses the Netlink YAML spec files
> (Documentation/netlink/specs/), and generates a rst file to be
> displayed into Documentation pages.
>
> Create a new Documentation/networking/netlink_spec page, and a sub-page
> for each Netlink spec that needs to be documented, such as ethtool,
> devlink, netdev, etc.
>
> Create a Sphinx directive extension that reads the YAML spec
> (located under Documentation/netlink/specs), parses it and returns a RST
> string that is inserted where the Sphinx directive was called.
> +=======================================
> +Family ``fou`` netlink specification
> +=======================================
nit: bad length of the ==== marker lines
> +def parse_attributes_set(entries: List[Dict[str, Any]]) -> str:
I'd rename to parse_attr_sets (plural sets).
> + """Parse attribute from attribute-set"""
> + preprocessed = ["name", "type"]
> + ignored = ["checks"]
> + lines = []
> +
> + for entry in entries:
> + lines.append(rst_bullet(bold(entry["name"])))
This would be better as subsubtitle.
> + for attr in entry["attributes"]:
> + type_ = attr.get("type")
> + attr_line = bold(attr["name"])
> + if type_:
> + # Add the attribute type in the same line
> + attr_line += f" ({inline(type_)})"
> +
> + lines.append(rst_bullet(attr_line, 2))
And this actually, probably also a sub^3-title, because we'll want to
link to the specific attributes sooner or later. And linking to bullet
points isn't a thing (AFAIU?)
^ permalink raw reply
* Re: [syzbot] [dccp?] general protection fault in dccp_write_xmit (2)
From: syzbot @ 2023-11-06 22:52 UTC (permalink / raw)
To: bragathemanick0908, davem, dccp, edumazet, kuba, linux-kernel,
netdev, pabeni, syzkaller-bugs
In-Reply-To: <0000000000009e122006088a2b8d@google.com>
syzbot has found a reproducer for the following issue on:
HEAD commit: d2f51b3516da Merge tag 'rtc-6.7' of git://git.kernel.org/p..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=16675f40e80000
kernel config: https://syzkaller.appspot.com/x/.config?x=cd456c5e6582895e
dashboard link: https://syzkaller.appspot.com/bug?extid=c71bc336c5061153b502
compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=167ac787680000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1799f190e80000
Downloadable assets:
disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7bc7510fe41f/non_bootable_disk-d2f51b35.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/d678011e498e/vmlinux-d2f51b35.xz
kernel image: https://storage.googleapis.com/syzbot-assets/4f6ed772923d/bzImage-d2f51b35.xz
IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+c71bc336c5061153b502@syzkaller.appspotmail.com
general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 3 PID: 5345 Comm: syz-executor785 Not tainted 6.6.0-syzkaller-14651-gd2f51b3516da #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:ccid_hc_tx_send_packet net/dccp/ccid.h:166 [inline]
RIP: 0010:dccp_write_xmit+0x66/0x1d0 net/dccp/output.c:356
Code: 00 48 85 c0 49 89 c4 0f 84 03 01 00 00 e8 82 5f cd f7 41 80 3e 00 0f 85 45 01 00 00 48 8b 9d f8 08 00 00 48 89 d8 48 c1 e8 03 <42> 80 3c 28 00 0f 85 1f 01 00 00 48 8b 1b 48 8d bb b0 00 00 00 48
RSP: 0018:ffffc90003797870 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff88802a2bd000 RSI: ffffffff89ba330e RDI: ffff88802d7d9540
RBP: ffff88802d7d9540 R08: 0000000000000001 R09: fffffbfff23e11e9
R10: ffffffff91f08f4f R11: ffffffff915e5030 R12: ffff8880186c9cc0
R13: dffffc0000000000 R14: ffffed1005afb3c7 R15: ffff88802d7d9e38
FS: 00007f263ceef6c0(0000) GS:ffff88806b900000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020002980 CR3: 000000001b2ab000 CR4: 0000000000350ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
dccp_sendmsg+0x968/0xcc0 net/dccp/proto.c:801
inet_sendmsg+0x9d/0xe0 net/ipv4/af_inet.c:847
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0xd5/0x180 net/socket.c:745
____sys_sendmsg+0x2ac/0x940 net/socket.c:2588
___sys_sendmsg+0x135/0x1d0 net/socket.c:2642
__sys_sendmmsg+0x1a1/0x450 net/socket.c:2728
__do_sys_sendmmsg net/socket.c:2757 [inline]
__se_sys_sendmmsg net/socket.c:2754 [inline]
__x64_sys_sendmmsg+0x9c/0x100 net/socket.c:2754
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x3f/0x110 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x63/0x6b
RIP: 0033:0x7f263cf53559
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 61 1a 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f263ceef218 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
RAX: ffffffffffffffda RBX: 00007f263cfdd438 RCX: 00007f263cf53559
RDX: 0400000000000239 RSI: 0000000020002980 RDI: 0000000000000006
RBP: 00007f263cfdd430 R08: 00007fff5b335167 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f263cfdd43c
R13: 00007f263cfaa504 R14: 0400000000000239 R15: 00007fff5b335168
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:ccid_hc_tx_send_packet net/dccp/ccid.h:166 [inline]
RIP: 0010:dccp_write_xmit+0x66/0x1d0 net/dccp/output.c:356
Code: 00 48 85 c0 49 89 c4 0f 84 03 01 00 00 e8 82 5f cd f7 41 80 3e 00 0f 85 45 01 00 00 48 8b 9d f8 08 00 00 48 89 d8 48 c1 e8 03 <42> 80 3c 28 00 0f 85 1f 01 00 00 48 8b 1b 48 8d bb b0 00 00 00 48
RSP: 0018:ffffc90003797870 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff88802a2bd000 RSI: ffffffff89ba330e RDI: ffff88802d7d9540
RBP: ffff88802d7d9540 R08: 0000000000000001 R09: fffffbfff23e11e9
R10: ffffffff91f08f4f R11: ffffffff915e5030 R12: ffff8880186c9cc0
R13: dffffc0000000000 R14: ffffed1005afb3c7 R15: ffff88802d7d9e38
FS: 00007f263ceef6c0(0000) GS:ffff88806b900000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020002980 CR3: 000000001b2ab000 CR4: 0000000000350ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
----------------
Code disassembly (best guess):
0: 00 48 85 add %cl,-0x7b(%rax)
3: c0 49 89 c4 rorb $0xc4,-0x77(%rcx)
7: 0f 84 03 01 00 00 je 0x110
d: e8 82 5f cd f7 call 0xf7cd5f94
12: 41 80 3e 00 cmpb $0x0,(%r14)
16: 0f 85 45 01 00 00 jne 0x161
1c: 48 8b 9d f8 08 00 00 mov 0x8f8(%rbp),%rbx
23: 48 89 d8 mov %rbx,%rax
26: 48 c1 e8 03 shr $0x3,%rax
* 2a: 42 80 3c 28 00 cmpb $0x0,(%rax,%r13,1) <-- trapping instruction
2f: 0f 85 1f 01 00 00 jne 0x154
35: 48 8b 1b mov (%rbx),%rbx
38: 48 8d bb b0 00 00 00 lea 0xb0(%rbx),%rdi
3f: 48 rex.W
---
If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.
^ permalink raw reply
* Re: [RFC PATCH v3 10/12] tcp: RX path for devmem TCP
From: Willem de Bruijn @ 2023-11-06 22:55 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: Mina Almasry, netdev, linux-kernel, linux-arch, linux-kselftest,
linux-media, dri-devel, linaro-mm-sig, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
Ilias Apalodimas, Arnd Bergmann, David Ahern, Shuah Khan,
Sumit Semwal, Christian König, Shakeel Butt, Jeroen de Borst,
Praveen Kaligineedi, Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <ZUlp8XutSAScKs_0@google.com>
On Mon, Nov 6, 2023 at 2:34 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On 11/06, Willem de Bruijn wrote:
> > > > IMHO, we need a better UAPI to receive the tokens and give them back to
> > > > the kernel. CMSG + setsockopt(SO_DEVMEM_DONTNEED) get the job done,
> > > > but look dated and hacky :-(
> > > >
> > > > We should either do some kind of user/kernel shared memory queue to
> > > > receive/return the tokens (similar to what Jonathan was doing in his
> > > > proposal?)
> > >
> > > I'll take a look at Jonathan's proposal, sorry, I'm not immediately
> > > familiar but I wanted to respond :-) But is the suggestion here to
> > > build a new kernel-user communication channel primitive for the
> > > purpose of passing the information in the devmem cmsg? IMHO that seems
> > > like an overkill. Why add 100-200 lines of code to the kernel to add
> > > something that can already be done with existing primitives? I don't
> > > see anything concretely wrong with cmsg & setsockopt approach, and if
> > > we switch to something I'd prefer to switch to an existing primitive
> > > for simplicity?
> > >
> > > The only other existing primitive to pass data outside of the linear
> > > buffer is the MSG_ERRQUEUE that is used for zerocopy. Is that
> > > preferred? Any other suggestions or existing primitives I'm not aware
> > > of?
> > >
> > > > or bite the bullet and switch to io_uring.
> > > >
> > >
> > > IMO io_uring & socket support are orthogonal, and one doesn't preclude
> > > the other. As you know we like to use sockets and I believe there are
> > > issues with io_uring adoption at Google that I'm not familiar with
> > > (and could be wrong). I'm interested in exploring io_uring support as
> > > a follow up but I think David Wei will be interested in io_uring
> > > support as well anyway.
> >
> > I also disagree that we need to replace a standard socket interface
> > with something "faster", in quotes.
> >
> > This interface is not the bottleneck to the target workload.
> >
> > Replacing the synchronous sockets interface with something more
> > performant for workloads where it is, is an orthogonal challenge.
> > However we do that, I think that traditional sockets should continue
> > to be supported.
> >
> > The feature may already even work with io_uring, as both recvmsg with
> > cmsg and setsockopt have io_uring support now.
>
> I'm not really concerned with faster. I would prefer something cleaner :-)
>
> Or maybe we should just have it documented. With some kind of path
> towards beautiful world where we can create dynamic queues..
I suppose we just disagree on the elegance of the API.
The concise notification API returns tokens as a range for
compression, encoding as two 32-bit unsigned integers start + length.
It allows for even further batching by returning multiple such ranges
in a single call.
This is analogous to the MSG_ZEROCOPY notification mechanism from
kernel to user.
The synchronous socket syscall interface can be replaced by something
asynchronous like io_uring. This already works today? Whatever
asynchronous ring-based API would be selected, io_uring or otherwise,
I think the concise notification encoding would remain as is.
Since this is an operation on a socket, I find a setsockopt the
fitting interface.
^ permalink raw reply
* Re: [PATCH] net: bcmasp: Use common error handling code in bcmasp_probe()
From: Jakub Kicinski @ 2023-11-06 22:58 UTC (permalink / raw)
To: Markus Elfring
Cc: Julia Lawall, David S. Miller, Eric Dumazet, Florian Fainelli,
Justin Chen, Paolo Abeni, bcm-kernel-feedback-list, netdev,
kernel-janitors, cocci, LKML
In-Reply-To: <0b2972cb-03b2-40c7-a728-6ebe2512637f@web.de>
On Sun, 5 Nov 2023 17:33:46 +0100 Markus Elfring wrote:
> Add a jump target so that a bit of exception handling can be better
> reused at the end of this function.
>
> Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
> ---
> drivers/net/ethernet/broadcom/asp2/bcmasp.c | 10 ++++++----
> 1 file changed, 6 insertions(+), 4 deletions(-)
The diffstat proves otherwise.
Please don't send such patches to networking.
^ permalink raw reply
* Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags
From: Stanislav Fomichev @ 2023-11-06 22:59 UTC (permalink / raw)
To: Mina Almasry
Cc: David Ahern, netdev, linux-kernel, linux-arch, linux-kselftest,
linux-media, dri-devel, linaro-mm-sig, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
Ilias Apalodimas, Arnd Bergmann, Willem de Bruijn, Shuah Khan,
Sumit Semwal, Christian König, Shakeel Butt, Jeroen de Borst,
Praveen Kaligineedi, Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <CAHS8izMaAhoae5ChnzO4gny1cYYnqV1cB8MC2cAF3eoyt+Sf4A@mail.gmail.com>
On 11/06, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On 11/06, Mina Almasry wrote:
> > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern <dsahern@kernel.org> wrote:
> > > >
> > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > On 11/05, Mina Almasry wrote:
> > > > >> For device memory TCP, we expect the skb headers to be available in host
> > > > >> memory for access, and we expect the skb frags to be in device memory
> > > > >> and unaccessible to the host. We expect there to be no mixing and
> > > > >> matching of device memory frags (unaccessible) with host memory frags
> > > > >> (accessible) in the same skb.
> > > > >>
> > > > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > > > >> are device memory frags or not.
> > > > >>
> > > > >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> > > > >> and marks the skb as skb->devmem accordingly.
> > > > >>
> > > > >> Add checks through the network stack to avoid accessing the frags of
> > > > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > > > >>
> > > > >> Signed-off-by: Willem de Bruijn <willemb@google.com>
> > > > >> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> > > > >> Signed-off-by: Mina Almasry <almasrymina@google.com>
> > > > >>
> > > > >> ---
> > > > >> include/linux/skbuff.h | 14 +++++++-
> > > > >> include/net/tcp.h | 5 +--
> > > > >> net/core/datagram.c | 6 ++++
> > > > >> net/core/gro.c | 5 ++-
> > > > >> net/core/skbuff.c | 77 ++++++++++++++++++++++++++++++++++++------
> > > > >> net/ipv4/tcp.c | 6 ++++
> > > > >> net/ipv4/tcp_input.c | 13 +++++--
> > > > >> net/ipv4/tcp_output.c | 5 ++-
> > > > >> net/packet/af_packet.c | 4 +--
> > > > >> 9 files changed, 115 insertions(+), 20 deletions(-)
> > > > >>
> > > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > >> --- a/include/linux/skbuff.h
> > > > >> +++ b/include/linux/skbuff.h
> > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > >> * @csum_level: indicates the number of consecutive checksums found in
> > > > >> * the packet minus one that have been verified as
> > > > >> * CHECKSUM_UNNECESSARY (max 3)
> > > > >> + * @devmem: indicates that all the fragments in this skb are backed by
> > > > >> + * device memory.
> > > > >> * @dst_pending_confirm: need to confirm neighbour
> > > > >> * @decrypted: Decrypted SKB
> > > > >> * @slow_gro: state present at GRO time, slower prepare step required
> > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > >> #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > >> __u8 csum_not_inet:1;
> > > > >> #endif
> > > > >> -
> > > > >> + __u8 devmem:1;
> > > > >> #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > > >> __u16 tc_index; /* traffic control index */
> > > > >> #endif
> > > > >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > >> __skb_zcopy_downgrade_managed(skb);
> > > > >> }
> > > > >>
> > > > >> +/* Return true if frags in this skb are not readable by the host. */
> > > > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > > > >> +{
> > > > >> + return skb->devmem;
> > > > >
> > > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > > > > It better communicates the fact that the stack shouldn't dereference the
> > > > > frags (because it has 'devmem' fragments or for some other potential
> > > > > future reason).
> > > >
> > > > +1.
> > > >
> > > > Also, the flag on the skb is an optimization - a high level signal that
> > > > one or more frags is in unreadable memory. There is no requirement that
> > > > all of the frags are in the same memory type.
> >
> > David: maybe there should be such a requirement (that they all are
> > unreadable)? Might be easier to support initially; we can relax later
> > on.
> >
>
> Currently devmem == not_readable, and the restriction is that all the
> frags in the same skb must be either all readable or all unreadable
> (all devmem or all non-devmem).
>
> > > The flag indicates that the skb contains all devmem dma-buf memory
> > > specifically, not generic 'not_readable' frags as the comment says:
> > >
> > > + * @devmem: indicates that all the fragments in this skb are backed by
> > > + * device memory.
> > >
> > > The reason it's not a generic 'not_readable' flag is because handing
> > > off a generic not_readable skb to the userspace is semantically not
> > > what we're doing. recvmsg() is augmented in this patch series to
> > > return a devmem skb to the user via a cmsg_devmem struct which refers
> > > specifically to the memory in the dma-buf. recvmsg() in this patch
> > > series is not augmented to give any 'not_readable' skb to the
> > > userspace.
> > >
> > > IMHO skb->devmem + an skb_frags_not_readable() as implemented is
> > > correct. If a new type of unreadable skbs are introduced to the stack,
> > > I imagine the stack would implement:
> > >
> > > 1. new header flag: skb->newmem
> > > 2.
> > >
> > > static inline bool skb_frags_not_readable(const struct skb_buff *skb)
> > > {
> > > return skb->devmem || skb->newmem;
> > > }
> > >
> > > 3. tcp_recvmsg_devmem() would handle skb->devmem skbs is in this patch
> > > series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.
> >
> > You copy it to the userspace in a special way because your frags
> > are page_is_page_pool_iov(). I agree with David, the skb bit is
> > just and optimization.
> >
> > For most of the core stack, it doesn't matter why your skb is not
> > readable. For a few places where it matters (recvmsg?), you can
> > double-check your frags (all or some) with page_is_page_pool_iov.
> >
>
> I see, we can do that then. I.e. make the header flag 'not_readable'
> and check the frags to decide to delegate to tcp_recvmsg_devmem() or
> something else. We can even assume not_readable == devmem because
> currently devmem is the only type of unreadable frag currently.
>
> > Unrelated: we probably need socket to dmabuf association as well (via
> > netlink or something).
>
> Not sure this is possible. The dma-buf is bound to the rx-queue, and
> any packets that land on that rx-queue are bound to that dma-buf,
> regardless of which socket that packet belongs to. So the association
> IMO must be rx-queue to dma-buf, not socket to dma-buf.
But there is still always 1 dmabuf to 1 socket association (on rx), right?
Because otherwise, there is no way currently to tell, at recvmsg, which
dmabuf the received token belongs to.
So why not have a separate control channel action to say: this socket fd
is supposed to receive into this dmabuf fd? This action would put
the socket into permanent 'MSG_SOCK_DEVMEM' mode. Maybe you can also
put some checks at the lower level to to enforce this dmabuf
association. (to avoid any potential issues with flow steering)
We'll still have dmabuf to rx-queue association because of various reasons..
^ permalink raw reply
* BPF/XDP: kernel panic when removing an interface that is an xdp_redirect target
From: Nelson, Shannon @ 2023-11-06 23:03 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
David Ahern, Jakub Kicinski, netdev, bpf, Daniel Borkmann,
Alexei Starovoitov, Andrii Nakryiko
While testing new code to support XDP in the ionic driver we found that
we could panic the kernel by running a bind/unbind loop on the target
interface of an xdp_redirect action. Obviously this is a stress test
that is abusing the system, but it does point to a window of opportunity
in bq_enqueue() and bq_xmit_all(). I believe that while the validity of
the target interface has been checked in __xdp_enqueue(), the interface
can be unbound by the time either bq_enqueue() or bq_xmit_all() tries to
use the interface. There is no locking or reference taken on the
interface to hold it in place before the target’s ndo_xdp_xmit() is called.
Below is a stack trace that our tester captured while running our test
code on a RHEL 9.2 kernel – yes, I know, unpublished driver code on a
non-upstream kernel. But if you look at the current upstream code in
kernel/bpf/devmap.c I think you can see what we ran into.
Other than telling users to not abuse the system with a bind/unbind
loop, is there something we can do to limit the potential pain here?
Without knowing what interfaces might be targeted by the users’ XDP
programs, is there a step the originating driver can do to take
precautions? Did we simply miss a step in the driver, or is this an
actual problem in the devmap code?
Thanks,
sln
[ 6118.862868] general protection fault, probably for non-canonical
address 0x696867666564659a: 0000 [#1] PREEMPT SMP NOPTI
[ 6118.863026] CPU: 2 PID: 0 Comm: swapper/2 Kdump: loaded Tainted: G
OE -------- --- 5.14.0-284.11.1.el9_2.x86_64 #1
[ 6118.863335] Hardware name: HPE ProLiant DL320 Gen11/ProLiant DL320
Gen11, BIOS 1.30 03/01/1023
[ 6118.863515] RIP: 0010:bq_xmit_all+0x8a/0x160
[ 6118.863704] Code: de 4c 89 f1 44 89 ea e8 b4 fc ff ff 89 c6 85 c0 0f
84 af 00 00 00 49 8b 86 d0 00 00 00 44 89 f9 48 89 da 4c 89 f7 89 74 24
04 <48> 8b 80 38 02 00 00 ff d0 0f 1f 00 31 c9 8b 74 24 04 85 c0 41 89
[ 6118.864137] RSP: 0018:ff733629002d8e50 EFLAGS: 00010246
[ 6118.864366] RAX: 6968676665646362 RBX: ffa53628f9ea47e8 RCX:
0000000000000001
[ 6118.864608] RDX: ffa53628f9ea47e8 RSI: 0000000000000010 RDI:
ff1fa0f2a46c8000
[ 6118.864859] RBP: ffa53628f9ea47f0 R08: ffffffffc082e700 R09:
ffffffffffffffff
[ 6118.865114] R10: 0000000000000040 R11: 0000000000000003 R12:
0000000000000010
[ 6118.865380] R13: 0000000000000010 R14: ff1fa0f2a46c8000 R15:
0000000000000001
[ 6118.865648] FS: 0000000000000000(0000) GS:ff1fa0f5af280000(0000)
knlGS:0000000000000000
[ 6118.865928] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6118.866213] CR2: 0000557586bae5e6 CR3: 000000001d410004 CR4:
0000000000771ee0
[ 6118.866509] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 6118.866811] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7:
0000000000000400
[ 6118.867117] PKRU: 55555554
[ 6118.867423] Call Trace:
[ 6118.867731] <IRQ>
[ 6118.868040] __dev_flush+0x39/0xa0
[ 6118.868358] xdp_do_flush+0xa/0x20
[ 6118.868681] ionic_txrx_napi+0x1ba/0x1f0 [ionic]
[ 6118.869024] __napi_poll+0x27/0x170
[ 6118.869357] net_rx_action+0x233/0x2f0
[ 6118.869693] __do_softirq+0xc7/0x2ac
[ 6118.870037] __irq_exit_rcu+0xb5/0xe0
[ 6118.870383] common_interrupt+0x80/0xa0
[ 6118.870735] </IRQ>
[ 6118.871082] <TASK>
[ 6118.871431] asm_common_interrupt+0x22/0x40
[ 6118.871790] RIP: 0010:cpuidle_enter_state+0xd2/0x400
[ 6118.872162] Code: 49 89 c5 0f 1f 44 00 00 31 ff e8 f9 a9 8e ff 45 84
ff 74 12 9c 58 f6 c4 02 0f 85 12 03 00 00 31 ff e8 72 b7 94 ff fb 45 85
f6 <0f> 88 15 01 00 00 49 63 d6 4c 2b 2c 24 48 8d 04 52 48 8d 04 82 49
[ 6118.872962] RSP: 0018:ff7336290010fe80 EFLAGS: 00000206
[ 6118.873377] RAX: ff1fa0f5af2aabc0 RBX: 0000000000000003 RCX:
000000000000001f
[ 6118.873804] RDX: 0000000000000000 RSI: 0000000055555555 RDI:
0000000000000000
[ 6118.874236] RBP: ffa53628faaa4400 R08: 00000590a8a55399 R09:
0000000000000001
[ 6118.874752] R10: 0000000000000400 R11: 0000000000000730 R12:
ffffffffbacace60
[ 6118.875282] R13: 00000590a8a55399 R14: 0000000000000003 R15:
0000000000000000
[ 6118.875737] cpuidle_enter+0x29/0x40
[ 6118.876198] cpuidle_idle_call+0x12c/0x1c0
[ 6118.876664] do_idle+0x7b/0xe0
[ 6118.877132] cpu_startup_entry+0x19/0x20
[ 6118.877604] start_secondary+0x116/0x140
[ 6118.878081] secondary_startup_64_no_verify+0xe5/0xeb
[ 6118.878562] </TASK>
[ 6118.879104] Modules linked in: ionic(OE) ib_core tls 8021q garp mrp
stp llc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables
nfnetlink vfat fat ipmi_ssif intel_rapl_msr intel_rapl_common nfit
libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel
pmt_crashlog pmt_telemetry pmt_class kvm irqbypass rapl intel_cstate
cdc_eem intel_uncore idxd usbnet acpi_ipmi mei_me isst_if_mbox_pci
isst_if_mmio pcspkr isst_if_common intel_vsec mii idxd_bus mei hpilo
ipmi_si ipmi_devintf ipmi_msghandler acpi_tad acpi_power_meter xfs
libcrc32c sd_mod t10_pi sg mgag200 i2c_algo_bit drm_shmem_helper
drm_kms_helper ahci syscopyarea sysfillrect sysimgblt libahci
fb_sys_fops qat_4xxx drm libata crct10dif_pclmul crc32_pclmul intel_qat
crc32c_intel tg3 ghash_clmulni_intel crc8 hpwdt wmi dm_mirror
dm_region_hash dm_log dm_mod fuse [last unloaded: ionic]
^ permalink raw reply
* Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags
From: Mina Almasry @ 2023-11-06 23:27 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: David Ahern, netdev, linux-kernel, linux-arch, linux-kselftest,
linux-media, dri-devel, linaro-mm-sig, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
Ilias Apalodimas, Arnd Bergmann, Willem de Bruijn, Shuah Khan,
Sumit Semwal, Christian König, Shakeel Butt, Jeroen de Borst,
Praveen Kaligineedi, Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <ZUlvzm24SA3YjirV@google.com>
On Mon, Nov 6, 2023 at 2:59 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On 11/06, Mina Almasry wrote:
> > On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On 11/06, Mina Almasry wrote:
> > > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern <dsahern@kernel.org> wrote:
> > > > >
> > > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > > On 11/05, Mina Almasry wrote:
> > > > > >> For device memory TCP, we expect the skb headers to be available in host
> > > > > >> memory for access, and we expect the skb frags to be in device memory
> > > > > >> and unaccessible to the host. We expect there to be no mixing and
> > > > > >> matching of device memory frags (unaccessible) with host memory frags
> > > > > >> (accessible) in the same skb.
> > > > > >>
> > > > > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > > > > >> are device memory frags or not.
> > > > > >>
> > > > > >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> > > > > >> and marks the skb as skb->devmem accordingly.
> > > > > >>
> > > > > >> Add checks through the network stack to avoid accessing the frags of
> > > > > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > > > > >>
> > > > > >> Signed-off-by: Willem de Bruijn <willemb@google.com>
> > > > > >> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> > > > > >> Signed-off-by: Mina Almasry <almasrymina@google.com>
> > > > > >>
> > > > > >> ---
> > > > > >> include/linux/skbuff.h | 14 +++++++-
> > > > > >> include/net/tcp.h | 5 +--
> > > > > >> net/core/datagram.c | 6 ++++
> > > > > >> net/core/gro.c | 5 ++-
> > > > > >> net/core/skbuff.c | 77 ++++++++++++++++++++++++++++++++++++------
> > > > > >> net/ipv4/tcp.c | 6 ++++
> > > > > >> net/ipv4/tcp_input.c | 13 +++++--
> > > > > >> net/ipv4/tcp_output.c | 5 ++-
> > > > > >> net/packet/af_packet.c | 4 +--
> > > > > >> 9 files changed, 115 insertions(+), 20 deletions(-)
> > > > > >>
> > > > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > > >> --- a/include/linux/skbuff.h
> > > > > >> +++ b/include/linux/skbuff.h
> > > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > > >> * @csum_level: indicates the number of consecutive checksums found in
> > > > > >> * the packet minus one that have been verified as
> > > > > >> * CHECKSUM_UNNECESSARY (max 3)
> > > > > >> + * @devmem: indicates that all the fragments in this skb are backed by
> > > > > >> + * device memory.
> > > > > >> * @dst_pending_confirm: need to confirm neighbour
> > > > > >> * @decrypted: Decrypted SKB
> > > > > >> * @slow_gro: state present at GRO time, slower prepare step required
> > > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > > >> #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > > >> __u8 csum_not_inet:1;
> > > > > >> #endif
> > > > > >> -
> > > > > >> + __u8 devmem:1;
> > > > > >> #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > > > >> __u16 tc_index; /* traffic control index */
> > > > > >> #endif
> > > > > >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > > >> __skb_zcopy_downgrade_managed(skb);
> > > > > >> }
> > > > > >>
> > > > > >> +/* Return true if frags in this skb are not readable by the host. */
> > > > > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > > > > >> +{
> > > > > >> + return skb->devmem;
> > > > > >
> > > > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > > > > > It better communicates the fact that the stack shouldn't dereference the
> > > > > > frags (because it has 'devmem' fragments or for some other potential
> > > > > > future reason).
> > > > >
> > > > > +1.
> > > > >
> > > > > Also, the flag on the skb is an optimization - a high level signal that
> > > > > one or more frags is in unreadable memory. There is no requirement that
> > > > > all of the frags are in the same memory type.
> > >
> > > David: maybe there should be such a requirement (that they all are
> > > unreadable)? Might be easier to support initially; we can relax later
> > > on.
> > >
> >
> > Currently devmem == not_readable, and the restriction is that all the
> > frags in the same skb must be either all readable or all unreadable
> > (all devmem or all non-devmem).
> >
> > > > The flag indicates that the skb contains all devmem dma-buf memory
> > > > specifically, not generic 'not_readable' frags as the comment says:
> > > >
> > > > + * @devmem: indicates that all the fragments in this skb are backed by
> > > > + * device memory.
> > > >
> > > > The reason it's not a generic 'not_readable' flag is because handing
> > > > off a generic not_readable skb to the userspace is semantically not
> > > > what we're doing. recvmsg() is augmented in this patch series to
> > > > return a devmem skb to the user via a cmsg_devmem struct which refers
> > > > specifically to the memory in the dma-buf. recvmsg() in this patch
> > > > series is not augmented to give any 'not_readable' skb to the
> > > > userspace.
> > > >
> > > > IMHO skb->devmem + an skb_frags_not_readable() as implemented is
> > > > correct. If a new type of unreadable skbs are introduced to the stack,
> > > > I imagine the stack would implement:
> > > >
> > > > 1. new header flag: skb->newmem
> > > > 2.
> > > >
> > > > static inline bool skb_frags_not_readable(const struct skb_buff *skb)
> > > > {
> > > > return skb->devmem || skb->newmem;
> > > > }
> > > >
> > > > 3. tcp_recvmsg_devmem() would handle skb->devmem skbs is in this patch
> > > > series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.
> > >
> > > You copy it to the userspace in a special way because your frags
> > > are page_is_page_pool_iov(). I agree with David, the skb bit is
> > > just and optimization.
> > >
> > > For most of the core stack, it doesn't matter why your skb is not
> > > readable. For a few places where it matters (recvmsg?), you can
> > > double-check your frags (all or some) with page_is_page_pool_iov.
> > >
> >
> > I see, we can do that then. I.e. make the header flag 'not_readable'
> > and check the frags to decide to delegate to tcp_recvmsg_devmem() or
> > something else. We can even assume not_readable == devmem because
> > currently devmem is the only type of unreadable frag currently.
> >
> > > Unrelated: we probably need socket to dmabuf association as well (via
> > > netlink or something).
> >
> > Not sure this is possible. The dma-buf is bound to the rx-queue, and
> > any packets that land on that rx-queue are bound to that dma-buf,
> > regardless of which socket that packet belongs to. So the association
> > IMO must be rx-queue to dma-buf, not socket to dma-buf.
>
> But there is still always 1 dmabuf to 1 socket association (on rx), right?
> Because otherwise, there is no way currently to tell, at recvmsg, which
> dmabuf the received token belongs to.
>
Yes, but this 1 dma-buf to 1 socket association happens because the
user binds the dma-buf to an rx-queue and configures flow steering of
the socket to that rx-queue.
> So why not have a separate control channel action to say: this socket fd
> is supposed to receive into this dmabuf fd?
> This action would put
> the socket into permanent 'MSG_SOCK_DEVMEM' mode. Maybe you can also
> put some checks at the lower level to to enforce this dmabuf
> association. (to avoid any potential issues with flow steering)
>
setsockopt(SO_DEVMEM_ASSERT_DMA_BUF, dmabuf_fd)? Sounds interesting,
but maybe a bit of a weird API to me. Because the API can't enforce
the socket to receive packets on a dma-buf (rx-queue binding + flow
steering does that), but the API can assert that incoming packets are
received on said dma-buf. I guess it would check packets before they
are acked and would drop packets that landed on the wrong queue.
I'm a bit unsure about defensively programming features (and uapi no
less) to 'avoid any potential issues with flow steering'. Flow
steering is supposed to work.
Also if we wanted to defensively program something to avoid flow
steering issues, then I'd suggest adding to cmsg_devmem the dma-buf fd
that the data is on, not this setsockopt() that asserts. IMO it's a
weird API for the userspace to ask the kernel to assert some condition
(at least I haven't seen it before or commonly).
But again, in general, I'm a bit unsure about defensively designing
uapi around a feature like flow steering that's supposed to work.
> We'll still have dmabuf to rx-queue association because of various reasons..
--
Thanks,
Mina
^ permalink raw reply
* Re: [PATCH bpf-next v5 00/13] xsk: TX metadata
From: Jakub Kicinski @ 2023-11-06 23:31 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, haoluo, jolsa, toke, willemb, dsahern, magnus.karlsson,
bjorn, maciej.fijalkowski, hawk, yoong.siang.song, netdev,
xdp-hints
In-Reply-To: <20231102225837.1141915-1-sdf@google.com>
On Thu, 2 Nov 2023 15:58:24 -0700 Stanislav Fomichev wrote:
> This series implements initial TX metadata (offloads) for AF_XDP.
> See patch #2 for the main implementation and mlx5/stmmac ones for the
> example on how to consume the metadata on the device side.
>
> Starting with two types of offloads:
> - request TX timestamp (and write it back into the metadata area)
> - request TX checksum offload
Acked-by: Jakub Kicinski <kuba@kernel.org>
^ permalink raw reply
* Re: [RFC PATCH v3 10/12] tcp: RX path for devmem TCP
From: Stanislav Fomichev @ 2023-11-06 23:32 UTC (permalink / raw)
To: Willem de Bruijn
Cc: Mina Almasry, netdev, linux-kernel, linux-arch, linux-kselftest,
linux-media, dri-devel, linaro-mm-sig, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
Ilias Apalodimas, Arnd Bergmann, David Ahern, Shuah Khan,
Sumit Semwal, Christian König, Shakeel Butt, Jeroen de Borst,
Praveen Kaligineedi, Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <CAF=yD-JZ88j+44MYgX-=oYJngz4Z0zw6Y0V3nHXisZJtNu7q6A@mail.gmail.com>
On Mon, Nov 6, 2023 at 2:56 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Mon, Nov 6, 2023 at 2:34 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On 11/06, Willem de Bruijn wrote:
> > > > > IMHO, we need a better UAPI to receive the tokens and give them back to
> > > > > the kernel. CMSG + setsockopt(SO_DEVMEM_DONTNEED) get the job done,
> > > > > but look dated and hacky :-(
> > > > >
> > > > > We should either do some kind of user/kernel shared memory queue to
> > > > > receive/return the tokens (similar to what Jonathan was doing in his
> > > > > proposal?)
> > > >
> > > > I'll take a look at Jonathan's proposal, sorry, I'm not immediately
> > > > familiar but I wanted to respond :-) But is the suggestion here to
> > > > build a new kernel-user communication channel primitive for the
> > > > purpose of passing the information in the devmem cmsg? IMHO that seems
> > > > like an overkill. Why add 100-200 lines of code to the kernel to add
> > > > something that can already be done with existing primitives? I don't
> > > > see anything concretely wrong with cmsg & setsockopt approach, and if
> > > > we switch to something I'd prefer to switch to an existing primitive
> > > > for simplicity?
> > > >
> > > > The only other existing primitive to pass data outside of the linear
> > > > buffer is the MSG_ERRQUEUE that is used for zerocopy. Is that
> > > > preferred? Any other suggestions or existing primitives I'm not aware
> > > > of?
> > > >
> > > > > or bite the bullet and switch to io_uring.
> > > > >
> > > >
> > > > IMO io_uring & socket support are orthogonal, and one doesn't preclude
> > > > the other. As you know we like to use sockets and I believe there are
> > > > issues with io_uring adoption at Google that I'm not familiar with
> > > > (and could be wrong). I'm interested in exploring io_uring support as
> > > > a follow up but I think David Wei will be interested in io_uring
> > > > support as well anyway.
> > >
> > > I also disagree that we need to replace a standard socket interface
> > > with something "faster", in quotes.
> > >
> > > This interface is not the bottleneck to the target workload.
> > >
> > > Replacing the synchronous sockets interface with something more
> > > performant for workloads where it is, is an orthogonal challenge.
> > > However we do that, I think that traditional sockets should continue
> > > to be supported.
> > >
> > > The feature may already even work with io_uring, as both recvmsg with
> > > cmsg and setsockopt have io_uring support now.
> >
> > I'm not really concerned with faster. I would prefer something cleaner :-)
> >
> > Or maybe we should just have it documented. With some kind of path
> > towards beautiful world where we can create dynamic queues..
>
> I suppose we just disagree on the elegance of the API.
Yeah, I might be overly sensitive to the apis that use get/setsockopt
for something more involved than setting a flag.
Probably because I know that bpf will (unnecessarily) trigger on these :-D
I had to implement that bpf "bypass" (or fastpath) for
TCP_ZEROCOPY_RECEIVE and it looks like this token recycle might also
benefit from something similar.
> The concise notification API returns tokens as a range for
> compression, encoding as two 32-bit unsigned integers start + length.
> It allows for even further batching by returning multiple such ranges
> in a single call.
Tangential: should tokens be u64? Otherwise we can't have more than
4gb unacknowledged. Or that's a reasonable constraint?
> This is analogous to the MSG_ZEROCOPY notification mechanism from
> kernel to user.
>
> The synchronous socket syscall interface can be replaced by something
> asynchronous like io_uring. This already works today? Whatever
> asynchronous ring-based API would be selected, io_uring or otherwise,
> I think the concise notification encoding would remain as is.
>
> Since this is an operation on a socket, I find a setsockopt the
> fitting interface.
^ permalink raw reply
* Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags
From: David Ahern @ 2023-11-06 23:37 UTC (permalink / raw)
To: Mina Almasry, Stanislav Fomichev
Cc: netdev, linux-kernel, linux-arch, linux-kselftest, linux-media,
dri-devel, linaro-mm-sig, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
Ilias Apalodimas, Arnd Bergmann, Willem de Bruijn, Shuah Khan,
Sumit Semwal, Christian König, Shakeel Butt, Jeroen de Borst,
Praveen Kaligineedi, Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <CAHS8izMaAhoae5ChnzO4gny1cYYnqV1cB8MC2cAF3eoyt+Sf4A@mail.gmail.com>
On 11/6/23 3:18 PM, Mina Almasry wrote:
>>>>>> @@ -991,7 +993,7 @@ struct sk_buff {
>>>>>> #if IS_ENABLED(CONFIG_IP_SCTP)
>>>>>> __u8 csum_not_inet:1;
>>>>>> #endif
>>>>>> -
>>>>>> + __u8 devmem:1;
>>>>>> #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
>>>>>> __u16 tc_index; /* traffic control index */
>>>>>> #endif
>>>>>> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
>>>>>> __skb_zcopy_downgrade_managed(skb);
>>>>>> }
>>>>>>
>>>>>> +/* Return true if frags in this skb are not readable by the host. */
>>>>>> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
>>>>>> +{
>>>>>> + return skb->devmem;
>>>>>
>>>>> bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
>>>>> It better communicates the fact that the stack shouldn't dereference the
>>>>> frags (because it has 'devmem' fragments or for some other potential
>>>>> future reason).
>>>>
>>>> +1.
>>>>
>>>> Also, the flag on the skb is an optimization - a high level signal that
>>>> one or more frags is in unreadable memory. There is no requirement that
>>>> all of the frags are in the same memory type.
>>
>> David: maybe there should be such a requirement (that they all are
>> unreadable)? Might be easier to support initially; we can relax later
>> on.
>>
>
> Currently devmem == not_readable, and the restriction is that all the
> frags in the same skb must be either all readable or all unreadable
> (all devmem or all non-devmem).
What requires that restriction? In all of the uses of skb->devmem and
skb_frags_not_readable() what matters is if any frag is not readable,
then frag list walk or collapse is avoided.
^ permalink raw reply
* Re: [PATCH bpf 1/6] netkit: Add tstats per-CPU traffic counters
From: Daniel Borkmann @ 2023-11-06 23:42 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: martin.lau, netdev, bpf, Nikolay Aleksandrov
In-Reply-To: <20231106132845.6356bc72@kernel.org>
On 11/6/23 10:28 PM, Jakub Kicinski wrote:
> On Fri, 3 Nov 2023 23:27:43 +0100 Daniel Borkmann wrote:
>> Add dev->tstats traffic accounting to netkit. The latter contains per-CPU
>> RX and TX counters.
>>
>> The dev's TX counters are bumped upon pass/unspec as well as redirect
>> verdicts, in other words, on everything except for drops.
>>
>> The dev's RX counters are bumped upon successful __netif_rx(), as well
>> as from skb_do_redirect() (not part of this commit here).
>>
>> Using dev->lstats with having just a single packets/bytes counter and
>> inferring one another's RX counters from the peer dev's lstats is not
>> possible given skb_do_redirect() can also bump the device's stats.
>
> sorry for the delay in replying, I'll comment here instead of on:
>
> https://lore.kernel.org/all/6d5cb0ef-fabc-7ca3-94b2-5b1925a6805f@iogearbox.net/
>
> What I had in mind was to have the driver just set the type of stats.
> That way it doesn't have to bother with error handling either
> (allocation failure checking, making sure free happens in the right
> spot etc. all happen in the core). Here's a completely untested diff:
Ah perfect, thanks! I'll take a look and integrate this into a v2 this
week if that's okay with you. And add sth to bail out if the ndo is in
place and NETDEV_PCPU_STAT_TSTAT not selected for the time being.
Thanks,
Daniel
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox