* Re: [PATCH net-next] net: core: Quiet W=1 warnings for unused vars and static functions
From: Eric Dumazet @ 2014-10-06 22:02 UTC (permalink / raw)
To: Joe Perches; +Cc: netdev, LKML, John Fastabend
In-Reply-To: <1412632298.2916.42.camel@joe-AO725>
On Mon, 2014-10-06 at 14:51 -0700, Joe Perches wrote:
> Reduce noise when compiling W=1.
>
> All the variables are unused.
> The functions are not called outside of the file so static
> is preferred.
>
> Signed-off-by: Joe Perches <joe@perches.com>
> ---
>
> John, can you please verify that these gen_stats accesses
> are unnecessary? I believe the compiler can elide them in
> any case, but I'm not sure what you intended here.
...
> diff --git a/net/core/gen_stats.c b/net/core/gen_stats.c
> index 14681b9..01be9cf 100644
> --- a/net/core/gen_stats.c
> +++ b/net/core/gen_stats.c
> @@ -106,13 +106,9 @@ __gnet_stats_copy_basic_cpu(struct gnet_stats_basic_packed *bstats,
> for_each_possible_cpu(i) {
> struct gnet_stats_basic_cpu *bcpu = per_cpu_ptr(cpu, i);
> unsigned int start;
> - __u64 bytes;
> - __u32 packets;
>
> do {
> start = u64_stats_fetch_begin_irq(&bcpu->syncp);
> - bytes = bcpu->bstats.bytes;
> - packets = bcpu->bstats.packets;
> } while (u64_stats_fetch_retry_irq(&bcpu->syncp, start));
>
Well... Please fix the bug for real.
> bstats->bytes += bcpu->bstats.bytes;
->
bstats->bytes += bytes;
bstats->packets += packets;
^ permalink raw reply
* Re: [PATCH net-next] net: core: Quiet W=1 warnings for unused vars and static functions
From: David Miller @ 2014-10-06 21:56 UTC (permalink / raw)
To: joe; +Cc: netdev, linux-kernel, john.fastabend
In-Reply-To: <1412632298.2916.42.camel@joe-AO725>
From: Joe Perches <joe@perches.com>
Date: Mon, 06 Oct 2014 14:51:38 -0700
> Reduce noise when compiling W=1.
>
> All the variables are unused.
> The functions are not called outside of the file so static
> is preferred.
>
> Signed-off-by: Joe Perches <joe@perches.com>
> ---
>
> John, can you please verify that these gen_stats accesses
> are unnecessary? I believe the compiler can elide them in
> any case, but I'm not sure what you intended here.
BTW, this patch reminds me that if people think there are
subdirectories where we can turn on things like -Werror in the
networking I would be very happy to apply such patches.
Things like arch/sparc has had this for years, I even forget when I
added it. :-)
Things like net/core/ for example should be doable for sure.
^ permalink raw reply
* Re: [PATCH net-next] net: introduce netdevice gso_min_segs attribute
From: David Miller @ 2014-10-06 21:54 UTC (permalink / raw)
To: eric.dumazet; +Cc: amirv, edumazet, netdev, yevgenyp, ogerlitz, idos
In-Reply-To: <1412631778.11091.84.camel@edumazet-glaptop2.roam.corp.google.com>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 06 Oct 2014 14:42:58 -0700
> On Mon, 2014-10-06 at 17:21 -0400, David Miller wrote:
>
>> So exactly what value are you using for mlx4?
>>
>
> It seems that on ConnectX-3 family, TSO packets of 2 or 3 MSS are not
> worth using TSO engine. The cutoff point seems to be 4 (same throughput)
>
> So I was planning to use gso_min_segs = 4 only for them.
>
>> Because I wonder if we should just generically forfeit TSO unless
>> we have > 2 segments, for example.
>
> When I tested on bnx2x, this was not a gain.
>
> bnx2x is faster sending TSO packets, even if they have 2 MSS.
>
> I'll try the experiment on I40E Intel cards.
Ok I'm sold on your patch then if two major chipsets already benefit
from differing values.
I'll apply this, thanks Eric.
^ permalink raw reply
* Re: [PATCH] net: Add ndo_gso_check
From: Jesse Gross @ 2014-10-06 21:51 UTC (permalink / raw)
To: Tom Herbert
Cc: Or Gerlitz, Alexander Duyck, John Fastabend, Jeff Kirsher,
David Miller, Linux Netdev List, Thomas Graf, Pravin Shelar,
Andy Zhou
In-Reply-To: <CA+mtBx8-w74f_f9JEfpwH9=Mjr_zaJJAYCMMdYAWdCy-51pzqg@mail.gmail.com>
On Sun, Oct 5, 2014 at 11:49 AM, Tom Herbert <therbert@google.com> wrote:
> On Sun, Oct 5, 2014 at 7:04 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>> On Thu, Oct 2, 2014 at 2:06 AM, Tom Herbert <therbert@google.com> wrote:
>>> On Wed, Oct 1, 2014 at 1:58 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>>>> On Tue, Sep 30, 2014 at 6:34 PM, Tom Herbert <therbert@google.com> wrote:
>>>>> On Tue, Sep 30, 2014 at 7:30 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>> [...]
>>> Solution #4: apply this patch and implement the check functions as
>>> needed in those 4 or 5 drivers. If a device can only do VXLAN/NVGRE
>>> then I believe the check function is something like:
>>>
>>> bool mydev_gso_check(struct sk_buff *skb, struct net_device *dev)
>>> {
>>> if ((skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL) &&
>>> ((skb->inner_protocol_type != ENCAP_TYPE_ETHER ||
>>> skb->protocol != htons(ETH_P_TEB) ||
>>> skb_inner_mac_header(skb) - skb_transport_header(skb) != 12)
>>> return false;
>>>
>>> return true;
>>> }
>>
>> Yep, such helper can can be basically made to work and let the 4-5
>> drivers that can
>> do GSO offloading for vxlan but not for any FOU/GUE packets signal
>> that to the stack.
>>
>> Re the 12 constant, you were referring to the udp+vxlan headers? it's 8+8
>>
>> Also, we need a way for drivers that can support VXLAN or NVGRE but
>> not concurrently
>> on the same port @ the same time to only let vxlan packet to pass
>> successfully through the helper.
>>
> Or, there should be no difference in GSO processing between VXLAN and
> NVGRE. Can you explain why you feel you need to differentiate them for
> GSO?
There is a difference in the processing that needs to happen for VXLAN
and GRE, even on transmit: at a minimum, the length field in the UDP
header needs to be updated for VXLAN. These are already broken out in
the stack between GRE and UDP tunnels anyways though.
^ permalink raw reply
* Re: [PATCH net-next v2] r8152: nway reset after setting eee
From: David Miller @ 2014-10-06 21:51 UTC (permalink / raw)
To: hayeswang; +Cc: netdev, nic_swsd, linux-kernel, linux-usb
In-Reply-To: <1394712342-15778-57-Taiwan-albertk@realtek.com>
From: Hayes Wang <hayeswang@realtek.com>
Date: Mon, 6 Oct 2014 10:36:04 +0800
> Restart autonegotiation is necessary after setting EEE.
>
> Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Applied, thanks.
^ permalink raw reply
* [PATCH net-next] net: core: Quiet W=1 warnings for unused vars and static functions
From: Joe Perches @ 2014-10-06 21:51 UTC (permalink / raw)
To: netdev; +Cc: LKML, John Fastabend
Reduce noise when compiling W=1.
All the variables are unused.
The functions are not called outside of the file so static
is preferred.
Signed-off-by: Joe Perches <joe@perches.com>
---
John, can you please verify that these gen_stats accesses
are unnecessary? I believe the compiler can elide them in
any case, but I'm not sure what you intended here.
net/core/dev.c | 4 ++--
net/core/gen_stats.c | 4 ----
net/core/rtnetlink.c | 3 +--
3 files changed, 3 insertions(+), 8 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 1a90530..2049a17 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5239,7 +5239,7 @@ void netdev_upper_dev_unlink(struct net_device *dev,
}
EXPORT_SYMBOL(netdev_upper_dev_unlink);
-void netdev_adjacent_add_links(struct net_device *dev)
+static void netdev_adjacent_add_links(struct net_device *dev)
{
struct netdev_adjacent *iter;
@@ -5264,7 +5264,7 @@ void netdev_adjacent_add_links(struct net_device *dev)
}
}
-void netdev_adjacent_del_links(struct net_device *dev)
+static void netdev_adjacent_del_links(struct net_device *dev)
{
struct netdev_adjacent *iter;
diff --git a/net/core/gen_stats.c b/net/core/gen_stats.c
index 14681b9..01be9cf 100644
--- a/net/core/gen_stats.c
+++ b/net/core/gen_stats.c
@@ -106,13 +106,9 @@ __gnet_stats_copy_basic_cpu(struct gnet_stats_basic_packed *bstats,
for_each_possible_cpu(i) {
struct gnet_stats_basic_cpu *bcpu = per_cpu_ptr(cpu, i);
unsigned int start;
- __u64 bytes;
- __u32 packets;
do {
start = u64_stats_fetch_begin_irq(&bcpu->syncp);
- bytes = bcpu->bstats.bytes;
- packets = bcpu->bstats.packets;
} while (u64_stats_fetch_retry_irq(&bcpu->syncp, start));
bstats->bytes += bcpu->bstats.bytes;
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index a688268..c2fe350 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2917,7 +2917,7 @@ static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
{
struct net *net = sock_net(skb->sk);
rtnl_doit_func doit;
- int sz_idx, kind;
+ int kind;
int family;
int type;
int err;
@@ -2933,7 +2933,6 @@ static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
return 0;
family = ((struct rtgenmsg *)nlmsg_data(nlh))->rtgen_family;
- sz_idx = type>>2;
kind = type&3;
if (kind != 2 && !netlink_net_capable(skb, CAP_NET_ADMIN))
^ permalink raw reply related
* Re: [PATCH v2 net-next] net: better IFF_XMIT_DST_RELEASE support
From: David Miller @ 2014-10-06 21:50 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev, ja
In-Reply-To: <1412559515.11091.46.camel@edumazet-glaptop2.roam.corp.google.com>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 05 Oct 2014 18:38:35 -0700
> @@ -1001,7 +1001,8 @@ static netdev_features_t bond_fix_features(struct net_device *dev,
>
> static void bond_compute_features(struct bonding *bond)
> {
> - unsigned int flags, dst_release_flag = IFF_XMIT_DST_RELEASE;
> + unsigned int dst_release_flag = IFF_XMIT_DST_RELEASE |
> + IFF_XMIT_DST_RELEASE_PERM;
> netdev_features_t vlan_features = BOND_VLAN_FEATURES;
> netdev_features_t enc_features = BOND_ENC_FEATURES;
> struct net_device *bond_dev = bond->dev;
> @@ -1037,8 +1038,10 @@ done:
> bond_dev->gso_max_segs = gso_max_segs;
> netif_set_gso_max_size(bond_dev, gso_max_size);
>
> - flags = bond_dev->priv_flags & ~IFF_XMIT_DST_RELEASE;
> - bond_dev->priv_flags = flags | dst_release_flag;
> + bond_dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
> + if ((bond_dev->priv_flags & IFF_XMIT_DST_RELEASE_PERM) &&
> + dst_release_flag == (IFF_XMIT_DST_RELEASE | IFF_XMIT_DST_RELEASE_PERM))
> + bond_dev->priv_flags |= IFF_XMIT_DST_RELEASE;
I think I might be missing something, but in all of these places where
you add this logic, it looks to me like:
dst_release_flag = CONSTANT;
...
if (... &&
dst_release_flags == CONSTANT)
This 'dst_release_flag' variable never changes, so why bother with the
test at all?
Maybe I'm just being dense today...
^ permalink raw reply
* Re: [PATCH v2 net-next 15/15] tipc: remove old ASCII netlink API
From: Jon Paul Maloy @ 2014-10-06 21:47 UTC (permalink / raw)
To: jon.maloy@ericsson.com, richard.alpe@ericsson.com,
erik.hugne@ericsson.com, ying.xue@windriver.com
Cc: netdev@vger.kernel.org, tipc-discussion@lists.sourceforge.net
In-Reply-To: <20141006.152003.1261907320590776614.davem@davemloft.net>
I sort of expected that answer. Just resend the other ones so we get them in now (we are at rc7+). We can try to figure out if we can do a kernel-internal translation later.
///jon
Sent from Yahoo Mail on Android
------------------------------------------------------------------------------
Slashdot TV. Videos for Nerds. Stuff that Matters.
http://pubads.g.doubleclick.net/gampad/clk?id=160591471&iu=/4140/ostg.clktrk
^ permalink raw reply
* Re: [PATCH net-next] net: introduce netdevice gso_min_segs attribute
From: Eric Dumazet @ 2014-10-06 21:42 UTC (permalink / raw)
To: David Miller; +Cc: amirv, edumazet, netdev, yevgenyp, ogerlitz, idos
In-Reply-To: <20141006.172149.1596496098013953331.davem@davemloft.net>
On Mon, 2014-10-06 at 17:21 -0400, David Miller wrote:
> So exactly what value are you using for mlx4?
>
It seems that on ConnectX-3 family, TSO packets of 2 or 3 MSS are not
worth using TSO engine. The cutoff point seems to be 4 (same throughput)
So I was planning to use gso_min_segs = 4 only for them.
> Because I wonder if we should just generically forfeit TSO unless
> we have > 2 segments, for example.
When I tested on bnx2x, this was not a gain.
bnx2x is faster sending TSO packets, even if they have 2 MSS.
I'll try the experiment on I40E Intel cards.
^ permalink raw reply
* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
From: David Miller @ 2014-10-06 21:42 UTC (permalink / raw)
To: john.fastabend
Cc: dborkman, fw, gerlitz.or, hannes, netdev, john.ronciak, amirv,
eric.dumazet, danny.zhou
In-Reply-To: <20141006000629.32055.2295.stgit@nitbit.x32>
From: John Fastabend <john.fastabend@gmail.com>
Date: Sun, 05 Oct 2014 17:06:31 -0700
> This patch adds a net_device ops to split off a set of driver queues
> from the driver and map the queues into user space via mmap. This
> allows the queues to be directly manipulated from user space. For
> raw packet interface this removes any overhead from the kernel network
> stack.
About the facility in general, I am generally in favor, as I expressed
at the networking track in Chiacgo.
But you missed the mark wrt. describing the descriptors.
I do not want you to give device IDs.
I want the code to be %100 agnostic to device or vendor IDs.
Really "describe" the descriptor. Not just how large is it (32-bits,
64-bits, etc.), but also: 1) is it little or big endian 2) where is
the length field 3) where is control bit "foo" located, etc.
That's what I want to see in "struct tpacket_dev_info", rather than
device IDs and "versions".
^ permalink raw reply
* Re: [PATCH] net: Add ndo_gso_check
From: Tom Herbert @ 2014-10-06 21:28 UTC (permalink / raw)
To: Or Gerlitz
Cc: Alexander Duyck, John Fastabend, Jeff Kirsher, David Miller,
Linux Netdev List, Thomas Graf, Pravin Shelar, Andy Zhou
In-Reply-To: <CAJ3xEMiLsQ=Wd=461L1Wx0jQ+zivY-Z573+90dD8d7pdUD4Skw@mail.gmail.com>
On Mon, Oct 6, 2014 at 1:22 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
> On Mon, Oct 6, 2014 at 8:59 PM, Tom Herbert <therbert@google.com> wrote:
>> On Sun, Oct 5, 2014 at 12:13 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>>> RX wise, Linux tells the driver that UDP port X would be used for
>>> VXLAN, right? and indeed, it's possible for some HW implementations
>>> not to support RX offloading (checksum) for both VXLAN and NVGRE @ the
>>> same time over the same port. But TX/GRO wise, you're probably
>>> correct. The thing is that from the user POV they need solution that
>>> works for both RX and TX offloading.
>
>> I think from a user POV we want a solution that supports RX and TX
>> offloading across the widest range of protocols. This is accomplished
>> by implementing protocol agnostic mechanisms like CHECKSUM_COMPLETE
>> and protocol agnostic UDP tunnel TSO like we've described. IMO, the
>> fact that we have devices that implement protocol specific mechanisms
>> for NVGRE and VXLAN should be considered legacy support in the stack,
>> for new UDP encapsulation protocols we should not expose specifics in
>> the stack in either by adding a GSO type for each protocol, nor
>> ndo_add_foo_port for each protocol-- these things will not scale and
>> unnecessarily complicate the core stack.
>
> I tend to generally agree to the wind that blows from your writeup, namely:
>
> UDP encapsulation offloads wise, we should pose few general
> requirements to NICs to be implemented by vendors in their tomorrow's
> HW and treat the current generation (these 4-5 drivers with their
> limitations as legacy which should be supported but not state the
> stack overall design).
>
> Still we should seek more ways to reduce the pain/amount of
> not-well-defined-configurations when these drivers are there and the
> stack goes through this upside-down turnaround changes. OTOH you
> didn't accept my SKB coloring suggestion for GSO inspection, and OTOH
> I guess we can live with some sort of generic helper in the form of
> what you suggested, but like it or not, getting rid of
> ndo_add_vxlan_port will simply break things out.
>
> Are we going to have a session on the encapsulation/offloads design @ LPC?
>
yes, I will talk about FOU and GUE implementation. You should
abstracts in the schedule now.
> I think a replay of your LKS presentation along with open discussion
> on how to get there with the legacy requirements could be very
> helpful.
>
>
> Or.
^ permalink raw reply
* Re: [PATCH net-next] net: introduce netdevice gso_min_segs attribute
From: David Miller @ 2014-10-06 21:21 UTC (permalink / raw)
To: eric.dumazet; +Cc: amirv, edumazet, netdev, yevgenyp, ogerlitz, idos
In-Reply-To: <1412529087.11091.14.camel@edumazet-glaptop2.roam.corp.google.com>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 05 Oct 2014 10:11:27 -0700
> From: Eric Dumazet <edumazet@google.com>
>
> Some TSO engines might have a too heavy setup cost, that impacts
> performance on hosts sending small bursts (2 MSS per packet).
>
> This patch adds a device gso_min_segs, allowing drivers to set
> a minimum segment size for TSO packets, according to the NIC
> performance.
>
> Tested on a mlx4 NIC, this allows to get a ~110% increase of
> throughput when sending 2 MSS per packet.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
So exactly what value are you using for mlx4?
Because I wonder if we should just generically forfeit TSO unless
we have > 2 segments, for example.
^ permalink raw reply
* Re: [PATCH net-next] ipv4: igmp: fix v3 general query drop monitor false positive
From: David Miller @ 2014-10-06 21:15 UTC (permalink / raw)
To: dborkman; +Cc: edumazet, netdev
In-Reply-To: <1412522870-26335-1-git-send-email-dborkman@redhat.com>
From: Daniel Borkmann <dborkman@redhat.com>
Date: Sun, 5 Oct 2014 17:27:50 +0200
> In case we find a general query with non-zero number of sources, we
> are dropping the skb as it's malformed.
>
> RFC3376, section 4.1.8. Number of Sources (N):
>
> This number is zero in a General Query or a Group-Specific Query,
> and non-zero in a Group-and-Source-Specific Query.
>
> Therefore, reflect that by using kfree_skb() instead of consume_skb().
>
> Fixes: d679c5324d9a ("igmp: avoid drop_monitor false positives")
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Applied, thanks Daniel.
^ permalink raw reply
* Re: [PATCH v8 net-next 2/2] bonding: Simplify the xmit function for modes that use xmit_hash
From: David Miller @ 2014-10-06 21:13 UTC (permalink / raw)
To: maheshb; +Cc: j.vosburgh, andy, vfalico, nikolay, netdev, edumazet, maze
In-Reply-To: <1412469901-27451-1-git-send-email-maheshb@google.com>
From: Mahesh Bandewar <maheshb@google.com>
Date: Sat, 4 Oct 2014 17:45:01 -0700
> Earlier change to use usable slave array for TLB mode had an additional
> performance advantage. So extending the same logic to all other modes
> that use xmit-hash for slave selection (viz 802.3AD, and XOR modes).
> Also consolidating this with the earlier TLB change.
>
> The main idea is to build the usable slaves array in the control path
> and use that array for slave selection during xmit operation.
>
> Measured performance in a setup with a bond of 4x1G NICs with 200
> instances of netperf for the modes involved (3ad, xor, tlb)
> cmd: netperf -t TCP_RR -H <TargetHost> -l 60 -s 5
>
> Mode TPS-Before TPS-After
>
> 802.3ad : 468,694 493,101
> TLB (lb=0): 392,583 392,965
> XOR : 475,696 484,517
>
> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
> Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Applied.
^ permalink raw reply
* Re: [PATCH v8 net-next 1/2] bonding: display xmit_hash_policy for non-dynamic-tlb mode
From: David Miller @ 2014-10-06 21:13 UTC (permalink / raw)
To: maheshb; +Cc: j.vosburgh, andy, vfalico, nikolay, netdev, edumazet, maze
In-Reply-To: <1412469884-27308-1-git-send-email-maheshb@google.com>
From: Mahesh Bandewar <maheshb@google.com>
Date: Sat, 4 Oct 2014 17:44:44 -0700
> It's a trivial fix to display xmit_hash_policy for this new TLB mode
> since it uses transmit-hash-poilicy as part of bonding-master info
> (/proc/net/bonding/<bonding-interface).
>
> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
> Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
Applied.
^ permalink raw reply
* Re: [PATCH RFC net 0/2] ipv6: Avoid restarting fib6_lookup() for RTF_CACHE hit
From: David Miller @ 2014-10-06 21:10 UTC (permalink / raw)
To: kafai; +Cc: netdev, hannes
In-Reply-To: <1412374332-26657-1-git-send-email-kafai@fb.com>
From: Martin KaFai Lau <kafai@fb.com>
Date: Fri, 3 Oct 2014 15:12:10 -0700
> I am trying to understand why there is a need to restart fib6_lookup() after
> getting rt with RTF_CACHE.
>
> I have adapted the davem's udpflood test
> (https://git.kernel.org/pub/scm/linux/kernel/git/davem/net_test_tools.git) to
> support IPv6 and here is the result:
>
> #root > time ./udpflood -l 20000000 -c 250 2401:db00:face:face::2
>
> Before:
> real 0m33.224s
> user 0m2.941s
> sys 0m30.232s
>
> After:
> real 0m31.517s
> user 0m2.938s
> sys 0m28.536s
If you are serious about seeing these patches integrated, you must
freshly repost this series and provide a proper "Signed-off-by: " tag
for yourself in the commit messages.
^ permalink raw reply
* Re: [Xen-devel] [PATCHv1] xen-netfront: always keep the Rx ring full of requests
From: David Miller @ 2014-10-06 21:07 UTC (permalink / raw)
To: annie.li; +Cc: david.vrabel, netdev, boris.ostrovsky, xen-devel
In-Reply-To: <5432E26C.3020207@oracle.com>
From: annie li <annie.li@oracle.com>
Date: Mon, 06 Oct 2014 14:41:48 -0400
>
> On 2014/10/6 12:00, David Vrabel wrote:
>>>> + queue->rx.req_prod_pvt = req_prod;
>>>> +
>>>> + /* Not enough requests? Try again later. */
>>>> + if (req_prod - queue->rx.rsp_cons < NET_RX_SLOTS_MIN) {
>>>> + mod_timer(&queue->rx_refill_timer, jiffies + (HZ/10));
>>>> + return;
>>> If the previous for loop breaks because of failure of
>>> xennet_alloc_one_rx_buffer, then notify_remote_via_irq is missed here
>>> if
>>> the code returns directly.
>> This is deliberate -- there's no point notifying the backend if there
>> aren't enough requests for the next packet. Since we don't know what
>> the next packet might be we assume it's the largest possible.
> That makes sense.
> However, the largest packet case does not happen so
> frequently. Moreover, netback checks the slots every incoming skb
> requires in xenvif_rx_ring_slots_available, not only concerning the
> largest case.
I have an opinion about the sysfs stuff.
It's user facing, so even if it doesn't influence behavior any more
you have to keep the files around, just make them nops.
^ permalink raw reply
* Re: [PATCH v2 7/7] driver-core: add preferred async probe option for built-in and modules
From: Tejun Heo @ 2014-10-06 21:01 UTC (permalink / raw)
To: Luis R. Rodriguez
Cc: Luis R. Rodriguez, gregkh, dmitry.torokhov, tiwai, arjan, teg,
rmilasan, werner, oleg, hare, bpoirier, santosh, pmladek, dbueso,
linux-kernel, Tetsuo Handa, Joseph Salisbury, Kay Sievers,
One Thousand Gnomes, Tim Gardner, Pierre Fersing, Andrew Morton,
Nagalakshmi Nandigama, Praveen Krishnamoorthy, Sreekanth Reddy,
Abhijit Mahajan, Casey Leedom
In-Reply-To: <20141006203627.GZ14081@wotan.suse.de>
Hello,
On Mon, Oct 06, 2014 at 10:36:27PM +0200, Luis R. Rodriguez wrote:
> > Do we intend to keep this param permanently? Isn't this more of a
> > temp tool to be used during development? If so, maybe we should make
> > that clear with __DEVEL__ too?
>
> As its designed right now no, its not a temp tool, its there to
> require compatibility with old userspace. For modules we can require
> the module parameter but for built-in we need something else and this
> is what came to mind. It is also what would allow the prefer_async_probe
> flag too as otherwise we won't know if userspace is prepared.
I don't get it. For in-kernel stuff, we already have a clear
synchronization point where we already synchronize all async calls.
Shouldn't we be flushing these async probes there too? insmod'ing is
userland visible but there's no reason this has to be for the built-in
drivers.
Thanks.
--
tejun
^ permalink raw reply
* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
From: John Fastabend @ 2014-10-06 20:42 UTC (permalink / raw)
To: Stephen Hemminger
Cc: dborkman, fw, gerlitz.or, hannes, netdev, john.ronciak, amirv,
eric.dumazet, danny.zhou
In-Reply-To: <20141006095553.18097d68@urahara>
On 10/06/2014 09:55 AM, Stephen Hemminger wrote:
> On Sun, 05 Oct 2014 17:06:31 -0700
> John Fastabend <john.fastabend@gmail.com> wrote:
>
>> This patch adds a net_device ops to split off a set of driver queues
>> from the driver and map the queues into user space via mmap. This
>> allows the queues to be directly manipulated from user space. For
>> raw packet interface this removes any overhead from the kernel network
>> stack.
>>
>> Typically in an af_packet interface a packet_type handler is
>> registered and used to filter traffic to the socket and do other
>> things such as fan out traffic to multiple sockets. In this case the
>> networking stack is being bypassed so this code is not run. So the
>> hardware must push the correct traffic to the queues obtained from
>> the ndo callback ndo_split_queue_pairs().
>>
>> Fortunately there is already a flow classification interface which
>> is part of the ethtool command set, ETHTOOL_SRXCLSRLINS. It is
>> currently supported by multiple drivers including sfc, mlx4, niu,
>> ixgbe, and i40e. Supporting some way to steer traffic to a queue
>> is the _only_ hardware requirement to support the interface, plus
>> the driver needs to implement the correct ndo ops. A follow on
>> patch adds support for ixgbe but we expect at least the subset of
>> drivers implementing ETHTOOL_SRXCLSRLINS to be implemented later.
>>
>> The interface is driven over an af_packet socket which we believe
>> is the most natural interface to use. Because it is already used
>> for raw packet interfaces which is what we are providing here.
>> The high level flow for this interface looks like:
>>
>> bind(fd, &sockaddr, sizeof(sockaddr));
>>
>> /* Get the device type and info */
>> getsockopt(fd, SOL_PACKET, PACKET_DEV_DESC_INFO, &def_info,
>> &optlen);
>>
>> /* With device info we can look up descriptor format */
>>
>> /* Get the layout of ring space offset, page_sz, cnt */
>> getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
>> &info, &optlen);
>>
>> /* request some queues from the driver */
>> setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>> &qpairs_info, sizeof(qpairs_info));
>>
>> /* if we let the driver pick us queues learn which queues
>> * we were given
>> */
>> getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>> &qpairs_info, sizeof(qpairs_info));
>>
>> /* And mmap queue pairs to user space */
>> mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
>> MAP_SHARED, fd, 0);
>>
>> /* Now we have some user space queues to read/write to*/
>>
>> There is one critical difference when running with these interfaces
>> vs running without them. In the normal case the af_packet module
>> uses a standard descriptor format exported by the af_packet user
>> space headers. In this model because we are working directly with
>> driver queues the descriptor format maps to the descriptor format
>> used by the device. User space applications can learn device
>> information from the socket option PACKET_DEV_DESC_INFO which
>> should provide enough details to extrapulate the descriptor formats.
>> Although this adds some complexity to user space it removes the
>> requirement to copy descriptor fields around.
>>
>> The formats are usually provided by the device vendor documentation
>> If folks want I can provide a follow up patch to provide the formats
>> in a .h file in ./include/uapi/linux/ for ease of use. I have access
>> to formats for ixgbe and mlx drivers other driver owners would need to
>> provide their formats.
>>
>> We tested this interface using traffic generators and doing basic
>> L2 forwarding tests on ixgbe devices. Our tests use a set of patches
>> to DPDK to enable an interface using this socket interfaace. With
>> this interface we can xmit/receive @ line rate from a test user space
>> application on a single core.
>>
>> Additionally we have a set of DPDK patches to enable DPDK with this
>> interface. DPDK can be downloaded @ dpdk.org although as I hope is
>> clear from above DPDK is just our paticular test environment we
>> expect other libraries could be built on this interface.
>>
>> Signed-off-by: Danny Zhou <danny.zhou@intel.com>
>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>
> I like the ability to share a device between kernel and user mode networking.
> The model used for DPDK for this is really ugly and fragile/broken.
> Your proposal assumes that you fully trust the user mode networking application
> which is not a generally safe assumption.
>
> A device can DMA from/to any arbitrary physical memory.
> And it would be hard to use IOMMU to protect because the
> IOMMU doesn't know that the difference between the applications queue and
> the rest of the queues.
>
> At least with DPDK you can use VFIO, and you are claiming the whole device to
> allow protection against random memory being read/written.
>
>
However not all platforms support VFIO and when the application
only want to handle specific traffic types a queue maps well to
this.
--
John Fastabend Intel Corporation
^ permalink raw reply
* Re: [PATCH v2 7/7] driver-core: add preferred async probe option for built-in and modules
From: Dmitry Torokhov @ 2014-10-06 20:41 UTC (permalink / raw)
To: Luis R. Rodriguez
Cc: gregkh, tiwai, tj, arjan, teg, rmilasan, werner, oleg, hare,
bpoirier, santosh, pmladek, dbueso, mcgrof, linux-kernel,
Tetsuo Handa, Joseph Salisbury, Kay Sievers, One Thousand Gnomes,
Tim Gardner, Pierre Fersing, Andrew Morton, Nagalakshmi Nandigama,
Praveen Krishnamoorthy, Sreekanth Reddy, Abhijit Mahajan,
Casey Leedom, Hariprasad S
In-Reply-To: <1412372683-2003-8-git-send-email-mcgrof@do-not-panic.com>
Hi Luis,
On Fri, Oct 03, 2014 at 02:44:43PM -0700, Luis R. Rodriguez wrote:
> From: "Luis R. Rodriguez" <mcgrof@suse.com>
>
> At times we may wish to express the desire to prefer to have
> a device driver probe asynchronously by default. We cannot
> simply enable all device drivers to do this without vetting
> that userspace is prepared for this though given that some
> old userspace is expected to exist which is not equipped to
> deal with broad async probe support. This defines a new kernel
> parameter, bus.enable_kern_async=1, to help address this both to
> help enable async probe support for built-in drivers and to
> enable drivers to specify a preference to opt in for async
> probe support.
>
> If you have a device driver that should use async probe
> support when possible enable the prefer_async_probe bool.
>
> Folks wishing to test enabling async probe for all built-in
> drivers can enable bus.__DEBUG__kernel_force_mod_async_probe=1,
> if you use that though you are on your own.
Thank you for working on this. However there are still couple of issues
with the async probe.
1. As far as I can see you only handle the case when device is already
present and you load a driver. In this case we will do either async or
sync probing, depending on the driver/module settings. However if driver
has already been loaded/registered and we are adding a new device
(another module load, for example you load i2c controller module and it
enumerates its children, or driver signalled deferral during binding)
the probe will be synchronous regardless of the settings.
2. I thin kin the current implementation deferred binding process is
still single-threaded and basically synchronous.
Both of these issues stem form the fact that you only plugging into
bus_add_driver(), but you also need to plug into bus_probe_device(). I
believe I handled these 2 cases properly in the version of patch I sent
a couple of weeks ago so if you could incorporate that in your work that
would be great.
Thanks.
--
Dmitry
^ permalink raw reply
* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
From: John Fastabend @ 2014-10-06 20:37 UTC (permalink / raw)
To: Hannes Frederic Sowa
Cc: Daniel Borkmann, John Fastabend, Jesper Dangaard Brouer,
John W. Linville, Neil Horman, Florian Westphal, gerlitz.or,
netdev, john.ronciak, amirv, eric.dumazet, danny.zhou
In-Reply-To: <1412615032.3403.27.camel@localhost>
On 10/06/2014 10:03 AM, Hannes Frederic Sowa wrote:
> Hi John,
>
> On Mo, 2014-10-06 at 08:01 -0700, John Fastabend wrote:
>> On 10/06/2014 02:49 AM, Daniel Borkmann wrote:
>>> Hi John,
>>>
>>> On 10/06/2014 03:12 AM, John Fastabend wrote:
>>>> On 10/05/2014 05:29 PM, Florian Westphal wrote:
>>>>> John Fastabend <john.fastabend@gmail.com> wrote:
>>>>>> There is one critical difference when running with these interfaces
>>>>>> vs running without them. In the normal case the af_packet module
>>>>>> uses a standard descriptor format exported by the af_packet user
>>>>>> space headers. In this model because we are working directly with
>>>>>> driver queues the descriptor format maps to the descriptor format
>>>>>> used by the device. User space applications can learn device
>>>>>> information from the socket option PACKET_DEV_DESC_INFO which
>>>>>> should provide enough details to extrapulate the descriptor formats.
>>>>>> Although this adds some complexity to user space it removes the
>>>>>> requirement to copy descriptor fields around.
>>>>>
>>>>> I find it very disappointing that we seem to have to expose such
>>>>> hardware specific details to userspace via hw-independent interface.
>>>>
>>>> Well it was only for convenience if it doesn't fit as a socket
>>>> option we can remove it. We can look up the device using the netdev
>>>> name from the bind call. I see your point though so if there is
>>>> consensus that this is not needed that is fine.
>>>>
>>>>> How big of a cost are we talking about when you say that it 'removes
>>>>> the requirement to copy descriptor fields'?
>>>>
>>>> This was likely a poor description. If you want to let user space
>>>> poll on the ring (without using system calls or interrupts) then
>>>> I don't see how you can _not_ expose the ring directly complete with
>>>> the vendor descriptor formats.
>>>
>>> But how big is the concrete performance degradation you're seeing if you
>>> use an e.g. `netmap-alike` Linux-own variant as a hw-neutral interface
>>> that does *not* directly expose hw descriptor formats to user space?
>>
>> If we don't directly expose the hardware descriptor formats then we
>> need to somehow kick the driver when we want it to do the copy from
>> the driver descriptor format to the common descriptor format.
>>
>> This requires a system call as far as I can tell. Which has unwanted
>> overhead. I can micro-benchmark this if its helpful. But if we dredge
>> up Jesper's slides here we are really counting cycles so even small
>> numbers count if we want to hit line rate in a user space application
>> with 40Gpbs hardware.
>
> I agree, it seems pretty hard to achieve non-syscall sending on the same
> core, as we somehow must transfer control over to the kernel without
> doing a syscall.
>
> The only other idea would be to export machine code up to user space,
> which you can mmap(MAP_EXEC) from the socket somehow to make this API
> truly NIC agnostic without recompiling. This code then would transform
> the generic descriptors to the hardware specific ones. Seems also pretty
> hairy to do that correctly, maybe.
This seems more complicated then a minimal library in userspace to
load descriptor handling code based on device id.
>
>>> With 1 core netmap does 10G line-rate on 64b; I don't know their numbers
>>> on 40G when run on decent hardware though.
>>>
>>> It would really be great if we have something vendor neutral exposed as
>>> a stable ABI and could leverage emerging infrastructure we already have
>>> in the kernel such as eBPF and recent qdisc batching for raw sockets
>>> instead of reinventing the wheels. (Don't get me wrong, I would love to
>>> see AF_PACKET improved ...)
>>
>> I don't think the interface is vendor specific. It does require some
>> knowledge of the hardware descriptor layout though. It is though vendor
>> neutral from my point of view. I provided the ixgbe patch simple because
>> I'm most familiar with it and have a NIC here. If someone wants to send me
>> a Mellanox NIC I can give it a try although I was hoping to recruit Or or
>> Amir? The only hardware feature required is flow classification to queues
>> which seems to be common across 10Gbps and 40/100Gbps devices. So most
>> of the drivers should be able to support this.
>
> Does flow classification work at the same level as registering network
> addresses? Do I have to bind a e.g. multicast address wie ip maddr and
> then set up flow director/ntuple to get the packets on the correct user
> space facing queue or is it in case of the ixgbe card enough to just add
> those addresses via fdir? Have you thought about letting the
> kernel/driver handle that? In case one would like to connect their
> virtual machines via this interface to the network maybe we need central
> policing and resource constraints for queue management here?
Right now it is enough to program the addresses via fdir. This shouldn't
be ixgbe specific and I did take a quick look at the other drivers use
for fdir and believe it should work the same.
I'm not sure what you mean by kernel/driver handle this? Maybe you mean
from the socket interface? I thought about it briefly but opted for what
I think is more straight forward and using the fdir APIs.
As far as policing and resource constraints I think that is a user space
task. And yes I am working on some user space management applications but
they are still fairly rough.
>
> Do other drivers need a separate af-packet managed way to bind addresses
> to the queue? Maybe there are other quirks we might need to add to
> actually build support for other network interface cards. Would be great
> to at least examine one other driver in regard to this.
I _believe_ this interface is sufficient. I think one of the mellanox
interfaces could be implemented fairly easily.
>
> What other properties of the NIC must be exported? I think we also have
> to deal with MTUs currently configured in the NIC, promisc mode and
> maybe TSO?
We don't support per queue MTUs in the kernel. So I think this can be
learned using existing interfaces.
>
>> If your worried driver writers will implement the interface but not make
>> their descriptor formats easily available I considered putting the layout
>> in a header file in the uapi somewhere. Then we could just reject any
>> implementation that doesn't include the header file needed to use it
>> from user space.
>>
>> With regards to leveraging eBPF and qdisc batching I don't see how this
>> works with direct DMA and polling. Needed to give the lowest overhead
>> between kernel and user space. In this case we want to use the hardware
>> to do the filtering that would normally be done for eBPF and for many
>> use cases the hardware flow classifiers is sufficient.
>
> I agree, those features are hard to connect.
>
>> We already added a qdisc bypass option I see this as taking this path
>> further. I believe there is room for a continuum here. For basic cases
>> use af_packet v1,v2 for mmap rings but using common descriptors use
>> af_packet v3 and set QOS_BYASS. For absolute lowest overhead and
>> specific applications that don't need QOS, eBPF use this interface.
>
> You can simply write C code instead of eBPF code, yes.
>
> I find the six additional ndo ops a bit worrisome as we are adding more
> and more subsystem specific ndoops to this struct. I would like to see
> some unification here, but currently cannot make concrete proposals,
> sorry.
I agree it seems like a bit much. One thought was to split the ndo
ops into categories. Switch ops, MACVLAN ops, basic ops and with this
userspace queue ops. This sort of goes along with some of the switch
offload work which is going to add a handful more ops as best I can
tell.
>
> Patch 2/3 does not yet expose hw ring descriptors in uapi headers it
> seems?
>
Nope I wasn't sure if we wanted to put the ring desciptors in UAPI or
not. If we do I would likely push that as a 4th patch.
> Are there plans to push a user space framework (maybe even into the
> kernel), too? Will this be dpdk (alike) in the end?
I have patches to enable this interface on DPDK and it gets the
same performance as the other DPDK interfaces.
I've considered creating a minimal library to do basic tx/rx and
descriptor processing maybe in ./test or ,/scripts to give a usage
example that is easier to get ahold of and review without having
to pull in all the other things DPDK does that may or may not be
interesting depending on what you are doing and on what hardware.
>
> Bye,
> Hannes
>
>
^ permalink raw reply
* Re: [PATCH v2 7/7] driver-core: add preferred async probe option for built-in and modules
From: Luis R. Rodriguez @ 2014-10-06 20:36 UTC (permalink / raw)
To: Tejun Heo
Cc: Luis R. Rodriguez, gregkh, dmitry.torokhov, tiwai, arjan, teg,
rmilasan, werner, oleg, hare, bpoirier, santosh, pmladek, dbueso,
linux-kernel, Tetsuo Handa, Joseph Salisbury, Kay Sievers,
One Thousand Gnomes, Tim Gardner, Pierre Fersing, Andrew Morton,
Nagalakshmi Nandigama, Praveen Krishnamoorthy, Sreekanth Reddy,
Abhijit Mahajan, Casey Leedom
In-Reply-To: <20141006201926.GF18303@htj.dyndns.org>
On Mon, Oct 06, 2014 at 04:19:26PM -0400, Tejun Heo wrote:
> Hello, Luis.
>
> The patchset generally looks good to me. Please feel free to add
>
> Reviewed-by: Tejun Heo <tj@kernel.org>
Will do.
> A question below.
>
> On Fri, Oct 03, 2014 at 02:44:43PM -0700, Luis R. Rodriguez wrote:
> > + bus.enable_kern_async=1 [KNL]
> > + This tells the kernel userspace has been vetted for
> > + asynchronous probe support and can listen to the device
> > + driver prefer_async_probe flag for both built-in device
> > + drivers and modules.
>
> Do we intend to keep this param permanently? Isn't this more of a
> temp tool to be used during development? If so, maybe we should make
> that clear with __DEVEL__ too?
As its designed right now no, its not a temp tool, its there to
require compatibility with old userspace. For modules we can require
the module parameter but for built-in we need something else and this
is what came to mind. It is also what would allow the prefer_async_probe
flag too as otherwise we won't know if userspace is prepared.
If we want to get rid of it, it would mean that we're letting go of the idea
that some userspace might exist which depends on *not* doing async probe. As
such I would not consider it a __DEVEL__ param and it'd be a judgement call
to eventually *not require* it. I can see that happening but perhaps a large
series of kernels down the road?
Luis
^ permalink raw reply
* [net-next 2/2] openvswitch: fix a sparse warning
From: Andy Zhou @ 2014-10-06 20:22 UTC (permalink / raw)
To: davem; +Cc: netdev, Andy Zhou
In-Reply-To: <1412626971-30271-1-git-send-email-azhou@nicira.com>
Fix a sparse warning introduced by commit:
f5796684069e0c71c65bce6a6d4766114aec1396 (openvswitch: Add support for
Geneve tunneling.) caught by kbuild test robot:
reproduce:
# apt-get install sparse
# git checkout f5796684069e0c71c65bce6a6d4766114aec1396
# make ARCH=x86_64 allmodconfig
# make C=1 CF=-D__CHECK_ENDIAN__
#
#
# sparse warnings: (new ones prefixed by >>)
#
# >> net/openvswitch/vport-geneve.c:109:15: sparse: incorrect type in assignment (different base types)
# net/openvswitch/vport-geneve.c:109:15: expected restricted __be16 [usertype] sport
# net/openvswitch/vport-geneve.c:109:15: got int
# >> net/openvswitch/vport-geneve.c:110:56: sparse: incorrect type in argument 3 (different base types)
# net/openvswitch/vport-geneve.c:110:56: expected unsigned short [unsigned] [usertype] value
# net/openvswitch/vport-geneve.c:110:56: got restricted __be16 [usertype] sport
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
---
net/openvswitch/vport-geneve.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/net/openvswitch/vport-geneve.c b/net/openvswitch/vport-geneve.c
index 5572d48..910b3ef 100644
--- a/net/openvswitch/vport-geneve.c
+++ b/net/openvswitch/vport-geneve.c
@@ -104,10 +104,9 @@ static int geneve_get_options(const struct vport *vport,
struct sk_buff *skb)
{
struct geneve_port *geneve_port = geneve_vport(vport);
- __be16 sport;
+ struct inet_sock *sk = inet_sk(geneve_port->gs->sock->sk);
- sport = ntohs(inet_sk(geneve_port->gs->sock->sk)->inet_sport);
- if (nla_put_u16(skb, OVS_TUNNEL_ATTR_DST_PORT, sport))
+ if (nla_put_u16(skb, OVS_TUNNEL_ATTR_DST_PORT, ntohs(sk->inet_sport)))
return -EMSGSIZE;
return 0;
}
--
1.7.9.5
^ permalink raw reply related
* [net-next 1/2] net: fix a sparse warning
From: Andy Zhou @ 2014-10-06 20:22 UTC (permalink / raw)
To: davem; +Cc: netdev, Andy Zhou
Fix a sparse warning introduced by Commit
0b5e8b8eeae40bae6ad7c7e91c97c3c0d0e57882 (net: Add Geneve tunneling
protocol driver) caught by kbuild test robot:
# apt-get install sparse
# git checkout 0b5e8b8eeae40bae6ad7c7e91c97c3c0d0e57882
# make ARCH=x86_64 allmodconfig
# make C=1 CF=-D__CHECK_ENDIAN__
#
#
# sparse warnings: (new ones prefixed by >>)
#
# >> net/ipv4/geneve.c:230:42: sparse: incorrect type in assignment (different base types)
# net/ipv4/geneve.c:230:42: expected restricted __be32 [addressable] [assigned] [usertype] s_addr
# net/ipv4/geneve.c:230:42: got unsigned long [unsigned] <noident>
#
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
---
net/ipv4/geneve.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/ipv4/geneve.c b/net/ipv4/geneve.c
index f008c55..065cd94 100644
--- a/net/ipv4/geneve.c
+++ b/net/ipv4/geneve.c
@@ -227,7 +227,7 @@ static struct socket *geneve_create_sock(struct net *net, bool ipv6,
udp_conf.family = AF_INET6;
} else {
udp_conf.family = AF_INET;
- udp_conf.local_ip.s_addr = INADDR_ANY;
+ udp_conf.local_ip.s_addr = htonl(INADDR_ANY);
}
udp_conf.local_udp_port = port;
--
1.7.9.5
^ permalink raw reply related
* Re: [PATCH] net: Add ndo_gso_check
From: Or Gerlitz @ 2014-10-06 20:22 UTC (permalink / raw)
To: Tom Herbert
Cc: Alexander Duyck, John Fastabend, Jeff Kirsher, David Miller,
Linux Netdev List, Thomas Graf, Pravin Shelar, Andy Zhou
In-Reply-To: <CA+mtBx-AW65w7grfb7zTaDKzKQ2n_7JQuLquj1Ayopa4gdiF6Q@mail.gmail.com>
On Mon, Oct 6, 2014 at 8:59 PM, Tom Herbert <therbert@google.com> wrote:
> On Sun, Oct 5, 2014 at 12:13 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>> RX wise, Linux tells the driver that UDP port X would be used for
>> VXLAN, right? and indeed, it's possible for some HW implementations
>> not to support RX offloading (checksum) for both VXLAN and NVGRE @ the
>> same time over the same port. But TX/GRO wise, you're probably
>> correct. The thing is that from the user POV they need solution that
>> works for both RX and TX offloading.
> I think from a user POV we want a solution that supports RX and TX
> offloading across the widest range of protocols. This is accomplished
> by implementing protocol agnostic mechanisms like CHECKSUM_COMPLETE
> and protocol agnostic UDP tunnel TSO like we've described. IMO, the
> fact that we have devices that implement protocol specific mechanisms
> for NVGRE and VXLAN should be considered legacy support in the stack,
> for new UDP encapsulation protocols we should not expose specifics in
> the stack in either by adding a GSO type for each protocol, nor
> ndo_add_foo_port for each protocol-- these things will not scale and
> unnecessarily complicate the core stack.
I tend to generally agree to the wind that blows from your writeup, namely:
UDP encapsulation offloads wise, we should pose few general
requirements to NICs to be implemented by vendors in their tomorrow's
HW and treat the current generation (these 4-5 drivers with their
limitations as legacy which should be supported but not state the
stack overall design).
Still we should seek more ways to reduce the pain/amount of
not-well-defined-configurations when these drivers are there and the
stack goes through this upside-down turnaround changes. OTOH you
didn't accept my SKB coloring suggestion for GSO inspection, and OTOH
I guess we can live with some sort of generic helper in the form of
what you suggested, but like it or not, getting rid of
ndo_add_vxlan_port will simply break things out.
Are we going to have a session on the encapsulation/offloads design @ LPC?
I think a replay of your LKS presentation along with open discussion
on how to get there with the legacy requirements could be very
helpful.
Or.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox