* Re: [PATCH v3 2/4] net: Handle encapsulated offloads before fragmentation or handing to lower dev
From: Jeff Kirsher @ 2012-12-07 23:15 UTC (permalink / raw)
To: Joseph Gasparakis
Cc: davem, shemminger, chrisw, gospo, Alexander Duyck, netdev,
linux-kernel, dmitry, saeed.bishara
In-Reply-To: <1354845419-22483-3-git-send-email-joseph.gasparakis@intel.com>
[-- Attachment #1: Type: text/plain, Size: 667 bytes --]
On 12/06/2012 05:56 PM, Joseph Gasparakis wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
>
> This change allows the VXLAN to enable Tx checksum offloading even on
> devices that do not support encapsulated checksum offloads. The
> advantage to this is that it allows for the lower device to change due
> to routing table changes without impacting features on the VXLAN itself.
>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
> net/core/dev.c | 15 +++++++++++++--
> net/ipv4/ip_output.c | 4 ++++
> 2 files changed, 17 insertions(+), 2 deletions(-)
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 899 bytes --]
^ permalink raw reply
* Re: [PATCH v3 1/4] net: Add support for hardware-offloaded encapsulation
From: Jeff Kirsher @ 2012-12-07 23:15 UTC (permalink / raw)
To: Joseph Gasparakis
Cc: davem, shemminger, chrisw, gospo, netdev, linux-kernel, dmitry,
saeed.bishara, Peter P Waskiewicz Jr, Alexander Duyck
In-Reply-To: <1354845419-22483-2-git-send-email-joseph.gasparakis@intel.com>
[-- Attachment #1: Type: text/plain, Size: 848 bytes --]
On 12/06/2012 05:56 PM, Joseph Gasparakis wrote:
> This patch adds support in the kernel for offloading in the NIC Tx and Rx
> checksumming for encapsulated packets (such as VXLAN and IP GRE).
>
> Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
> Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
> include/linux/ip.h | 5 ++
> include/linux/ipv6.h | 5 ++
> include/linux/netdevice.h | 2 +
> include/linux/skbuff.h | 90 ++++++++++++++++++++++++++++++++++++++++++++-
> include/linux/tcp.h | 10 +++++
> include/linux/udp.h | 5 ++
> net/core/skbuff.c | 9 ++++
> 7 files changed, 125 insertions(+), 1 deletions(-)
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 899 bytes --]
^ permalink raw reply
* Re: [PATCH net-next v3 0/3] Multiqueue support in virtio-net
From: Stephen Hemminger @ 2012-12-07 23:13 UTC (permalink / raw)
To: David Miller
Cc: krkumar2, kvm, mst, netdev, linux-kernel, virtualization,
bhutchings, jwhan, shiyer
In-Reply-To: <20121207.153556.989159003987459765.davem@davemloft.net>
On Fri, 07 Dec 2012 15:35:56 -0500 (EST)
David Miller <davem@davemloft.net> wrote:
> From: Jason Wang <jasowang@redhat.com>
> Date: Sat, 8 Dec 2012 01:04:54 +0800
>
> > This series is an update version (hope the final version) of multiqueue
> > (VIRTIO_NET_F_MQ) support in virtio-net driver. All previous comments were
> > addressed, the work were based on Krishna Kumar's work to let virtio-net use
> > multiple rx/tx queues to do the packets reception and transmission. Performance
> > test show the aggregate latency were increased greately but may get some
> > regression in small packet transmission. Due to this, multiqueue were disabled
> > by default. If user want to benefit form the multiqueue, ethtool -L could be
> > used to enable the feature.
> >
> > Please review and comments.
> >
> > A protype implementation of qemu-kvm support could by found in
> > git://github.com/jasowang/qemu-kvm-mq.git. To start a guest with two queues, you
> > could specify the queues parameters to both tap and virtio-net like:
> >
> > ./qemu-kvm -netdev tap,queues=2,... -device virtio-net-pci,queues=2,...
> >
> > then enable the multiqueue through ethtool by:
> >
> > ethtool -L eth0 combined 2
>
> It seems like most, if not all, of the feedback given for this series
> has been addressed by Jason.
>
> Can I get some ACKs?
Other than the minor style nit in the first patch, I see no issues.
This is really needed by Virtual Routers.
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
^ permalink raw reply
* Re: [PATCH net-next v3 1/3] virtio-net: separate fields of sending/receiving queue from virtnet_info
From: Stephen Hemminger @ 2012-12-07 23:11 UTC (permalink / raw)
To: Jason Wang
Cc: krkumar2, kvm, mst, netdev, linux-kernel, virtualization,
bhutchings, jwhan, davem, shiyer
In-Reply-To: <1354899897-10423-2-git-send-email-jasowang@redhat.com>
Minor style issue reported by checkpatch which can be fixed after merge.
Although sizeof is actually an operator in C, it is considered correct
style to treat it as a function.
WARNING: sizeof hdr->hdr should be sizeof(hdr->hdr)
#293: FILE: drivers/net/virtio_net.c:395:
+ sg_set_buf(rq->sg, &hdr->hdr, sizeof hdr->hdr);
WARNING: sizeof hdr->mhdr should be sizeof(hdr->mhdr)
#552: FILE: drivers/net/virtio_net.c:641:
+ sg_set_buf(sq->sg, &hdr->mhdr, sizeof hdr->mhdr);
WARNING: sizeof hdr->hdr should be sizeof(hdr->hdr)
#555: FILE: drivers/net/virtio_net.c:643:
+ sg_set_buf(sq->sg, &hdr->hdr, sizeof hdr->hdr);
^ permalink raw reply
* Re: [PATCH net-next 03/10] tipc: sk_recv_queue size check only for connectionless sockets
From: Jon Maloy @ 2012-12-07 22:30 UTC (permalink / raw)
To: Neil Horman; +Cc: Paul Gortmaker, David Miller, netdev, Ying Xue
In-Reply-To: <20121207192030.GA30339@hmsreliant.think-freely.org>
On 12/07/2012 02:20 PM, Neil Horman wrote:
> On Fri, Dec 07, 2012 at 09:28:11AM -0500, Paul Gortmaker wrote:
>> From: Ying Xue <ying.xue@windriver.com>
>>
>> The sk_receive_queue limit control is currently performed for
>> all arriving messages, disregarding socket and message type.
>> But for connected sockets this check is redundant, since the protocol
>> flow control already makes queue overflow impossible.
>>
> Can you explain where that occurs?
It happens in the functions port_dispatcher_sigh() and tipc_send(),
among other places. Both are to be found in the file port.c, which
was supposed to contain the 'generic' (i.e., API independent) part
of the send/receive code.
Now that we have only one API left, the socket API, we are
planning to merge the code in socket.c and port.c, and get rid of
some code overhead.
The flow control in TIPC is message based, where the sender
requires to receive an explicit acknowledge message for each
512 message the receiver reads to user space.
If the sender has more than 1024 messages outstanding without having
received an acknowledge he will be suspended or receive EAGAIN until
he does.
The plan going forward is to replace this mechanism with a more
standard looking byte based flow control, while maintaining
backwards compatibility.
> I see where the tipc dispatch function calls
> sk_add_backlog, which checks the per socket recieve queue (regardless of weather
> the receiving socket is connection oriented or connectionless), but if the
> receiver doesn't call receive very often, This just adds a check against your
> global limit, doing nothing for your per-socket limits.
OVERLOAD_LIMIT_BASE is tested against a per-socket message counter, so it _is_
our per-socket limit. In fact, TIPC connectionless overflow control currently
is a kind of a hybrid, based on a message counter when the socket is not locked,
and based on sk_rcv_queue's byte limit when a message has to be added to the
backlog.
We are planning to fix this inconsistency too.
In fact it seems to
> repeat the same check twice, as in the worst case of the incomming message being
> TIPC_LOW_IMPORTANCE, its just going to check that the global limit is exactly
> OVERLOAD_LIMIT_BASE/2 again.
Yes, you are right. The intention is that only the first test,
if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2)){..}
will be run for the vast majority of messages, since we must assume
that there is no overload most of the time.
An inelegant optimization, perhaps, but not logically wrong.
///jon
>
> Neil
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply
* Re: [RFC PATCH] af_packet: don't to defrag shared skb
From: Johannes Berg @ 2012-12-07 22:23 UTC (permalink / raw)
To: David Miller; +Cc: eric, netdev, linux-wireless, linville, Eric Dumazet
In-Reply-To: <1354918363.9124.29.camel@jlt4.sipsolutions.net>
Oh I should say ... IP is like black magic to me ;-)
> - if (!pskb_may_pull(skb, sizeof(struct iphdr)))
> + if (!skb_copy_bits(skb, 0, &iph, sizeof(iph)))
> return skb;
>
> - iph = ip_hdr(skb);
> - if (iph->ihl < 5 || iph->version != 4)
> + if (iph.ihl < 5 || iph.version != 4)
> return skb;
> - if (!pskb_may_pull(skb, iph->ihl*4))
> - return skb;
> - iph = ip_hdr(skb);
> - len = ntohs(iph->tot_len);
> - if (skb->len < len || len < (iph->ihl * 4))
> +
> + len = ntohs(iph.tot_len);
> + if (skb->len < len || len < (iph.ihl * 4))
> return skb;
>
> - if (ip_is_fragment(ip_hdr(skb))) {
> + if (ip_is_fragment(&iph)) {
> skb = skb_share_check(skb, GFP_ATOMIC);
> if (skb) {
> + if (!pskb_may_pull(skb, iph.ihl*4))
> + return skb;
I moved this here but I have no idea what it does. Asking if we can pull
this here seems a bit pointless, and in the previous place it seemed
similarly pointless since we only use the static part of the IP header
and don't look at any options... So to me it seems this pskb_may_pull()
could just be removed, and that most likely means I'm messing with code
I don't understand :-)
johannes
^ permalink raw reply
* Re: [RFC PATCH] af_packet: don't to defrag shared skb
From: Johannes Berg @ 2012-12-07 22:12 UTC (permalink / raw)
To: David Miller
Cc: eric-EVVnsjFE0OfYtjvyW6yDsg, netdev-u79uwXL29TY76Z2rM5mHXA,
linux-wireless-u79uwXL29TY76Z2rM5mHXA,
linville-2XuSBdqkA4R54TAoqtyWWQ, Eric Dumazet
In-Reply-To: <1354916502.9124.18.camel-8Nb76shvtaUJvtFkdXX2HixXY32XiHfO@public.gmane.org>
On Fri, 2012-12-07 at 22:41 +0100, Johannes Berg wrote:
> wpa_supplicant opens a packet socket for ETH_P_EAPOL, which indirectly
> eventually calls dev_add_pack(). But if you do the same for another
> socket, you'll get the same again, and then deliver_skb() will deliver
> only a refcounted packet to the prot_hook->func().
>
> This seems like it could very well cause the problem?
Ok I couldn't reproduce it because suricata didn't work this way, but I
did try using tcpdump (without the fanout) and deliver_skb() *is* called
twice for each packet as I thought. It thus seems the problem is
entirely in af_packet itself. It was changed a bit by Eric Dumazet in
bc416d9768 but the original code goes back to the original defrag
support in 7736d33f4, as far as I can tell. aec27311c changed the code
to not do skb_clone(), but it seems the skb_share_check() should be
before the pskb_pull().
Well, it seems ip_check_defrag() should simply not modify the SKB before
it unshares it ... like this maybe:
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 448e685..8d5cc75 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -707,28 +707,27 @@ EXPORT_SYMBOL(ip_defrag);
struct sk_buff *ip_check_defrag(struct sk_buff *skb, u32 user)
{
- const struct iphdr *iph;
+ struct iphdr iph;
u32 len;
if (skb->protocol != htons(ETH_P_IP))
return skb;
- if (!pskb_may_pull(skb, sizeof(struct iphdr)))
+ if (!skb_copy_bits(skb, 0, &iph, sizeof(iph)))
return skb;
- iph = ip_hdr(skb);
- if (iph->ihl < 5 || iph->version != 4)
+ if (iph.ihl < 5 || iph.version != 4)
return skb;
- if (!pskb_may_pull(skb, iph->ihl*4))
- return skb;
- iph = ip_hdr(skb);
- len = ntohs(iph->tot_len);
- if (skb->len < len || len < (iph->ihl * 4))
+
+ len = ntohs(iph.tot_len);
+ if (skb->len < len || len < (iph.ihl * 4))
return skb;
- if (ip_is_fragment(ip_hdr(skb))) {
+ if (ip_is_fragment(&iph)) {
skb = skb_share_check(skb, GFP_ATOMIC);
if (skb) {
+ if (!pskb_may_pull(skb, iph.ihl*4))
+ return skb;
if (pskb_trim_rcsum(skb, len))
return skb;
memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
johannes
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* Re: [RFC PATCH] af_packet: don't to defrag shared skb
From: Johannes Berg @ 2012-12-07 21:56 UTC (permalink / raw)
To: Eric Leblond
Cc: David Miller, netdev-u79uwXL29TY76Z2rM5mHXA,
linux-wireless-u79uwXL29TY76Z2rM5mHXA,
linville-2XuSBdqkA4R54TAoqtyWWQ
In-Reply-To: <1354916768.4530.18.camel@tiger2>
HI,
> > That's interesting. A single one seems odd. I might have expected two,
> > but not one. Well, since you removed the crash ... I guess I'll have to
> > believe that there's just one and the second one doesn't show up because
> > we crashed before :-)
>
> It was the case with initial code but I've suppressed the BUG() call and
> replaced it with a return ;)
Right. Well, does it actually still work then? I wonder if then you
don't see a second packet because the first one doesn't make it
through ... I'm thinking that this is just an internal af_packet problem
with multiple listeners.
> I've no simple code available to test it. I've add the problem when
> running suricata. Maybe you could use it. It is packaged in most
> distribution now.
> To enable packet fanout. Modify default /etc/suricata/suricata.yaml to
> have something like:
> af-packet:
> - interface: wlan0
> # Number of receive threads (>1 will enable experimental flow pinned
> # runmode)
> threads: 3
That'll should do, thanks.
johannes
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [RFC PATCH] af_packet: don't to defrag shared skb
From: Eric Leblond @ 2012-12-07 21:46 UTC (permalink / raw)
To: Johannes Berg; +Cc: David Miller, netdev, linux-wireless, linville
In-Reply-To: <1354915824.9124.11.camel@jlt4.sipsolutions.net>
Hi,
On Fri, 2012-12-07 at 22:30 +0100, Johannes Berg wrote:
>
> > Wireless folks, please take a look. The issue is that,
> > under the circumstances listed below, we get SKBs in
> > the AF_PACKET input path that are shared.
>
> Ok so I took a look, but I can't see where the wireless stack is going
> wrong.
>
> > Given the logic present in ieee80211_deliver_skb() I think
> > the mac80211 code doesn't expect this either.
>
> This is correct, but the driver should never give us a shared skb. From
> the other mail it seems Eric is using iwlwifi, which is definitely not
> creating shared SKBs. Nothing in mac80211 creates them either.
>
> > > I've observed this crash under the following condition:
> > > 1. a program is listening to an wifi interface (let say wlan0)
> > > 2. it is using fanout capture in flow load balancing mode
> > > 3. defrag option is on on the fanout socket
>
> How do you set this up, and what does it do? I'd like to try to
> reproduce this.
> > > 4. the interface disconnect (radio down for example)
> > > 5. the interface reconnect (radio switched up)
> > > 6. once reconnected a single packet is seen with skb->users=2
>
> That's interesting. A single one seems odd. I might have expected two,
> but not one. Well, since you removed the crash ... I guess I'll have to
> believe that there's just one and the second one doesn't show up because
> we crashed before :-)
It was the case with initial code but I've suppressed the BUG() call and
replaced it with a return ;)
> > So if we look at ieee80211_deliver_skb(), it has code to deal with unaligned
> > packet headers, wherein it memoves() the data into a better aligned location.
> >
> > But if these SKBs really are skb_shared(), this packet data
> > modification is illegal.
> >
> > I suspect that the assumptions built into this unaligned data handling
> > code, and AF_PACKET, are correct. Meaning that we should never see
> > skb_shared() packets here. We just have a missing skb_copy()
> > somewhere in mac80211, Johannes can you please take a look?
>
> My first theory was related to multiple virtual interfaces, but Eric
> didn't say he was running that, but we use skb_copy() for that in
> ieee80211_prepare_and_rx_handle(). That's not necessarily the most
> efficient (another reason for drivers to use paged RX here) but clearly
> not causing the issue.
>
> The only other theory I can come up with right now is that the skb_get()
> happens in deliver_skb via __netif_receive_skb. Keeping in mind that
> wpa_supplicant might have another packet socket open for authentication
> packets, that seems like a possibility. I'll test it once I figure out
> how to do this "defrag" option you speak of :)
I've no simple code available to test it. I've add the problem when
running suricata. Maybe you could use it. It is packaged in most
distribution now.
To enable packet fanout. Modify default /etc/suricata/suricata.yaml to
have something like:
af-packet:
- interface: wlan0
# Number of receive threads (>1 will enable experimental flow pinned
# runmode)
threads: 3
Start it with: suricata --af-packet=wlan0
Then get wlan0 interface down and up. After a few seconds, the crash
will occur.
It is a bit complicated for a simple test case. I can cook a little
example code if you want.
BR,
--
Eric Leblond <eric@regit.org>
Blog: https://home.regit.org/
^ permalink raw reply
* [PATCH net-next #2 1/1] r8169: workaround for missing extended GigaMAC registers
From: Francois Romieu @ 2012-12-07 21:20 UTC (permalink / raw)
To: netdev; +Cc: David Miller, Lee Chun-Yi, Wang YanQing, Hayes Wang
GigaMAC registers have been reported left unitialized in several
situations:
- after cold boot from power-off state
- after S3 resume
Tweaking rtl_hw_phy_config takes care of both.
This patch removes an excess entry (",") at the end of the exgmac_reg
array as well.
Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
Signed-off-by: Wang YanQing <udknight@gmail.com>
Cc: Hayes Wang <hayeswang@realtek.com>
---
drivers/net/ethernet/realtek/r8169.c | 42 ++++++++++++++++++++----------------
1 file changed, 24 insertions(+), 18 deletions(-)
- Rebased against net-next with rtl8168e_2_hw_phy_config vs rtl8168f_hw_phy_config
fixup as per davem's request.
- Removed a stealth "," after the last record of exgmac_reg. Not sure where it poked
into...
diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 891feee..1747256 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -3120,6 +3120,23 @@ static void rtl8168e_1_hw_phy_config(struct rtl8169_private *tp)
rtl_writephy(tp, 0x0d, 0x0000);
}
+static void rtl_rar_exgmac_set(struct rtl8169_private *tp, u8 *addr)
+{
+ const u16 w[] = {
+ addr[0] | (addr[1] << 8),
+ addr[2] | (addr[3] << 8),
+ addr[4] | (addr[5] << 8)
+ };
+ const struct exgmac_reg e[] = {
+ { .addr = 0xe0, ERIAR_MASK_1111, .val = w[0] | (w[1] << 16) },
+ { .addr = 0xe4, ERIAR_MASK_1111, .val = w[2] },
+ { .addr = 0xf0, ERIAR_MASK_1111, .val = w[0] << 16 },
+ { .addr = 0xf4, ERIAR_MASK_1111, .val = w[1] | (w[2] << 16) }
+ };
+
+ rtl_write_exgmac_batch(tp, e, ARRAY_SIZE(e));
+}
+
static void rtl8168e_2_hw_phy_config(struct rtl8169_private *tp)
{
static const struct phy_reg phy_reg_init[] = {
@@ -3204,6 +3221,9 @@ static void rtl8168e_2_hw_phy_config(struct rtl8169_private *tp)
rtl_writephy(tp, 0x1f, 0x0000);
r8168_aldps_enable_1(tp);
+
+ /* Broken BIOS workaround: feed GigaMAC registers with MAC address. */
+ rtl_rar_exgmac_set(tp, tp->dev->dev_addr);
}
static void rtl8168f_hw_phy_config(struct rtl8169_private *tp)
@@ -3740,33 +3760,19 @@ static void rtl8169_init_phy(struct net_device *dev, struct rtl8169_private *tp)
static void rtl_rar_set(struct rtl8169_private *tp, u8 *addr)
{
void __iomem *ioaddr = tp->mmio_addr;
- u32 high;
- u32 low;
-
- low = addr[0] | (addr[1] << 8) | (addr[2] << 16) | (addr[3] << 24);
- high = addr[4] | (addr[5] << 8);
rtl_lock_work(tp);
RTL_W8(Cfg9346, Cfg9346_Unlock);
- RTL_W32(MAC4, high);
+ RTL_W32(MAC4, addr[4] | addr[5] << 8);
RTL_R32(MAC4);
- RTL_W32(MAC0, low);
+ RTL_W32(MAC0, addr[0] | addr[1] << 8 | addr[2] << 16 | addr[3] << 24);
RTL_R32(MAC0);
- if (tp->mac_version == RTL_GIGA_MAC_VER_34) {
- const struct exgmac_reg e[] = {
- { .addr = 0xe0, ERIAR_MASK_1111, .val = low },
- { .addr = 0xe4, ERIAR_MASK_1111, .val = high },
- { .addr = 0xf0, ERIAR_MASK_1111, .val = low << 16 },
- { .addr = 0xf4, ERIAR_MASK_1111, .val = high << 16 |
- low >> 16 },
- };
-
- rtl_write_exgmac_batch(tp, e, ARRAY_SIZE(e));
- }
+ if (tp->mac_version == RTL_GIGA_MAC_VER_34)
+ rtl_rar_exgmac_set(tp, addr);
RTL_W8(Cfg9346, Cfg9346_Lock);
--
1.7.11.7
^ permalink raw reply related
* Re: [RFC PATCH] af_packet: don't to defrag shared skb
From: Johannes Berg @ 2012-12-07 21:41 UTC (permalink / raw)
To: David Miller; +Cc: eric, netdev, linux-wireless, linville
In-Reply-To: <1354915824.9124.11.camel@jlt4.sipsolutions.net>
On Fri, 2012-12-07 at 22:30 +0100, Johannes Berg wrote:
> The only other theory I can come up with right now is that the skb_get()
> happens in deliver_skb via __netif_receive_skb. Keeping in mind that
> wpa_supplicant might have another packet socket open for authentication
> packets, that seems like a possibility. I'll test it once I figure out
> how to do this "defrag" option you speak of :)
Hmm now I'm venturing into the unknown (for me) and realm of
speculation...
wpa_supplicant opens a packet socket for ETH_P_EAPOL, which indirectly
eventually calls dev_add_pack(). But if you do the same for another
socket, you'll get the same again, and then deliver_skb() will deliver
only a refcounted packet to the prot_hook->func().
This seems like it could very well cause the problem?
johannes
^ permalink raw reply
* Re: [RFC PATCH] af_packet: don't to defrag shared skb
From: Johannes Berg @ 2012-12-07 21:30 UTC (permalink / raw)
To: David Miller; +Cc: eric, netdev, linux-wireless, linville
In-Reply-To: <20121207.153134.25835204617509469.davem@davemloft.net>
> Wireless folks, please take a look. The issue is that,
> under the circumstances listed below, we get SKBs in
> the AF_PACKET input path that are shared.
Ok so I took a look, but I can't see where the wireless stack is going
wrong.
> Given the logic present in ieee80211_deliver_skb() I think
> the mac80211 code doesn't expect this either.
This is correct, but the driver should never give us a shared skb. From
the other mail it seems Eric is using iwlwifi, which is definitely not
creating shared SKBs. Nothing in mac80211 creates them either.
> > I've observed this crash under the following condition:
> > 1. a program is listening to an wifi interface (let say wlan0)
> > 2. it is using fanout capture in flow load balancing mode
> > 3. defrag option is on on the fanout socket
How do you set this up, and what does it do? I'd like to try to
reproduce this.
> > 4. the interface disconnect (radio down for example)
> > 5. the interface reconnect (radio switched up)
> > 6. once reconnected a single packet is seen with skb->users=2
That's interesting. A single one seems odd. I might have expected two,
but not one. Well, since you removed the crash ... I guess I'll have to
believe that there's just one and the second one doesn't show up because
we crashed before :-)
> So if we look at ieee80211_deliver_skb(), it has code to deal with unaligned
> packet headers, wherein it memoves() the data into a better aligned location.
>
> But if these SKBs really are skb_shared(), this packet data
> modification is illegal.
>
> I suspect that the assumptions built into this unaligned data handling
> code, and AF_PACKET, are correct. Meaning that we should never see
> skb_shared() packets here. We just have a missing skb_copy()
> somewhere in mac80211, Johannes can you please take a look?
My first theory was related to multiple virtual interfaces, but Eric
didn't say he was running that, but we use skb_copy() for that in
ieee80211_prepare_and_rx_handle(). That's not necessarily the most
efficient (another reason for drivers to use paged RX here) but clearly
not causing the issue.
The only other theory I can come up with right now is that the skb_get()
happens in deliver_skb via __netif_receive_skb. Keeping in mind that
wpa_supplicant might have another packet socket open for authentication
packets, that seems like a possibility. I'll test it once I figure out
how to do this "defrag" option you speak of :)
johannes
^ permalink raw reply
* Re: BUG: scheduling while atomic: ifup-bonding/3711/0x00000002 -- V3.6.7
From: Jay Vosburgh @ 2012-12-07 21:00 UTC (permalink / raw)
To: Linda Walsh; +Cc: Cong Wang, LKML, Linux Kernel Network Developers
In-Reply-To: <50C24C44.8000809@tlinx.org>
Linda Walsh <lkml@tlinx.org> wrote:
>Sorry for the delay.... my distro (Suse) has made rebooting my system
>a chore (have to often boot from rescue to get it to come up because
>they put mount libs in /usr/lib expecting they will always boot
>from their ram disk -- preventing those of use who boot directly
>from disk from doing so easily...grrr.
>
>Jay Vosburgh wrote:
>> The miimon functionality is used to check link state and notice
>> when slaves lose carrier.
>---
> If I am running 'rr' on 2 channels -- specifically for the purpose
>of link speed aggregation (getting 1 20Gb channel out of 2 10Gb channels)
>I'm not sure I see how miimon would provide benefit. -- if 1 link dies,
>the other, being on the same card is likely to be dead too, so would
>it really serve a purpose?
Perhaps, but if the link partner experiences a failure, that may
be a different situation. Not all failures will necessarily cause both
links to fail simultaneously.
>> Running without it will not detect failure of
>> the bonding slaves, which is likely not what you want. The mode,
>> balance-rr in your case, is what selects the load balance to use, and is
>> separate from the miimon.
>>
>----
> Wouldn't the entire link die if a slave dies -- like RAID0, 1 disk
>dies, the entire link goes down?
No; failure of a single slave does not cause the entire bond to
fail (unless that is the last available slave). For round robin, a
failed slave is taken out of the set used to transmit traffic, and any
remaining slaves continue to round robin amongst themselves.
> The other end (windows) doesn't dynamically config for a static-link
>aggregation, so I don't think it would provide benefit.
So it (windows) has no means to disable (and discontinue use of)
one channel of the aggregation should it fail, even in a static link
aggregation?
>> That said, the problem you're seeing appears to be caused by two
>> things: bonding holds a lock (in addition to RTNL) when calling
>> __ethtool_get_settings, and an ixgbe function in the call path to
>> retrieve the settings, ixgbe_acquire_swfw_sync_X540, can sleep.
>>
>> The test patch above handles one case in bond_enslave, but there
>> is another case in bond_miimon_commit when a slave changes link state
>> from down to up, which will occur shortly after the slave is added.
>>
>----
> Added your 2nd patch -- no more error messages...
>
> however -- likely unrelated, the max speed read or write I am seeing
>is about 500MB/s, and that is rare -- usually it's barely <3x a 1Gb
>network speed. (119/125 MB R/W). I'm not at all sure it's really
>combining the links
>properly. Anyway to verify that?
How are you testing the throughput? If you configure the
aggregation with just one link, how does the throughput compare to the
aggregation with both links?
It most likely is combining links properly, but any link
aggregation scheme has tradeoffs, and the best load balance algorithm to
use depends upon the work load. Two aggregated 10G links are not
interchangable with a single 20G link.
For a round robin transmission scheme, issues arise because
packets are delivered at the other end out of order. This in turn
triggers various TCP behaviors to deal with what is perceived to be
transmission errors or lost packets (TCP fast retransmit being the most
notable). This usually results in a single TCP connection being unable
to completely saturate a round-robin aggregated set of links.
There are a few parameters on linux that can be adjusted. I
don't know what the windows equivalents might be.
On linux, adjusting the net.ipv4.tcp_reordering sysctl value
will increase the tolerance for out of order delivery.
The sysctl is adjusted via something like
sysctl -w net.ipv4.tcp_reordering=10
the default value is 3, and higher values increase the tolerance
for out of order delivery. If memory serves, the setting is applied to
connections as they are created, so existing connections will not see
changes.
Also, adjusting the packet coalescing setting for the receiving
devices may also permit higher throughput. The packet coalescing setting
is adjusted via ethtool; the current settings can be viewed via
ethtool -c eth0
and then adjusted via something like
ethtool -C eth0 rx-usecs 30
I've seen reports that raising the "rx-usecs" parameter at the
receiver can increase the round-robin throughput. My recollection is
that the value used was 30, but the best settings will likely be
dependent upon your particular hardware and configuration.
> On the windows side it shows the bond-link as a 20Gb connection, but
>I don't see anyplace for something similar on linux.
There isn't any such indicator; bonding does not advertise its
link speed as the sum of its slaves link speeds.
-J
---
-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
^ permalink raw reply
* Re: [RFC PATCH] af_packet: don't to defrag shared skb
From: Eric Leblond @ 2012-12-07 20:54 UTC (permalink / raw)
To: David Miller; +Cc: netdev, linux-wireless, johannes, linville
In-Reply-To: <20121207.153134.25835204617509469.davem@davemloft.net>
Hi,
On Fri, 2012-12-07 at 15:31 -0500, David Miller wrote:
> From: Eric Leblond <eric@regit.org>
> Date: Fri, 7 Dec 2012 19:56:01 +0100
>
> Wireless folks, please take a look. The issue is that,
> under the circumstances listed below, we get SKBs in
> the AF_PACKET input path that are shared.
>
> Given the logic present in ieee80211_deliver_skb() I think
> the mac80211 code doesn't expect this either.
>
> More commentary from me below:
>
> > This patch is adding a check on skb before trying to defrag the
> > packet for the hash computation in fanout mode. The goal of this
> > patch is to avoid an kernel crash in pskb_expand_head.
> > It appears that under some specific condition there is a shared
> > skb reaching the defrag code and this lead to a crash due to the
> > following code:
> >
> > if (skb_shared(skb))
> > BUG();
> >
> > I've observed this crash under the following condition:
> > 1. a program is listening to an wifi interface (let say wlan0)
> > 2. it is using fanout capture in flow load balancing mode
> > 3. defrag option is on on the fanout socket
> > 4. the interface disconnect (radio down for example)
> > 5. the interface reconnect (radio switched up)
> > 6. once reconnected a single packet is seen with skb->users=2
> > 7. the kernel crash in pskb_expand_head at skbuff.c:1035
> >
> > [BBB55:744364] [<ffffffff812a2761>] ? __pskb_pull_tail+0x43x0x26f
> > [BB8S5.744395] [<ffffffff812d29Tb>] ? ip_check_defrag+ox3a/0x14a
> > [BBB55.744422] [<ffffffffB1344459>] ? packet_rcv_fanout+ox5e/oxf9
> > [BBBS5.7444S0] [<ffffffffB12aaS9b>] ? __netif_receive_skb+ox444/ox4f9
> > [BBB55.T4447B] [<ffffffffB12aa?e1>] ? netif_receive_skb+ox6d/0x?3
> > [BBB55.T4447B] [<ffffffffB12aa?e1>] ? ieee80211_deliver_skb+0xbd/0xfa [mac80211]
> > [BBB55.T4447B] [<ffffffffB12aa?e1>] ? ieee80211_rx_h_data+0x1e0/0x21a [mac80211]
> > [BBB55.T4447B] [<ffffffffB12aa?e1>] ? ieee80211_rx_handlers+0x3d5/0x480 [mac80211]
> > [BBB55.T4447B] [<ffffffffB12aa?e1>] ? __wake_up
> > [BBB55.T4447B] [<ffffffffB12aa?e1>] ? evdev_eventr+0xc0/0xcf [evdev]
> >
> > Signed-off-by: Eric Leblond <eric@regit.org>
>
> So if we look at ieee80211_deliver_skb(), it has code to deal with unaligned
> packet headers, wherein it memoves() the data into a better aligned location.
>
> But if these SKBs really are skb_shared(), this packet data
> modification is illegal.
>
> I suspect that the assumptions built into this unaligned data handling
> code, and AF_PACKET, are correct. Meaning that we should never see
> skb_shared() packets here. We just have a missing skb_copy()
> somewhere in mac80211, Johannes can you please take a look?
Here's some more info that may help people knowing the code. During my
test, I've removed the BUG() and replaced with a printk to have a living
kernel. Only one single shared skb was seen for each up event.
I've also add another oops in the same code:
[BBB55:744364] [<ffffffff812a2761>] ? __pskb_pull_tail+0x43x0x26f
[BB8S5.744395] [<ffffffff812d29Tb>] ? ip_check_defrag+ox3a/0x14a
[BBB55.744422] [<ffffffffB1344459>] ? packet_rcv_fanout+ox5e/oxf9
[BBBS5.7444S0] [<ffffffffB12aaS9b>] ? __netif_receive_skb+ox444/ox4f9
[BBB55.T4447B] [<ffffffffB12aa?e1>] ? netif_receive_skb+ox6d/0x?3
[BBB55.T4447B] [<ffffffffB12aa?e1>] ? ieee80211_deliver_skb+0xbd/0xfa [mac80211]
[BBB55.T4447B] [<ffffffffB12aa?e1>] ? ieee80211_rx_h_data+0x1e0/0x21a [mac80211]
[BBB55.T4447B] [<ffffffffB12aa?e1>] ? ieee80211_rx_handlers+0x3d5/0x480 [mac80211]
[BBB55.T4447B] [<ffffffffB12aa?e1>] ? _raw_spin_lock_irqsave+0x14/0x35
[BBB55.T4447B] [<ffffffffB12aa?e1>] ? ieee80211_prepare_and_rx_handle+0x5a3/0x5db [mac80211]
...
[BBB55.T4447B] [<ffffffffB12aa?e1>] ? ttwu_dowakeup+0x2d
Picture of the oops available here:
http://home.regit.org/~regit/wireless-oops.jpg
BR,
--
Eric Leblond <eric@regit.org>
Blog: https://home.regit.org/
^ permalink raw reply
* Re: [RFC PATCH] af_packet: don't to defrag shared skb
From: Johannes Berg @ 2012-12-07 20:42 UTC (permalink / raw)
To: David Miller; +Cc: eric, netdev, linux-wireless, linville
In-Reply-To: <20121207.153134.25835204617509469.davem@davemloft.net>
On Fri, 2012-12-07 at 15:31 -0500, David Miller wrote:
> From: Eric Leblond <eric@regit.org>
> Date: Fri, 7 Dec 2012 19:56:01 +0100
>
> Wireless folks, please take a look. The issue is that,
> under the circumstances listed below, we get SKBs in
> the AF_PACKET input path that are shared.
>
> Given the logic present in ieee80211_deliver_skb() I think
> the mac80211 code doesn't expect this either.
Indeed, it would certainly not like this, I'll take a look.
Eric, what's the driver you're using? I'm wondering whether paged skbs
vs. all data in the header would make a difference, hence the question.
johannes
^ permalink raw reply
* Re: [PATCH net-next v3 0/3] Multiqueue support in virtio-net
From: David Miller @ 2012-12-07 20:35 UTC (permalink / raw)
To: jasowang
Cc: krkumar2, kvm, mst, netdev, linux-kernel, virtualization,
bhutchings, jwhan, shiyer
In-Reply-To: <1354899897-10423-1-git-send-email-jasowang@redhat.com>
From: Jason Wang <jasowang@redhat.com>
Date: Sat, 8 Dec 2012 01:04:54 +0800
> This series is an update version (hope the final version) of multiqueue
> (VIRTIO_NET_F_MQ) support in virtio-net driver. All previous comments were
> addressed, the work were based on Krishna Kumar's work to let virtio-net use
> multiple rx/tx queues to do the packets reception and transmission. Performance
> test show the aggregate latency were increased greately but may get some
> regression in small packet transmission. Due to this, multiqueue were disabled
> by default. If user want to benefit form the multiqueue, ethtool -L could be
> used to enable the feature.
>
> Please review and comments.
>
> A protype implementation of qemu-kvm support could by found in
> git://github.com/jasowang/qemu-kvm-mq.git. To start a guest with two queues, you
> could specify the queues parameters to both tap and virtio-net like:
>
> ./qemu-kvm -netdev tap,queues=2,... -device virtio-net-pci,queues=2,...
>
> then enable the multiqueue through ethtool by:
>
> ethtool -L eth0 combined 2
It seems like most, if not all, of the feedback given for this series
has been addressed by Jason.
Can I get some ACKs?
Thanks.
^ permalink raw reply
* [GIT] Networking
From: David Miller @ 2012-12-07 20:35 UTC (permalink / raw)
To: torvalds; +Cc: akpm, netdev, linux-kernel
Two stragglers:
1) The new code that adds new flushing semantics to GRO can cause
SKB pointer list corruption, manage the lists differently to
avoid the OOPS. Fix from Eric Dumazet.
2) When TCP fast open does a retransmit of data in a SYN-ACK or
similar, we update retransmit state that we shouldn't triggering
a WARN_ON later. Fix from Yuchung Cheng.
Please pull, thanks a lot!
The following changes since commit 1afa471706963643ceeda7cbbe9c605a1e883d53:
Merge tag 'mmc-fixes-for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc (2012-12-07 09:15:20 -0800)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git master
for you to fetch changes up to c3c7c254b2e8cd99b0adf288c2a1bddacd7ba255:
net: gro: fix possible panic in skb_gro_receive() (2012-12-07 14:39:29 -0500)
----------------------------------------------------------------
Eric Dumazet (1):
net: gro: fix possible panic in skb_gro_receive()
Yuchung Cheng (1):
tcp: bug fix Fast Open client retransmission
include/linux/netdevice.h | 3 +++
include/net/tcp.h | 1 +
net/core/dev.c | 2 ++
net/core/skbuff.c | 6 +++---
net/ipv4/tcp_input.c | 6 +++++-
net/ipv4/tcp_output.c | 15 ++++++++++-----
6 files changed, 24 insertions(+), 9 deletions(-)
^ permalink raw reply
* Re: [RFC PATCH] af_packet: don't to defrag shared skb
From: David Miller @ 2012-12-07 20:31 UTC (permalink / raw)
To: eric; +Cc: netdev, linux-wireless, johannes, linville
In-Reply-To: <1354906561-4695-1-git-send-email-eric@regit.org>
From: Eric Leblond <eric@regit.org>
Date: Fri, 7 Dec 2012 19:56:01 +0100
Wireless folks, please take a look. The issue is that,
under the circumstances listed below, we get SKBs in
the AF_PACKET input path that are shared.
Given the logic present in ieee80211_deliver_skb() I think
the mac80211 code doesn't expect this either.
More commentary from me below:
> This patch is adding a check on skb before trying to defrag the
> packet for the hash computation in fanout mode. The goal of this
> patch is to avoid an kernel crash in pskb_expand_head.
> It appears that under some specific condition there is a shared
> skb reaching the defrag code and this lead to a crash due to the
> following code:
>
> if (skb_shared(skb))
> BUG();
>
> I've observed this crash under the following condition:
> 1. a program is listening to an wifi interface (let say wlan0)
> 2. it is using fanout capture in flow load balancing mode
> 3. defrag option is on on the fanout socket
> 4. the interface disconnect (radio down for example)
> 5. the interface reconnect (radio switched up)
> 6. once reconnected a single packet is seen with skb->users=2
> 7. the kernel crash in pskb_expand_head at skbuff.c:1035
>
> [BBB55:744364] [<ffffffff812a2761>] ? __pskb_pull_tail+0x43x0x26f
> [BB8S5.744395] [<ffffffff812d29Tb>] ? ip_check_defrag+ox3a/0x14a
> [BBB55.744422] [<ffffffffB1344459>] ? packet_rcv_fanout+ox5e/oxf9
> [BBBS5.7444S0] [<ffffffffB12aaS9b>] ? __netif_receive_skb+ox444/ox4f9
> [BBB55.T4447B] [<ffffffffB12aa?e1>] ? netif_receive_skb+ox6d/0x?3
> [BBB55.T4447B] [<ffffffffB12aa?e1>] ? ieee80211_deliver_skb+0xbd/0xfa [mac80211]
> [BBB55.T4447B] [<ffffffffB12aa?e1>] ? ieee80211_rx_h_data+0x1e0/0x21a [mac80211]
> [BBB55.T4447B] [<ffffffffB12aa?e1>] ? ieee80211_rx_handlers+0x3d5/0x480 [mac80211]
> [BBB55.T4447B] [<ffffffffB12aa?e1>] ? __wake_up
> [BBB55.T4447B] [<ffffffffB12aa?e1>] ? evdev_eventr+0xc0/0xcf [evdev]
>
> Signed-off-by: Eric Leblond <eric@regit.org>
So if we look at ieee80211_deliver_skb(), it has code to deal with unaligned
packet headers, wherein it memoves() the data into a better aligned location.
But if these SKBs really are skb_shared(), this packet data
modification is illegal.
I suspect that the assumptions built into this unaligned data handling
code, and AF_PACKET, are correct. Meaning that we should never see
skb_shared() packets here. We just have a missing skb_copy()
somewhere in mac80211, Johannes can you please take a look?
Thanks.
^ permalink raw reply
* Re: [PATCH v3 1/4] net: Add support for hardware-offloaded encapsulation
From: Joseph Gasparakis @ 2012-12-07 20:18 UTC (permalink / raw)
To: David Miller
Cc: joseph.gasparakis, bhutchings, alexander.h.duyck, shemminger,
chrisw, gospo, netdev, linux-kernel, dmitry, saeed.bishara,
peter.p.waskiewicz.jr
In-Reply-To: <20121207.145239.1408511750725975741.davem@davemloft.net>
On Fri, 7 Dec 2012, David Miller wrote:
> From: Joseph Gasparakis <joseph.gasparakis@intel.com>
> Date: Fri, 7 Dec 2012 11:52:43 -0800 (PST)
>
> >
> >
> > On Fri, 7 Dec 2012, David Miller wrote:
> >
> >> From: Joseph Gasparakis <joseph.gasparakis@intel.com>
> >> Date: Fri, 7 Dec 2012 11:41:46 -0800 (PST)
> >>
> >> >
> >> >
> >> > On Fri, 7 Dec 2012, David Miller wrote:
> >> >
> >> >> From: Joseph Gasparakis <joseph.gasparakis@intel.com>
> >> >> Date: Fri, 7 Dec 2012 10:24:17 -0800 (PST)
> >> >>
> >> >> > So the idea here is that the driver will use the headers for checksumming
> >> >> > if the skb->encapsulation bit is on. The bit should be set in the protocol
> >> >> > driver.
> >> >> >
> >> >> > To answer the second comment, the flags that we use in this series of
> >> >> > patches is NETIF_F_IP_CSUM, NETIF_F_IPV6_CSUM and NETIF_F_SG. These are
> >> >> > the bits that we propose will be used for checksumming of encapsulation.
> >> >> > As per a previous comment in v2, the hw_enc_features field should be used
> >> >> > also in the future when NICs have more encap offloads, so one could
> >> >> > indicate these features there from the driver.
> >> >> >
> >> >> > Furthermore, I submitted a patch for Rx checksumming, where NETIF_F_RXCSUM
> >> >> > is used, again in conjunction with skb->encapsulation flag. As I mention
> >> >> > in my logs, the driver is expected to set the ip_summed to UNNECESSARY and
> >> >> > turn the skb->encapsulation on, to indicate that the inner headers are
> >> >> > already HW checksummed.
> >> >> >
> >> >>
> >> >> This is the kind of language that belongs in the commit message and
> >> >> code comments.
> >> >>
> >> > Sure. I'll wait to gather some more feedback if there is any and I will
> >> > re-spin this off adding more code comments and clarify this in the logs.
> >>
> >> Great.
> >>
> >> Please note that this request applies to your receive side change too.
> >>
> > I believe the logs are sufficient in the rx patch:
> >
> > This patch adds capability in vxlan to identify received
> > checksummed inner packets and signal them to the upper layers of
> > the stack. The driver needs to set the skb->encapsulation bit
> > and also set the skb->ip_summed to CHECKSUM_UNNECESSARY.
> >
> > ...but I can add more comments to the code if this is what you are
> > referring to.
>
> Yes, please do.
>
I will! Thanks.
^ permalink raw reply
* Re: BUG: scheduling while atomic: ifup-bonding/3711/0x00000002 -- V3.6.7
From: Linda Walsh @ 2012-12-07 20:06 UTC (permalink / raw)
To: Jay Vosburgh; +Cc: Cong Wang, LKML, Linux Kernel Network Developers
In-Reply-To: <4913.1354154236@death.nxdomain>
Sorry for the delay.... my distro (Suse) has made rebooting my system
a chore (have to often boot from rescue to get it to come up because
they put mount libs in /usr/lib expecting they will always boot
from their ram disk -- preventing those of use who boot directly
from disk from doing so easily...grrr.
Jay Vosburgh wrote:
> The miimon functionality is used to check link state and notice
> when slaves lose carrier.
---
If I am running 'rr' on 2 channels -- specifically for the purpose
of link speed aggregation (getting 1 20Gb channel out of 2 10Gb channels)
I'm not sure I see how miimon would provide benefit. -- if 1 link dies,
the other, being on the same card is likely to be dead too, so would
it really serve a purpose?
> Running without it will not detect failure of
> the bonding slaves, which is likely not what you want. The mode,
> balance-rr in your case, is what selects the load balance to use, and is
> separate from the miimon.
>
----
Wouldn't the entire link die if a slave dies -- like RAID0, 1 disk
dies, the entire link goes down?
The other end (windows) doesn't dynamically config for a static-link
aggregation, so I don't think it would provide benefit.
> That said, the problem you're seeing appears to be caused by two
> things: bonding holds a lock (in addition to RTNL) when calling
> __ethtool_get_settings, and an ixgbe function in the call path to
> retrieve the settings, ixgbe_acquire_swfw_sync_X540, can sleep.
>
> The test patch above handles one case in bond_enslave, but there
> is another case in bond_miimon_commit when a slave changes link state
> from down to up, which will occur shortly after the slave is added.
>
----
Added your 2nd patch -- no more error messages...
however -- likely unrelated, the max speed read or write I am seeing
is about 500MB/s, and that is rare -- usually it's barely <3x a 1Gb
network speed. (119/125 MB R/W). I'm not at all sure it's really
combining the links
properly. Anyway to verify that?
On the windows side it shows the bond-link as a 20Gb connection, but
I don't see anyplace for something similar on linux.
^ permalink raw reply
* Re: [PATCH net-next 10/10] tipc: refactor accept() code for improved readability
From: Paul Gortmaker @ 2012-12-07 19:56 UTC (permalink / raw)
To: Neil Horman; +Cc: David Miller, netdev, Jon Maloy
In-Reply-To: <20121207194226.GB30339@hmsreliant.think-freely.org>
[Re: [PATCH net-next 10/10] tipc: refactor accept() code for improved readability] On 07/12/2012 (Fri 14:42) Neil Horman wrote:
> On Fri, Dec 07, 2012 at 09:28:18AM -0500, Paul Gortmaker wrote:
> > In TIPC's accept() routine, there is a large block of code relating
> > to initialization of a new socket, all within an if condition checking
> > if the allocation succeeded.
> >
> > Here, we simply factor out that init code within the accept() function
> > to its own separate function, which improves readability, and makes
> > it easier to track which lock_sock() calls are operating on existing
> > vs. new sockets.
> >
> > Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
> > ---
> > net/tipc/socket.c | 93 +++++++++++++++++++++++++++++--------------------------
> > 1 file changed, 49 insertions(+), 44 deletions(-)
> >
> > diff --git a/net/tipc/socket.c b/net/tipc/socket.c
> > index 38613cf..56661c8 100644
> > --- a/net/tipc/socket.c
> > +++ b/net/tipc/socket.c
> > @@ -1507,6 +1507,53 @@ static int listen(struct socket *sock, int len)
> > return res;
> > }
> >
> > +static void tipc_init_socket(struct sock *sk, struct socket *new_sock,
> > + struct sk_buff *buf)
> > +{
> Can you rename this to something more specific to its purpose? tipc_init_socket
> makes me wonder why you're not calling it internally from tipc_create. This
> seems more like a tipc_init_accept_sock, or some such. Alternatively, since
> you're ony using it from your accept call, you might consider not factoring it
> out at all, and just reversing the logic in your accept function so that you do:
>
> res = tipc_create(...)
> if (res)
> goto exit;
> <rest of tipc_init_socket goes here>
>
> That gives you the same level of readability, without the additional potential
> call instruction, plus you put the unlikely case inside the if conditional.
I'm inclined to do the latter and just flip the sense of the "if" since
I already scratched my head trying to think of a good name (and
apparently failed in the end).
Thanks,
Paul.
>
> Neil
>
^ permalink raw reply
* Re: [PATCH net-next 02/10] tipc: eliminate aggregate sk_receive_queue limit
From: Paul Gortmaker @ 2012-12-07 19:54 UTC (permalink / raw)
To: David Miller; +Cc: nhorman, netdev, jon.maloy, ying.xue
In-Reply-To: <20121207.123644.431593821406881255.davem@davemloft.net>
[Re: [PATCH net-next 02/10] tipc: eliminate aggregate sk_receive_queue limit] On 07/12/2012 (Fri 12:36) David Miller wrote:
> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Fri, 7 Dec 2012 11:07:33 -0500
>
> > On Fri, Dec 07, 2012 at 09:28:10AM -0500, Paul Gortmaker wrote:
> >> From: Ying Xue <ying.xue@windriver.com>
> >>
> >> As a complement to the per-socket sk_recv_queue limit, TIPC keeps a
> >> global atomic counter for the sum of sk_recv_queue sizes across all
> >> tipc sockets. When incremented, the counter is compared to an upper
> >> threshold value, and if this is reached, the message is rejected
> >> with error code TIPC_OVERLOAD.
> >>
> >> This check was originally meant to protect the node against
> >> buffer exhaustion and general CPU overload. However, all experience
> >> indicates that the feature not only is redundant on Linux, but even
> >> harmful. Users run into the limit very often, causing disturbances
> >> for their applications, while removing it seems to have no negative
> >> effects at all. We have also seen that overall performance is
> >> boosted significantly when this bottleneck is removed.
> >>
> >> Furthermore, we don't see any other network protocols maintaining
> >> such a mechanism, something strengthening our conviction that this
> >> control can be eliminated.
> >>
> >> Signed-off-by: Ying Xue <ying.xue@windriver.com>
> >> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
> >> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
> ...
> >> @@ -1241,11 +1241,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
> >> }
> >>
> >> /* Reject message if there isn't room to queue it */
> >> - recv_q_len = (u32)atomic_read(&tipc_queue_size);
> >> - if (unlikely(recv_q_len >= OVERLOAD_LIMIT_BASE)) {
> >> - if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE))
> >> - return TIPC_ERR_OVERLOAD;
> >> - }
> > If you're going to remove the one place that you read this variable, don't you
> > also want to remove the points where you increment/decrement the atomic as well,
> > and for that matter eliminate the definition itself?
>
> There's another reader, a getsockopt() call.
>
> I would just make it return zero or similar.
Updated commit below for reference, just to double check that I'm
doing what is requested properly.
Paul.
--
From 9da3d475874f4da49057767913af95ce01063ba3 Mon Sep 17 00:00:00 2001
From: Ying Xue <ying.xue@windriver.com>
Date: Tue, 27 Nov 2012 06:15:27 -0500
Subject: [PATCH] tipc: eliminate aggregate sk_receive_queue limit
As a complement to the per-socket sk_recv_queue limit, TIPC keeps a
global atomic counter for the sum of sk_recv_queue sizes across all
tipc sockets. When incremented, the counter is compared to an upper
threshold value, and if this is reached, the message is rejected
with error code TIPC_OVERLOAD.
This check was originally meant to protect the node against
buffer exhaustion and general CPU overload. However, all experience
indicates that the feature not only is redundant on Linux, but even
harmful. Users run into the limit very often, causing disturbances
for their applications, while removing it seems to have no negative
effects at all. We have also seen that overall performance is
boosted significantly when this bottleneck is removed.
Furthermore, we don't see any other network protocols maintaining
such a mechanism, something strengthening our conviction that this
control can be eliminated.
As a result, the atomic variable tipc_queue_size is now unused
and so it can be deleted. There is a getsockopt call that used
to allow reading it; we retain that but just return zero for
maximum compatibility.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
[PG: phase out tipc_queue_size as pointed out by Neil Horman]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 1a720c8..848be69 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -2,7 +2,7 @@
* net/tipc/socket.c: TIPC socket API
*
* Copyright (c) 2001-2007, Ericsson AB
- * Copyright (c) 2004-2008, 2010-2011, Wind River Systems
+ * Copyright (c) 2004-2008, 2010-2012, Wind River Systems
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
@@ -73,8 +73,6 @@ static struct proto tipc_proto;
static int sockets_enabled;
-static atomic_t tipc_queue_size = ATOMIC_INIT(0);
-
/*
* Revised TIPC socket locking policy:
*
@@ -128,7 +126,6 @@ static atomic_t tipc_queue_size = ATOMIC_INIT(0);
static void advance_rx_queue(struct sock *sk)
{
kfree_skb(__skb_dequeue(&sk->sk_receive_queue));
- atomic_dec(&tipc_queue_size);
}
/**
@@ -140,10 +137,8 @@ static void discard_rx_queue(struct sock *sk)
{
struct sk_buff *buf;
- while ((buf = __skb_dequeue(&sk->sk_receive_queue))) {
- atomic_dec(&tipc_queue_size);
+ while ((buf = __skb_dequeue(&sk->sk_receive_queue)))
kfree_skb(buf);
- }
}
/**
@@ -155,10 +150,8 @@ static void reject_rx_queue(struct sock *sk)
{
struct sk_buff *buf;
- while ((buf = __skb_dequeue(&sk->sk_receive_queue))) {
+ while ((buf = __skb_dequeue(&sk->sk_receive_queue)))
tipc_reject_msg(buf, TIPC_ERR_NO_PORT);
- atomic_dec(&tipc_queue_size);
- }
}
/**
@@ -280,7 +273,6 @@ static int release(struct socket *sock)
buf = __skb_dequeue(&sk->sk_receive_queue);
if (buf == NULL)
break;
- atomic_dec(&tipc_queue_size);
if (TIPC_SKB_CB(buf)->handle != 0)
kfree_skb(buf);
else {
@@ -1241,11 +1233,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
}
/* Reject message if there isn't room to queue it */
- recv_q_len = (u32)atomic_read(&tipc_queue_size);
- if (unlikely(recv_q_len >= OVERLOAD_LIMIT_BASE)) {
- if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE))
- return TIPC_ERR_OVERLOAD;
- }
recv_q_len = skb_queue_len(&sk->sk_receive_queue);
if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2))) {
if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE / 2))
@@ -1254,7 +1241,6 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
/* Enqueue message (finally!) */
TIPC_SKB_CB(buf)->handle = 0;
- atomic_inc(&tipc_queue_size);
__skb_queue_tail(&sk->sk_receive_queue, buf);
/* Initiate connection termination for an incoming 'FIN' */
@@ -1578,7 +1564,6 @@ restart:
/* Disconnect and send a 'FIN+' or 'FIN-' message to peer */
buf = __skb_dequeue(&sk->sk_receive_queue);
if (buf) {
- atomic_dec(&tipc_queue_size);
if (TIPC_SKB_CB(buf)->handle != 0) {
kfree_skb(buf);
goto restart;
@@ -1717,7 +1702,7 @@ static int getsockopt(struct socket *sock,
/* no need to set "res", since already 0 at this point */
break;
case TIPC_NODE_RECVQ_DEPTH:
- value = (u32)atomic_read(&tipc_queue_size);
+ value = 0; /* was tipc_queue_size, now obsolete */
break;
case TIPC_SOCK_RECVQ_DEPTH:
value = skb_queue_len(&sk->sk_receive_queue);
--
1.7.12.1
>
> Paul please do so and respin this series.
>
> Thanks.
^ permalink raw reply related
* Re: [PATCH v3 1/4] net: Add support for hardware-offloaded encapsulation
From: Joseph Gasparakis @ 2012-12-07 19:52 UTC (permalink / raw)
To: David Miller
Cc: joseph.gasparakis, bhutchings, alexander.h.duyck, shemminger,
chrisw, gospo, netdev, linux-kernel, dmitry, saeed.bishara,
peter.p.waskiewicz.jr
In-Reply-To: <20121207.143725.1501926484331042018.davem@davemloft.net>
On Fri, 7 Dec 2012, David Miller wrote:
> From: Joseph Gasparakis <joseph.gasparakis@intel.com>
> Date: Fri, 7 Dec 2012 11:41:46 -0800 (PST)
>
> >
> >
> > On Fri, 7 Dec 2012, David Miller wrote:
> >
> >> From: Joseph Gasparakis <joseph.gasparakis@intel.com>
> >> Date: Fri, 7 Dec 2012 10:24:17 -0800 (PST)
> >>
> >> > So the idea here is that the driver will use the headers for checksumming
> >> > if the skb->encapsulation bit is on. The bit should be set in the protocol
> >> > driver.
> >> >
> >> > To answer the second comment, the flags that we use in this series of
> >> > patches is NETIF_F_IP_CSUM, NETIF_F_IPV6_CSUM and NETIF_F_SG. These are
> >> > the bits that we propose will be used for checksumming of encapsulation.
> >> > As per a previous comment in v2, the hw_enc_features field should be used
> >> > also in the future when NICs have more encap offloads, so one could
> >> > indicate these features there from the driver.
> >> >
> >> > Furthermore, I submitted a patch for Rx checksumming, where NETIF_F_RXCSUM
> >> > is used, again in conjunction with skb->encapsulation flag. As I mention
> >> > in my logs, the driver is expected to set the ip_summed to UNNECESSARY and
> >> > turn the skb->encapsulation on, to indicate that the inner headers are
> >> > already HW checksummed.
> >> >
> >>
> >> This is the kind of language that belongs in the commit message and
> >> code comments.
> >>
> > Sure. I'll wait to gather some more feedback if there is any and I will
> > re-spin this off adding more code comments and clarify this in the logs.
>
> Great.
>
> Please note that this request applies to your receive side change too.
>
I believe the logs are sufficient in the rx patch:
This patch adds capability in vxlan to identify received
checksummed inner packets and signal them to the upper layers of
the stack. The driver needs to set the skb->encapsulation bit
and also set the skb->ip_summed to CHECKSUM_UNNECESSARY.
...but I can add more comments to the code if this is what you are
referring to.
^ permalink raw reply
* Re: [PATCH v3 1/4] net: Add support for hardware-offloaded encapsulation
From: David Miller @ 2012-12-07 19:52 UTC (permalink / raw)
To: joseph.gasparakis
Cc: bhutchings, alexander.h.duyck, shemminger, chrisw, gospo, netdev,
linux-kernel, dmitry, saeed.bishara, peter.p.waskiewicz.jr
In-Reply-To: <alpine.LFD.2.02.1212071149150.4985@morpheus.jf.intel.com>
From: Joseph Gasparakis <joseph.gasparakis@intel.com>
Date: Fri, 7 Dec 2012 11:52:43 -0800 (PST)
>
>
> On Fri, 7 Dec 2012, David Miller wrote:
>
>> From: Joseph Gasparakis <joseph.gasparakis@intel.com>
>> Date: Fri, 7 Dec 2012 11:41:46 -0800 (PST)
>>
>> >
>> >
>> > On Fri, 7 Dec 2012, David Miller wrote:
>> >
>> >> From: Joseph Gasparakis <joseph.gasparakis@intel.com>
>> >> Date: Fri, 7 Dec 2012 10:24:17 -0800 (PST)
>> >>
>> >> > So the idea here is that the driver will use the headers for checksumming
>> >> > if the skb->encapsulation bit is on. The bit should be set in the protocol
>> >> > driver.
>> >> >
>> >> > To answer the second comment, the flags that we use in this series of
>> >> > patches is NETIF_F_IP_CSUM, NETIF_F_IPV6_CSUM and NETIF_F_SG. These are
>> >> > the bits that we propose will be used for checksumming of encapsulation.
>> >> > As per a previous comment in v2, the hw_enc_features field should be used
>> >> > also in the future when NICs have more encap offloads, so one could
>> >> > indicate these features there from the driver.
>> >> >
>> >> > Furthermore, I submitted a patch for Rx checksumming, where NETIF_F_RXCSUM
>> >> > is used, again in conjunction with skb->encapsulation flag. As I mention
>> >> > in my logs, the driver is expected to set the ip_summed to UNNECESSARY and
>> >> > turn the skb->encapsulation on, to indicate that the inner headers are
>> >> > already HW checksummed.
>> >> >
>> >>
>> >> This is the kind of language that belongs in the commit message and
>> >> code comments.
>> >>
>> > Sure. I'll wait to gather some more feedback if there is any and I will
>> > re-spin this off adding more code comments and clarify this in the logs.
>>
>> Great.
>>
>> Please note that this request applies to your receive side change too.
>>
> I believe the logs are sufficient in the rx patch:
>
> This patch adds capability in vxlan to identify received
> checksummed inner packets and signal them to the upper layers of
> the stack. The driver needs to set the skb->encapsulation bit
> and also set the skb->ip_summed to CHECKSUM_UNNECESSARY.
>
> ...but I can add more comments to the code if this is what you are
> referring to.
Yes, please do.
^ permalink raw reply
* Re: [PATCH] fix initializer entry defined twice issue
From: Patrick Trantham @ 2012-12-07 19:51 UTC (permalink / raw)
To: Cong Ding
Cc: David S. Miller, Otavio Salvador, Javier Martinez Canillas,
Jiri Kosina, netdev, linux-kernel
In-Reply-To: <20121207000511.GD2510@gmail.com>
On 12/06/2012 06:05 PM, Cong Ding wrote:
> On Thu, Dec 06, 2012 at 05:59:21PM -0600, Patrick Trantham wrote:
>> On 12/06/2012 05:16 PM, Cong Ding wrote:
>>> the ".config_intr" is defined twice in both line 208 and 212.
>>>
>>> Signed-off-by: Cong Ding <dinggnu@gmail.com>
>>> ---
>>> drivers/net/phy/smsc.c | 1 -
>>> 1 files changed, 0 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/drivers/net/phy/smsc.c b/drivers/net/phy/smsc.c
>>> index 16dceed..5cee6bd 100644
>>> --- a/drivers/net/phy/smsc.c
>>> +++ b/drivers/net/phy/smsc.c
>>> @@ -205,7 +205,6 @@ static struct phy_driver smsc_phy_driver[] = {
>>> /* basic functions */
>>> .config_aneg = genphy_config_aneg,
>>> .read_status = lan87xx_read_status,
>>> - .config_intr = smsc_phy_config_intr,
>>> /* IRQ related */
>>> .ack_interrupt = smsc_phy_ack_interrupt,
>> Hi Cong,
>>
>> That looks like a mistake from my previous patch to that file. Copy
>> and paste fail.
>>
>> The line should read:
>> .config_init = smsc_phy_config_init,
>>
>> I can submit a patch once I get off work unless you beat me to it.
> Thanks for the note. It's better that you submit a patch together with your
> code.
>
> - cong
Patch submitted, thanks for pointing this out!
- Patrick
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox