Netdev List

* Re: Oops with latest (netfilter) nf-next tree, when unloading iptable_nat
From: Patrick McHardy @ 2012-09-20  7:31 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Pablo Neira Ayuso, Florian Westphal, netfilter-devel, netdev,
	yongjun_wei
In-Reply-To: <1348126142.2761.172.camel@localhost>

On Thu, 20 Sep 2012, Jesper Dangaard Brouer wrote:

> On Thu, 2012-09-20 at 08:57 +0200, Patrick McHardy wrote:
>> On Wed, 19 Sep 2012, Jesper Dangaard Brouer wrote:
>>
>>> On Fri, 2012-09-14 at 15:15 +0200, Patrick McHardy wrote:
>>>> On Fri, 14 Sep 2012, Pablo Neira Ayuso wrote:
>>>>
>>> [...cut...]
>>>>>> Patrick, any other idea?
>>>>>
>>> [...cut...]
> [... (hair)cut(?)...]
>
>>> No it does not work :-(
>>
>> Ok I think I understand the problem now, we're invoking the NAT cleanup
>> callback twice with clean->hash = true, once for each direction of the
>> conntrack.
>>
>> Does this patch fix the problem?
>
> Yes, it fixes the problem :-)
>
> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

Great, thanks for testing.

* Re: [RFC net-next] netpoll: use static branch
From: Cong Wang @ 2012-09-20  7:30 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, Eric Dumazet, netdev
In-Reply-To: <20120919130059.7b7ebc83@s6510.linuxnetplumber.net>

On Wed, 2012-09-19 at 13:00 -0700, Stephen Hemminger wrote:
> On Wed, 19 Sep 2012 12:50:10 +0800
> Cong Wang <amwang@redhat.com> wrote:
> 
> > On Tue, 2012-09-18 at 14:10 -0700, Stephen Hemminger wrote:
> > > This is an attempt to optimize netpoll when not used.
> > > 
> > > Since distros enable everything and netpoll is only occasionally
> > > used, improve performance by getting netpoll condition check
> > > out of the Rx fastpath.
> > > 
> > > Compile tested only, I have no real use for netpoll.
> > > 
> > > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> > > 
> > > 
> > > ---
> > >  include/linux/netpoll.h |   28 ++++++++++++++++++++--------
> > >  net/core/netpoll.c      |    8 +++++++-
> > >  2 files changed, 27 insertions(+), 9 deletions(-)
> > > 
> > > --- a/include/linux/netpoll.h	2012-09-18 13:25:15.575750004 -0700
> > > +++ b/include/linux/netpoll.h	2012-09-18 13:29:16.245323347 -0700
> > > @@ -66,10 +66,16 @@ static inline void netpoll_send_skb(stru
> > >  
> > > 
> > >  #ifdef CONFIG_NETPOLL
> > > +extern struct static_key netpoll_needed;
> > > +
> > >  static inline bool netpoll_rx_on(struct sk_buff *skb)
> > >  {
> > > -	struct netpoll_info *npinfo = rcu_dereference_bh(skb->dev->npinfo);
> > > +	struct netpoll_info *npinfo;
> > > +
> > > +	if (static_key_true(&netpoll_needed))
> > > +		return false;
> > >  
> > 
> > I think we should use static_key_false() here, as netpoll is an
> > "unlikely" code path.
> > 
> > Using static branch is a good idea though.
> > 
> > Thanks.
> 
> But static_key_true is just a wrapper around !static_key_false()

For !HAVE_JUMP_LABEL arch, the definition is below:

static __always_inline bool static_key_false(struct static_key *key)
{
        if (unlikely(atomic_read(&key->enabled)) > 0)
                return true;
        return false;
}       
        
static __always_inline bool static_key_true(struct static_key *key)
{       
        if (likely(atomic_read(&key->enabled)) > 0)
                return true;
        return false;
}
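
To make the comparison concrete, here is a minimal userspace sketch of the
two fallback definitions above. It is an illustration only: a plain int
stands in for atomic_t, __builtin_expect (GCC/Clang) stands in for
likely()/unlikely(), and the *_sketch names are made up. Under these
assumptions both helpers return the same truth value for a given key and
differ only in the branch hint, so in this fallback neither one is the
negation of the other.

```c
#include <stdbool.h>

/* Userspace stand-ins (assumptions, not the kernel macros themselves). */
#define likely_sketch(x)   __builtin_expect(!!(x), 1)
#define unlikely_sketch(x) __builtin_expect(!!(x), 0)

/* A plain int replaces atomic_t for the purpose of this sketch. */
struct static_key_sketch {
	int enabled;
};

/* Hints that the key is usually disabled. */
static bool key_false_sketch(const struct static_key_sketch *key)
{
	if (unlikely_sketch(key->enabled > 0))
		return true;
	return false;
}

/* Hints that the key is usually enabled; the result is the same. */
static bool key_true_sketch(const struct static_key_sketch *key)
{
	if (likely_sketch(key->enabled > 0))
		return true;
	return false;
}
```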

* Re: Oops with latest (netfilter) nf-next tree, when unloading iptable_nat
From: Jesper Dangaard Brouer @ 2012-09-20  7:29 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Pablo Neira Ayuso, Florian Westphal, netfilter-devel, netdev,
	yongjun_wei
In-Reply-To: <Pine.GSO.4.63.1209200855500.8409@stinky-local.trash.net>

On Thu, 2012-09-20 at 08:57 +0200, Patrick McHardy wrote:
> On Wed, 19 Sep 2012, Jesper Dangaard Brouer wrote:
> 
> > On Fri, 2012-09-14 at 15:15 +0200, Patrick McHardy wrote:
> >> On Fri, 14 Sep 2012, Pablo Neira Ayuso wrote:
> >>
> > [...cut...]
> >>>> Patrick, any other idea?
> >>>
> > [...cut...]
[... (hair)cut(?)...]

> > No it does not work :-(
> 
> Ok I think I understand the problem now, we're invoking the NAT cleanup
> callback twice with clean->hash = true, once for each direction of the
> conntrack.
> 
> Does this patch fix the problem?

Yes, it fixes the problem :-)

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

* Re: [PATCH 6/6] xfrm_user: don't copy esn replay window twice for new states
From: Steffen Klassert @ 2012-09-20  7:27 UTC (permalink / raw)
  To: Mathias Krause; +Cc: David S. Miller, netdev, linux-kernel
In-Reply-To: <1348090423-32665-7-git-send-email-minipli@googlemail.com>

On Wed, Sep 19, 2012 at 11:33:43PM +0200, Mathias Krause wrote:
> The ESN replay window was already fully initialized in
> xfrm_alloc_replay_state_esn(). No need to copy it again.
> 
> Cc: Steffen Klassert <steffen.klassert@secunet.com>
> Signed-off-by: Mathias Krause <minipli@googlemail.com>

Acked-by: Steffen Klassert <steffen.klassert@secunet.com>

* Re: [PATCH 4/6] xfrm_user: fix info leak in copy_to_user_tmpl()
From: Steffen Klassert @ 2012-09-20  7:26 UTC (permalink / raw)
  To: Mathias Krause; +Cc: David S. Miller, netdev, linux-kernel, Brad Spengler
In-Reply-To: <1348090423-32665-5-git-send-email-minipli@googlemail.com>

On Wed, Sep 19, 2012 at 11:33:41PM +0200, Mathias Krause wrote:
> The memory used for the template copy is a local stack variable. As
> struct xfrm_user_tmpl contains multiple holes added by the compiler for
> alignment, not initializing the memory will lead to leaking stack bytes
> to userland. Add an explicit memset(0) to avoid the info leak.
> 
> Initial version of the patch by Brad Spengler.
> 
> Cc: Brad Spengler <spender@grsecurity.net>
> Signed-off-by: Mathias Krause <minipli@googlemail.com>

Patches 1-4:

Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
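
For reference, a userspace sketch of the class of leak that patch 4
addresses. The struct and function names are hypothetical and are not the
xfrm ones; the layout assumes a typical ABI where unsigned int is 4-byte
aligned, so the compiler inserts padding after the leading byte. Zeroing
the whole object before filling the members keeps stale bytes out of the
padding.

```c
#include <string.h>

/* Hypothetical struct; on common ABIs 3 padding bytes follow 'family'. */
struct tmpl_sketch {
	unsigned char family;
	unsigned int  reqid;
};

/* Fill a template that will later be copied out byte for byte. */
static void fill_tmpl_sketch(struct tmpl_sketch *t, unsigned char family,
			     unsigned int reqid)
{
	memset(t, 0, sizeof(*t));	/* clears padding as well as members */
	t->family = family;
	t->reqid = reqid;
}
```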

* [PATCH net-next] net: qmi_wwan: adding Huawei E367, ZTE MF683 and Pantech P4200
From: Bjørn Mork @ 2012-09-20  7:18 UTC (permalink / raw)
  To: netdev
  Cc: linux-usb, Bjørn Mork, Fangxiaozhi (Franko),
	Thomas Schäfer, Dan Williams, Shawn J. Goff
In-Reply-To: <1348085016-30850-1-git-send-email-bjorn@mork.no>

One of the modes of Huawei E367 has this QMI/wwan interface:

 I:* If#= 1 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=01 Prot=07 Driver=(none)
 E:  Ad=83(I) Atr=03(Int.) MxPS=  64 Ivl=2ms
 E:  Ad=84(I) Atr=02(Bulk) MxPS= 512 Ivl=0ms
 E:  Ad=02(O) Atr=02(Bulk) MxPS= 512 Ivl=4ms

Huawei use subclass and protocol to identify vendor specific
functions, so adding a new vendor rule for this combination.

The Pantech devices UML290 (106c:3718) and P4200 (106c:3721) use
the same subclass to identify the QMI/wwan function.  Replace the
existing device specific UML290 entries with generic vendor matching,
adding support for the Pantech P4200.

The ZTE MF683 has 6 vendor specific interfaces, all using
ff/ff/ff for cls/sub/prot.  Adding a match on interface #5 which
is a QMI/wwan interface.
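
The two kinds of rules used here can be sketched in userspace as follows.
This is only an illustration with hypothetical names (intf_sketch,
match_vendor_intf_sketch, match_fixed_intf_sketch), not the usbcore
matching code: a vendor rule ignores the product ID and matches on vendor
plus class/subclass/protocol, while a fixed-interface rule matches vendor,
product and interface number.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, flattened view of one USB interface. */
struct intf_sketch {
	uint16_t vid, pid;
	uint8_t cls, sub, prot, ifnum;
};

/* Vendor rule: any product of this vendor with the given cls/sub/prot. */
static bool match_vendor_intf_sketch(const struct intf_sketch *i, uint16_t vid,
				     uint8_t cls, uint8_t sub, uint8_t prot)
{
	return i->vid == vid && i->cls == cls &&
	       i->sub == sub && i->prot == prot;
}

/* Fixed-interface rule: one specific interface of one specific device. */
static bool match_fixed_intf_sketch(const struct intf_sketch *i, uint16_t vid,
				    uint16_t pid, uint8_t ifnum)
{
	return i->vid == vid && i->pid == pid && i->ifnum == ifnum;
}
```

With the vendor rule, a P4200 (106c:3721) reporting ff/f0/ff matches without
its own entry, whereas the MF683 (19d2:0157) only matches on interface #5.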

Cc: Fangxiaozhi (Franko) <fangxiaozhi@huawei.com>
Cc: Thomas Schäfer <tschaefer@t-online.de>
Cc: Dan Williams <dcbw@redhat.com>
Cc: Shawn J. Goff <shawn7400@gmail.com>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
---
Hello David,

This is the net-next version of the previously posted patch with the
same title.  I totally forgot that I had messed with the .driver_info
fields, making it impossible to apply the same patch to both net and 
net-next.

Sorry about that.  Please apply this version to net-next.  The other
one should still be applied to net (if still possible) and stable.


Thanks,
Bjørn

 drivers/net/usb/qmi_wwan.c |   11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index e7b53f0..ca25320 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -353,16 +353,20 @@ static const struct usb_device_id products[] = {
 	},
 
 	/* 2. Combined interface devices matching on class+protocol */
+	{       /* Huawei E367 and possibly others in "Windows mode" */
+		USB_VENDOR_AND_INTERFACE_INFO(HUAWEI_VENDOR_ID, USB_CLASS_VENDOR_SPEC, 1, 7),
+		.driver_info        = (unsigned long)&qmi_wwan_info,
+	},
 	{	/* Huawei E392, E398 and possibly others in "Windows mode" */
 		USB_VENDOR_AND_INTERFACE_INFO(HUAWEI_VENDOR_ID, USB_CLASS_VENDOR_SPEC, 1, 17),
 		.driver_info        = (unsigned long)&qmi_wwan_info,
 	},
-	{	/* Pantech UML290 */
-		USB_DEVICE_AND_INTERFACE_INFO(0x106c, 0x3718, USB_CLASS_VENDOR_SPEC, 0xf0, 0xff),
+	{       /* Pantech UML290, P4200 and more */
+		USB_VENDOR_AND_INTERFACE_INFO(0x106c, USB_CLASS_VENDOR_SPEC, 0xf0, 0xff),
 		.driver_info        = (unsigned long)&qmi_wwan_info,
 	},
 	{	/* Pantech UML290 - newer firmware */
-		USB_DEVICE_AND_INTERFACE_INFO(0x106c, 0x3718, USB_CLASS_VENDOR_SPEC, 0xf1, 0xff),
+		USB_VENDOR_AND_INTERFACE_INFO(0x106c, USB_CLASS_VENDOR_SPEC, 0xf1, 0xff),
 		.driver_info        = (unsigned long)&qmi_wwan_info,
 	},
 
@@ -370,6 +374,7 @@ static const struct usb_device_id products[] = {
 	{QMI_FIXED_INTF(0x19d2, 0x0055, 1)},	/* ZTE (Vodafone) K3520-Z */
 	{QMI_FIXED_INTF(0x19d2, 0x0063, 4)},	/* ZTE (Vodafone) K3565-Z */
 	{QMI_FIXED_INTF(0x19d2, 0x0104, 4)},	/* ZTE (Vodafone) K4505-Z */
+	{QMI_FIXED_INTF(0x19d2, 0x0157, 5)},	/* ZTE MF683 */
 	{QMI_FIXED_INTF(0x19d2, 0x0167, 4)},	/* ZTE MF820D */
 	{QMI_FIXED_INTF(0x19d2, 0x0326, 4)},	/* ZTE MF821D */
 	{QMI_FIXED_INTF(0x19d2, 0x1008, 4)},	/* ZTE (Vodafone) K3570-Z */
-- 
1.7.10.4

* Re: [PATCH 5/6] xfrm_user: ensure user supplied esn replay window is valid
From: Mathias Krause @ 2012-09-20  7:13 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: David S. Miller, Steffen Klassert, netdev, linux-kernel,
	Martin Willi
In-Reply-To: <CA+rthh8Q464Jw5okH5aXds0QZztay9dpcyniahtWFxev8tpN9w@mail.gmail.com>

On Thu, Sep 20, 2012 at 8:12 AM, Mathias Krause <minipli@googlemail.com> wrote:
> What still might happen is the overflow in xfrm_replay_state_esn_len()
> resulting in a too small bitmap allocation for the requested replay
> size. But that gets caught in xfrm_init_replay(). A little late, but
> hey.

Sorry, I mixed that up. The replay_window check in xfrm_init_replay()
has little to do with the bmp_len overflow. But by changing the
return type of xfrm_replay_state_esn_len() to size_t, and thereby
making all the size comparisons operate on positive values, we'll
at least allocate enough memory to avoid memory corruption.
The replay window will be much smaller than requested, though, due
to the overflow. But userland should expect this. A check for some
upper limit in verify_replay() could catch this early.
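
A sketch of the wrap-around in question, modelling a 32-bit size_t with
uint32_t so the result does not depend on the build host. The struct is
assumed to mirror the fixed part of xfrm_replay_state_esn as six 32-bit
fields; that layout and the *_sketch names are assumptions made for
illustration.

```c
#include <stdint.h>

/* Assumed header layout: six 32-bit fields followed by the bitmap words. */
struct esn_hdr_sketch {
	uint32_t bmp_len, oseq, seq, oseq_hi, seq_hi, replay_window;
};

/* Length computation with all arithmetic done modulo 2^32. */
static uint32_t esn_len32_sketch(uint32_t bmp_len)
{
	return (uint32_t)sizeof(struct esn_hdr_sketch) +
	       bmp_len * (uint32_t)sizeof(uint32_t);
}
```

With bmp_len = 0x40000000 the product wraps to zero, so only the header
would be allocated; read as a signed 32-bit int, bmp_len = 0x20000000
yields a negative length.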

Mathias

* [PATCH] rds: Error on offset mismatch if not loopback
From: John Jolly @ 2012-09-20  7:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: Venkat Venkatsubra, netdev

Attempting an rds connection from the IP address of an IPoIB interface
to itself causes a kernel panic due to a BUG_ON() being triggered. Making
the test less strict allows rds-ping to work without crashing the machine.

A local unprivileged user could use this flaw to crash the system.
---
 net/rds/ib_send.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index e590949..7920c85 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -544,7 +544,7 @@ int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm,
 	int flow_controlled = 0;
 	int nr_sig = 0;
 
-	BUG_ON(off % RDS_FRAG_SIZE);
+	BUG_ON(!conn->c_loopback && off % RDS_FRAG_SIZE);
 	BUG_ON(hdr_off != 0 && hdr_off != sizeof(struct rds_header));
 
 	/* Do not send cong updates to IB loopback */
-- 
1.7.7

* RE: [PATCH net-next] mlx4: use dev_kfree_skb() instead of dev_kfree_skb_any()
From: Yevgeny Petrilin @ 2012-09-20  7:03 UTC (permalink / raw)
  To: Ying Cai, Eric Dumazet; +Cc: David Miller, netdev, Or Gerlitz
In-Reply-To: <CAL1qit_pkQ7YBzcAPMSWF4zeFS6yFj18P1yJMvj_Z-7SRMxeRg@mail.gmail.com>

Hi Ying,
 
> It seems all the TxQs are sharing the same interrupt for Tx
> completions. Will it be better to have separate interrupt per
> num_tx_rings_p_up (8) queues? E.g. for a 16 core system, with 16 * 8
> Tx queues, to have 16 interrupts for Tx completions of those 128 Tx
> queues?

Actually, not all TxQs share the same interrupt vector.
In commit 76532d0c we assigned an interrupt vector to each TX ring.
When the number of queues is higher than the number of interrupt vectors,
some queues share interrupts, which in effect gives the assignment you describe.

> 
> Also I'm looking at mlx4_en_select_queue(), it is using
> __skb_tx_hash(). Using something that achieves XPS may bring better
> performance.
> 

We are considering this change. 

* Re: [PATCH 5/6] xfrm_user: ensure user supplied esn replay window is valid
From: Steffen Klassert @ 2012-09-20  7:05 UTC (permalink / raw)
  To: Mathias Krause
  Cc: Ben Hutchings, David S. Miller, netdev, linux-kernel,
	Martin Willi
In-Reply-To: <CA+rthh8Q464Jw5okH5aXds0QZztay9dpcyniahtWFxev8tpN9w@mail.gmail.com>

On Thu, Sep 20, 2012 at 08:12:11AM +0200, Mathias Krause wrote:
> On Thu, Sep 20, 2012 at 12:38 AM, Ben Hutchings
> <bhutchings@solarflare.com> wrote:
> > On Wed, 2012-09-19 at 23:33 +0200, Mathias Krause wrote:
> 
> > I'm a little worried that the user-provided
> > xfrm_replay_state_esn::bmp_len is not being directly validated anywhere.
> 
> That's what my P.S. in the cover letter tried to hint at -- a missing
> upper limit check. But as I wanted to avoid lengthy discussions about
> the concrete value and the possible need for some sysctl knob to tune
> this even further, I just left this as an exercise for someone else
> who is more familiar with the code ;)
> 

I think we should limit bmp_len to some sane value. RFC 4303 recommends
an anti-replay window size of 64 packets, so limiting bmp_len to cover
4096 packets should be more than enough. We can also increase this value
later without changing the user API if needed.
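
Such a limit could look roughly like the sketch below. The constant name
and the helper are hypothetical; only the 4096-packet figure comes from the
suggestion above. Each bitmap word covers 32 packets, so the cap works out
to 128 words.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cap on the replay window, in packets. */
#define REPLAY_ESN_MAX_PACKETS_SKETCH 4096u

/* Accept bmp_len only if the bitmap covers at most the capped window. */
static bool bmp_len_ok_sketch(uint32_t bmp_len)
{
	const uint32_t bits_per_word = sizeof(uint32_t) * 8;

	return bmp_len <= REPLAY_ESN_MAX_PACKETS_SKETCH / bits_per_word;
}
```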

> > Currently xfrm_replay_state_esn_len() may overflow, and as its return
> > type is int it may unexpectedly return a negative value.
> 
> So xfrm_replay_state_esn_len() should return size_t instead as its
> value should always be positive -- it represents a length. Negative
> lengths make no sense. It can overflow, still. But it cannot get
> negative, at least. Still, the upper limit check would be required to
> avoid other user induced nastiness.
> 
> >
> > [...]
> >> --- a/net/xfrm/xfrm_user.c
> >> +++ b/net/xfrm/xfrm_user.c
> > [...]
> >> @@ -370,14 +378,15 @@ static inline int xfrm_replay_verify_len(struct xfrm_replay_state_esn *replay_es
> >>                                        struct nlattr *rp)
> >>  {
> >>       struct xfrm_replay_state_esn *up;
> >> +     size_t ulen;
> >
> > I would normally expect to see sizes declared as size_t but mixing
> > size_t and int in comparisons tends to result in bugs.  So I think this
> > should be int, matching the return types of nla_len() and
> > xfrm_replay_state_esn_len() (and apparently all lengths in netlink...)
> 
> I disagree. The value of nla_len() is ensured to be in the range of
> [sizeof(*up), USHRT_MAX-NLA_HDRLEN], i.e. a positive 16 bit number,
> when it passes nlmsg_parse() in xfrm_user_rcv_msg(). This in turn
> allows us to assume the int value returned by nla_len() is actually
> positive and the compiler can safely make it unsigned for the compare
> -- no sign bit, no hassle.

I think xfrm_replay_state_esn_len() should return the same type as
nla_len(), no matter what we can assume from the current code base.
It should also return the same type as the other xfrm length
calculation functions.

Once we have limited bmp_len, xfrm_replay_state_esn_len() should
always return a positive value.

* [PATCH] at91ether: return PTR_ERR if call to clk_get fails
From: Devendra Naga @ 2012-09-20  7:04 UTC (permalink / raw)
  To: Nicolas Ferre, netdev; +Cc: Devendra Naga

We currently return -ENODEV when clk_get() fails, but clk_get() may
encode a more specific error code in the pointer it returns. Assign that
code to res using PTR_ERR(), so that the subsequent goto jumps to the
error path, cleans up the driver, and returns the actual error.
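
For readers unfamiliar with the idiom, here is a userspace sketch of how an
error code travels inside a pointer. The lower-case helpers imitate the
kernel's ERR_PTR(), PTR_ERR() and IS_ERR() but are reimplemented here, and
fake_clk_get() is a hypothetical stand-in for clk_get().

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO_SKETCH 4095

/* Encode a negative errno value as a pointer, and decode it again. */
static void *err_ptr(long error) { return (void *)(intptr_t)error; }
static long ptr_err(const void *ptr) { return (long)(intptr_t)ptr; }

static bool is_err(const void *ptr)
{
	/* The top MAX_ERRNO_SKETCH addresses are reserved for error codes. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO_SKETCH;
}

static int dummy_clk;

/* Hypothetical clk_get(): fails with -err, or returns a valid object. */
static void *fake_clk_get(int err)
{
	return err ? err_ptr(-err) : (void *)&dummy_clk;
}

static long probe_sketch(int clk_err)
{
	void *clk = fake_clk_get(clk_err);

	if (is_err(clk))
		return ptr_err(clk);	/* propagate, rather than a fixed -ENODEV */
	return 0;
}
```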

Signed-off-by: Devendra Naga <devendra.aaru@gmail.com>
---
 drivers/net/ethernet/cadence/at91_ether.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/cadence/at91_ether.c b/drivers/net/ethernet/cadence/at91_ether.c
index 7788419..4e980a7 100644
--- a/drivers/net/ethernet/cadence/at91_ether.c
+++ b/drivers/net/ethernet/cadence/at91_ether.c
@@ -1086,7 +1086,7 @@ static int __init at91ether_probe(struct platform_device *pdev)
 	/* Clock */
 	lp->ether_clk = clk_get(&pdev->dev, "ether_clk");
 	if (IS_ERR(lp->ether_clk)) {
-		res = -ENODEV;
+		res = PTR_ERR(lp->ether_clk);
 		goto err_ioumap;
 	}
 	clk_enable(lp->ether_clk);
-- 
1.7.1

* Re: [PATCH net-next] core: adjust checks for calling skb_copy_bits in skb_try_coalesce
From: Eric Dumazet @ 2012-09-20  7:00 UTC (permalink / raw)
  To: RongQing Li; +Cc: netdev, edumazet
In-Reply-To: <CAJFZqHz5Ny8kbXk4arj-CEfQs+r28oGUQChNJ+0nSOTTg=c82A@mail.gmail.com>

On Thu, 2012-09-20 at 13:57 +0800, RongQing Li wrote:

> 
> I am wrong, thanks.

Thanks for taking a look at this stuff ;)

* Re: [PATCH net-next 06/11] sfc: Add support for IEEE-1588 PTP
From: Richard Cochran @ 2012-09-20  6:59 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: David Miller, netdev, linux-net-drivers, Rodolfo Giometti,
	Andrew Jackson
In-Reply-To: <1348082232.2636.21.camel@bwh-desktop.uk.solarflarecom.com>

On Wed, Sep 19, 2012 at 08:17:12PM +0100, Ben Hutchings wrote:
> From: Stuart Hodgson <smhodgson@solarflare.com>
> 
> Add PTP IEEE-1588 support and make it accessible via the PHC subsystem.
> 
> This work is based on prior code by Andrew Jackson
> 
> Signed-off-by: Stuart Hodgson <smhodgson@solarflare.com>
> [bwh:
>  - Add byte order conversion in efx_ptp_send_times()
>  - Simplify conversion of PPS event times
>  - Add the built-in vs module check to CONFIG_SFC_PTP dependencies]
> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>

Acked-by: Richard Cochran <richardcochran@gmail.com>

* Re: [PATCH net-next 01/11] pps/ptp: Allow PHC devices to adjust PPS events for known delay
From: Richard Cochran @ 2012-09-20  6:57 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: David Miller, netdev, linux-net-drivers, Rodolfo Giometti,
	Andrew Jackson
In-Reply-To: <1348082024.2636.16.camel@bwh-desktop.uk.solarflarecom.com>

On Wed, Sep 19, 2012 at 08:13:44PM +0100, Ben Hutchings wrote:
> Initial version by Stuart Hodgson <smhodgson@solarflare.com>
> 
> Some PHC device drivers may deliver PPS events with a significant
> and variable delay, but still be able to measure precisely what
> that delay is.
> 
> Add a pps_sub_ts() function for subtracting a delay from the
> timestamp(s) in a PPS event, and a PTP event type (PTP_CLOCK_PPSUSR)
> for which the caller provides a complete PPS event.
> 
> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>

Acked-by: Richard Cochran <richardcochran@gmail.com>

* Re: Oops with latest (netfilter) nf-next tree, when unloading iptable_nat
From: Patrick McHardy @ 2012-09-20  6:57 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Pablo Neira Ayuso, Florian Westphal, netfilter-devel, netdev,
	yongjun_wei
In-Reply-To: <1348058791.2761.94.camel@localhost>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1333 bytes --]

On Wed, 19 Sep 2012, Jesper Dangaard Brouer wrote:

> On Fri, 2012-09-14 at 15:15 +0200, Patrick McHardy wrote:
>> On Fri, 14 Sep 2012, Pablo Neira Ayuso wrote:
>>
> [...cut...]
>>>> Patrick, any other idea?
>>>
> [...cut...]
>>>>
>>> We can add nf_nat_iterate_cleanup that can iterate over the NAT
>>> hashtable to replace current usage of nf_ct_iterate_cleanup.
>>
>> Let's just bail out when IPS_SRC_NAT_DONE is not set, that should also fix
>> it. Could you try this patch please?
>
> On Fri, 2012-09-14 at 15:15 +0200, Patrick McHardy wrote:
>> diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c
>> index 29d4452..8b5d220 100644
>> --- a/net/netfilter/nf_nat_core.c
>> +++ b/net/netfilter/nf_nat_core.c
>> @@ -481,6 +481,8 @@ static int nf_nat_proto_clean(struct nf_conn *i, void *data)
>>
>>         if (!nat)
>>                 return 0;
>> +       if (!(i->status & IPS_SRC_NAT_DONE))
>> +               return 0;
>>         if ((clean->l3proto && nf_ct_l3num(i) != clean->l3proto) ||
>>             (clean->l4proto && nf_ct_protonum(i) != clean->l4proto))
>>                 return 0;
>>
>
> No it does not work :-(

Ok I think I understand the problem now, we're invoking the NAT cleanup
callback twice with clean->hash = true, once for each direction of the
conntrack.

Does this patch fix the problem?
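
The double invocation can be modelled in a few lines of userspace C. All
names are hypothetical; the only thing taken from the analysis above is
that each conntrack sits in the hash twice, once per direction, so an
iterator that does not filter on direction calls the cleanup callback twice
for the same conntrack.

```c
#include <stddef.h>

enum dir_sketch { DIR_ORIGINAL_SKETCH = 0, DIR_REPLY_SKETCH = 1 };

/* Hypothetical conntrack: just counts how often it was cleaned. */
struct conn_sketch {
	int cleanups;
};

/* One hash entry per direction, both pointing at the same conntrack. */
struct tuple_hash_sketch {
	struct conn_sketch *ct;
	enum dir_sketch dir;
};

/* Stand-in for the NAT cleanup callback. */
static void nat_clean_sketch(struct conn_sketch *ct)
{
	ct->cleanups++;
}

/* Walk the hash; optionally visit only the original-direction entries. */
static void iterate_sketch(struct tuple_hash_sketch *hash, size_t n,
			   int original_only)
{
	for (size_t i = 0; i < n; i++) {
		if (original_only && hash[i].dir != DIR_ORIGINAL_SKETCH)
			continue;
		nat_clean_sketch(hash[i].ct);
	}
}
```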

[-- Attachment #2: Type: TEXT/PLAIN, Size: 4200 bytes --]

commit 6c46a3bfb2776ca098565daf7e872a3283d14e0d
Author: Patrick McHardy <kaber@trash.net>
Date:   Thu Sep 20 08:43:02 2012 +0200

    netfilter: nf_nat: fix oops when unloading protocol modules
    
    When unloading a protocol module nf_ct_iterate_cleanup() is used to
    remove all conntracks using the protocol from the bysource hash and
    clean their NAT sections. Since the conntrack isn't actually killed,
    the NAT callback is invoked twice, once for each direction, which
    causes an oops when trying to delete it from the bysource hash for
    the second time.
    
    The same oops can also happen when removing both an L3 and L4 protocol
    since the cleanup function doesn't check whether the conntrack has
    already been cleaned up.
    
    Pid: 4052, comm: modprobe Not tainted 3.6.0-rc3-test-nat-unload-fix+ #32 Red Hat KVM
    RIP: 0010:[<ffffffffa002c303>]  [<ffffffffa002c303>] nf_nat_proto_clean+0x73/0xd0 [nf_nat]
    RSP: 0018:ffff88007808fe18  EFLAGS: 00010246
    RAX: 0000000000000000 RBX: ffff8800728550c0 RCX: ffff8800756288b0
    RDX: dead000000200200 RSI: ffff88007808fe88 RDI: ffffffffa002f208
    RBP: ffff88007808fe28 R08: ffff88007808e000 R09: 0000000000000000
    R10: dead000000200200 R11: dead000000100100 R12: ffffffff81c6dc00
    R13: ffff8800787582b8 R14: ffff880078758278 R15: ffff88007808fe88
    FS:  00007f515985d700(0000) GS:ffff88007cd00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007f515986a000 CR3: 000000007867a000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process modprobe (pid: 4052, threadinfo ffff88007808e000, task ffff8800756288b0)
    Stack:
     ffff88007808fe68 ffffffffa002c290 ffff88007808fe78 ffffffff815614e3
     ffffffff00000000 00000aeb00000246 ffff88007808fe68 ffffffff81c6dc00
     ffff88007808fe88 ffffffffa00358a0 0000000000000000 000000000040f5b0
    Call Trace:
     [<ffffffffa002c290>] ? nf_nat_net_exit+0x50/0x50 [nf_nat]
     [<ffffffff815614e3>] nf_ct_iterate_cleanup+0xc3/0x170
     [<ffffffffa002c55a>] nf_nat_l3proto_unregister+0x8a/0x100 [nf_nat]
     [<ffffffff812a0303>] ? compat_prepare_timeout+0x13/0xb0
     [<ffffffffa0035848>] nf_nat_l3proto_ipv4_exit+0x10/0x23 [nf_nat_ipv4]
     ...
    
    To fix this,
    
    - check whether the conntrack has already been cleaned up in
      nf_nat_proto_clean
    
    - change nf_ct_iterate_cleanup() to only invoke the callback function
      once for each conntrack (IP_CT_DIR_ORIGINAL).
    
    The second change doesn't affect other callers since when conntracks are
    actually killed, both directions are removed from the hash immediately
    and the callback is already only invoked once. If it is not killed, the
    second callback invocation will always return the same decision not to
    kill it.
    
    Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Signed-off-by: Patrick McHardy <kaber@trash.net>

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index dcb2791..0f241be 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1224,6 +1224,8 @@ get_next_corpse(struct net *net, int (*iter)(struct nf_conn *i, void *data),
 	spin_lock_bh(&nf_conntrack_lock);
 	for (; *bucket < net->ct.htable_size; (*bucket)++) {
 		hlist_nulls_for_each_entry(h, n, &net->ct.hash[*bucket], hnnode) {
+			if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL)
+				continue;
 			ct = nf_ct_tuplehash_to_ctrack(h);
 			if (iter(ct, data))
 				goto found;
diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c
index 1816ad3..65cf694 100644
--- a/net/netfilter/nf_nat_core.c
+++ b/net/netfilter/nf_nat_core.c
@@ -481,6 +481,8 @@ static int nf_nat_proto_clean(struct nf_conn *i, void *data)
 
 	if (!nat)
 		return 0;
+	if (!(i->status & IPS_SRC_NAT_DONE))
+		return 0;
 	if ((clean->l3proto && nf_ct_l3num(i) != clean->l3proto) ||
 	    (clean->l4proto && nf_ct_protonum(i) != clean->l4proto))
 		return 0;

* Re: [PATCH 2/2] Using LP firmware for taking advantage of the low-power capabilities.
From: Jarl Friis @ 2012-09-20  6:32 UTC (permalink / raw)
  To: Rafał Miłecki
  Cc: Stefano Brivio, Gábor Stefanik,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	b43-dev-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	netdev-u79uwXL29TY76Z2rM5mHXA, John W. Linville
In-Reply-To: <CACna6ryRXB6GjymaP-RM8HTEeWRHnE8B=5tdH3J4JGyeHBEE6Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

2012/9/20 Rafał Miłecki <zajec5-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>:
> Well, your BCM4322 is not LP-PHY card... does it really work better
> after that change, and wakes up faster? ;)

That was my un-scientific impression, yes (probably biased by
expectations, right? :-) ). But it could also be that I was using a
more upstream kernel than the stock ubuntu kernel. I don't know :-)

> BCM4313 is LCN-PHY, so your change won't affect it, not to mention it
> is not supported by b43.
>
> 14e4:4315 is really BCM4312 (LP-PHY) and maybe could be affected by
> this patch... but just in case of core 16. AFAIK our 14e4:4315 are
> usually devices with core rev 15.

Thanks for clearing that up...

Jarl

* [PATCH v2] xfrm_user: ensure user supplied esn replay window is valid
From: Mathias Krause @ 2012-09-20  6:22 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: David S. Miller, Steffen Klassert, netdev, linux-kernel,
	Mathias Krause, Martin Willi
In-Reply-To: <CA+rthh8Q464Jw5okH5aXds0QZztay9dpcyniahtWFxev8tpN9w@mail.gmail.com>

The current code fails to ensure that the netlink message actually
contains as many bytes as the header indicates. If a user creates a new
state or updates an existing one but does not supply the bytes for the
whole ESN replay window, the kernel copies random heap bytes into the
replay bitmap, namely the ones that happen to follow the XFRMA_REPLAY_ESN_VAL
netlink attribute. This leads to the following issues:

1. The replay window has random bits set, confusing the replay handling
   code later on.

2. A malicious user could use this flaw to leak up to ~3.5kB of heap
   memory when she has access to the XFRM netlink interface (requires
   CAP_NET_ADMIN).

Known users of the ESN replay window are strongSwan and Steffen's
iproute2 patch (<http://patchwork.ozlabs.org/patch/85962/>). The latter
uses the interface with a bitmap supplied while the former does not.
strongSwan is therefore prone to run into issue 1.

To fix both issues without breaking existing userland, allow using the
XFRMA_REPLAY_ESN_VAL netlink attribute with either an empty bitmap or a
fully specified one. For the former case we initialize the in-kernel
bitmap with zero, for the latter we copy the user supplied bitmap.

For state updates the full bitmap must be supplied.

While at it, fix xfrm_replay_state_esn_len() to return size_t instead of
int as it calculates a length and all users expect the return value to
be positive.
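
The accepted attribute lengths can be summarised in a small userspace
sketch. The helper names are hypothetical; hdr_len stands for the size of
the fixed header and full_len for header plus bitmap.

```c
#include <stdbool.h>
#include <stddef.h>

/* Accept either a fully specified bitmap or the bare header. */
static bool esn_attr_len_ok_sketch(size_t attr_len, size_t hdr_len,
				   size_t full_len)
{
	return attr_len >= full_len || attr_len == hdr_len;
}

/* Bytes to copy from the attribute; the rest stays zero-initialized. */
static size_t esn_copy_len_sketch(size_t attr_len, size_t hdr_len,
				  size_t full_len)
{
	return attr_len >= full_len ? full_len : hdr_len;
}
```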

Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Martin Willi <martin@revosec.ch>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
---
v2:
- compare against klen in xfrm_alloc_replay_state_esn (suggested by Ben)
- make xfrm_replay_state_esn_len() return size_t

 include/net/xfrm.h   |    4 ++--
 net/xfrm/xfrm_user.c |   27 +++++++++++++++++++++------
 2 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 639dd13..3f7eadd 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1621,9 +1621,9 @@ static inline int xfrm_alg_auth_len(const struct xfrm_algo_auth *alg)
 	return sizeof(*alg) + ((alg->alg_key_len + 7) / 8);
 }
 
-static inline int xfrm_replay_state_esn_len(struct xfrm_replay_state_esn *replay_esn)
+static inline size_t xfrm_replay_state_esn_len(struct xfrm_replay_state_esn *rs)
 {
-	return sizeof(*replay_esn) + replay_esn->bmp_len * sizeof(__u32);
+	return sizeof(*rs) + rs->bmp_len * sizeof(__u32);
 }
 
 #ifdef CONFIG_XFRM_MIGRATE
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 9f1e749..44c4b98 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -123,9 +123,17 @@ static inline int verify_replay(struct xfrm_usersa_info *p,
 				struct nlattr **attrs)
 {
 	struct nlattr *rt = attrs[XFRMA_REPLAY_ESN_VAL];
+	struct xfrm_replay_state_esn *rs;
 
-	if ((p->flags & XFRM_STATE_ESN) && !rt)
-		return -EINVAL;
+	if (p->flags & XFRM_STATE_ESN) {
+		if (!rt)
+			return -EINVAL;
+
+		rs = nla_data(rt);
+		if (nla_len(rt) < xfrm_replay_state_esn_len(rs) &&
+		    nla_len(rt) != sizeof(*rs))
+			return -EINVAL;
+	}
 
 	if (!rt)
 		return 0;
@@ -370,14 +378,15 @@ static inline int xfrm_replay_verify_len(struct xfrm_replay_state_esn *replay_es
 					 struct nlattr *rp)
 {
 	struct xfrm_replay_state_esn *up;
+	size_t ulen;
 
 	if (!replay_esn || !rp)
 		return 0;
 
 	up = nla_data(rp);
+	ulen = xfrm_replay_state_esn_len(up);
 
-	if (xfrm_replay_state_esn_len(replay_esn) !=
-			xfrm_replay_state_esn_len(up))
+	if (nla_len(rp) < ulen || xfrm_replay_state_esn_len(replay_esn) != ulen)
 		return -EINVAL;
 
 	return 0;
@@ -388,22 +397,28 @@ static int xfrm_alloc_replay_state_esn(struct xfrm_replay_state_esn **replay_esn
 				       struct nlattr *rta)
 {
 	struct xfrm_replay_state_esn *p, *pp, *up;
+	size_t klen, ulen;
 
 	if (!rta)
 		return 0;
 
 	up = nla_data(rta);
+	klen = xfrm_replay_state_esn_len(up);
+	ulen = nla_len(rta) >= klen ? klen : sizeof(*up);
 
-	p = kmemdup(up, xfrm_replay_state_esn_len(up), GFP_KERNEL);
+	p = kzalloc(klen, GFP_KERNEL);
 	if (!p)
 		return -ENOMEM;
 
-	pp = kmemdup(up, xfrm_replay_state_esn_len(up), GFP_KERNEL);
+	pp = kzalloc(klen, GFP_KERNEL);
 	if (!pp) {
 		kfree(p);
 		return -ENOMEM;
 	}
 
+	memcpy(p, up, ulen);
+	memcpy(pp, up, ulen);
+
 	*replay_esn = p;
 	*preplay_esn = pp;
 
-- 
1.7.10.4

* Re: [PATCH 5/6] xfrm_user: ensure user supplied esn replay window is valid
From: Mathias Krause @ 2012-09-20  6:12 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: David S. Miller, Steffen Klassert, netdev, linux-kernel,
	Martin Willi
In-Reply-To: <1348094309.2636.80.camel@bwh-desktop.uk.solarflarecom.com>

On Thu, Sep 20, 2012 at 12:38 AM, Ben Hutchings
<bhutchings@solarflare.com> wrote:
> On Wed, 2012-09-19 at 23:33 +0200, Mathias Krause wrote:
>> The current code fails to ensure that the netlink message actually
>> contains as many bytes as the header indicates. If a user creates a new
>> state or updates an existing one but does not supply the bytes for the
>> whole ESN replay window, the kernel copies random heap bytes into the
>> replay bitmap, the ones that happen to follow the XFRMA_REPLAY_ESN_VAL
>> netlink attribute. This leads to the following issues:
>>
>> 1. The replay window has random bits set confusing the replay handling
>>    code later on.
>>
>> 2. A malicious user could use this flaw to leak up to ~3.5kB of heap
>>    memory when she has access to the XFRM netlink interface (requires
>>    CAP_NET_ADMIN).
>
> Where does this limit come from?  Is that just the standard size netlink
> skb?

It's from the "nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_ATOMIC)" in
xfrm_state_netlink() which boils down to roughly 3.7k free space for
the netlink message. Excluding the space used for struct
xfrm_usersa_info and some minimal extensions leaves roughly 3.5k for
the replay window.
Maybe that's just another bug and the code should allocate a netlink
message big enough for the whole state dump. Don't know. I'm not
familiar with the code.

> I'm a little worried that the user-provided
> xfrm_replay_state_esn::bmp_len is not being directly validated anywhere.

That's what my P.S. in the cover letter tried to hint at -- a missing
upper limit check. But as I wanted to avoid lengthy discussions about
the concrete value and the possible need for some sysctl knob to tune
this even further, I just left this as an exercise for someone else
who is more familiar with the code ;)

> Currently xfrm_replay_state_esn_len() may overflow, and as its return
> type is int it may unexpectedly return a negative value.

So xfrm_replay_state_esn_len() should return size_t instead as its
value should always be positive -- it represents a length. Negative
lengths make no sense. It can overflow, still. But it cannot get
negative, at least. Still, the upper limit check would be required to
avoid other user induced nastiness.
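
A minimal userspace sketch of what such a bounded length helper could
look like. The struct is a stand-in for xfrm_replay_state_esn, and both
the helper name and the ESN_MAX_BMP_WORDS limit are hypothetical; the
thread deliberately leaves the concrete upper limit open.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's struct xfrm_replay_state_esn. */
struct esn_state {
	uint32_t bmp_len;	/* number of 32-bit words in bmp[] */
	uint32_t oseq, seq, oseq_hi, seq_hi, replay_window;
	uint32_t bmp[];
};

/* Hypothetical upper limit, chosen only for illustration. */
#define ESN_MAX_BMP_WORDS 128u

/*
 * Compute the total state length as a size_t. The user-supplied bmp_len
 * is rejected when it exceeds the limit, so the multiplication below
 * cannot wrap and the result can never be negative.
 */
static bool esn_len_checked(const struct esn_state *s, size_t *len)
{
	if (s->bmp_len > ESN_MAX_BMP_WORDS)
		return false;

	*len = sizeof(*s) + (size_t)s->bmp_len * sizeof(uint32_t);
	return true;
}
```

With a bound like this in place, callers can treat a false return the
same way as any other malformed attribute and bail out with -EINVAL.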

>
> [...]
>> --- a/net/xfrm/xfrm_user.c
>> +++ b/net/xfrm/xfrm_user.c
> [...]
>> @@ -370,14 +378,15 @@ static inline int xfrm_replay_verify_len(struct xfrm_replay_state_esn *replay_es
>>                                        struct nlattr *rp)
>>  {
>>       struct xfrm_replay_state_esn *up;
>> +     size_t ulen;
>
> I would normally expect to see sizes declared as size_t but mixing
> size_t and int in comparisons tends to result in bugs.  So I think this
> should be int, matching the return types of nla_len() and
> xfrm_replay_state_esn_len() (and apparently all lengths in netlink...)

I disagree. The value of nla_len() is ensured to be in the range of
[sizeof(*up), USHRT_MAX-NLA_HDRLEN], i.e. a positive 16 bit number,
when it passes nlmsg_parse() in xfrm_user_rcv_msg(). This in turn
allows us to assume the int value returned by nla_len() is actually
positive and the compiler can safely make it unsigned for the compare
-- no sign bit, no hassle.
What still might happen is the overflow in xfrm_replay_state_esn_len()
resulting in a too small bitmap allocation for the requested replay
size. But that gets caught in xfrm_init_replay(). A little late, but
hey.
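
The signed/unsigned argument can be illustrated with a small userspace
sketch. The function names are hypothetical. The explicit cast in the
first function mirrors the conversion the compiler applies implicitly
when an int is compared with a size_t.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Mixed comparison: the int is converted to size_t before comparing.
 * For a non-negative attr_len this is exact. A negative attr_len would
 * turn into a huge unsigned value and never look "too short".
 */
static bool too_short_mixed(int attr_len, size_t need)
{
	return (size_t)attr_len < need;
}

/* Defensive variant that does not rely on attr_len being non-negative. */
static bool too_short_checked(int attr_len, size_t need)
{
	return attr_len < 0 || (size_t)attr_len < need;
}
```

Both variants agree for every non-negative length, which is the only
range nla_len() can yield for an attribute that passed nlmsg_parse().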

>
>>       if (!replay_esn || !rp)
>>               return 0;
>>
>>       up = nla_data(rp);
>> +     ulen = xfrm_replay_state_esn_len(up);
>>
>> -     if (xfrm_replay_state_esn_len(replay_esn) !=
>> -                     xfrm_replay_state_esn_len(up))
>> +     if (nla_len(rp) < ulen || xfrm_replay_state_esn_len(replay_esn) != ulen)
>>               return -EINVAL;
>>
>>       return 0;
>> @@ -388,22 +397,28 @@ static int xfrm_alloc_replay_state_esn(struct xfrm_replay_state_esn **replay_esn
>>                                      struct nlattr *rta)
>>  {
>>       struct xfrm_replay_state_esn *p, *pp, *up;
>> +     size_t klen, ulen;
>
> Also int, for the same reason.

No, for the reason stated above. I'll fix up
xfrm_replay_state_esn_len() to return size_t instead.

>
>>       if (!rta)
>>               return 0;
>>
>>       up = nla_data(rta);
>> +     klen = xfrm_replay_state_esn_len(up);
>> +     ulen = nla_len(rta) > sizeof(*up) ? klen : sizeof(*up);
> [...]
>
> I understand that this is correct since verify_replay() previously
> checked that nla_len(rta) is either == sizeof(*up) or >= klen.  But
> would it not be more obviously correct to test nla_len(rta) >= klen?

It is. Comparing against klen makes the code more readable.
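
As a userspace sketch of the allocate-zeroed-then-copy pattern with the
klen comparison: calloc() stands in for kzalloc(), attr_len stands in
for nla_len(rta), and the struct and function names are hypothetical.
It assumes the caller has already verified that attr_len is either
sizeof(*up) or at least klen, as verify_replay() does.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Userspace stand-in for the kernel's struct xfrm_replay_state_esn. */
struct esn_state {
	uint32_t bmp_len;	/* number of 32-bit words in bmp[] */
	uint32_t oseq, seq, oseq_hi, seq_hi, replay_window;
	uint32_t bmp[];
};

static size_t esn_len(const struct esn_state *s)
{
	return sizeof(*s) + (size_t)s->bmp_len * sizeof(uint32_t);
}

/*
 * Duplicate a user-supplied state. Only the bytes the user actually sent
 * are copied; if the bitmap was omitted, it stays zeroed instead of being
 * filled with whatever follows the attribute in memory.
 */
static struct esn_state *esn_dup(const struct esn_state *up, size_t attr_len)
{
	size_t klen = esn_len(up);
	size_t ulen = attr_len >= klen ? klen : sizeof(*up);
	struct esn_state *p = calloc(1, klen);

	if (!p)
		return NULL;

	memcpy(p, up, ulen);
	return p;
}
```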

Thanks,
Mathias

>
> Ben.
>
> --
> Ben Hutchings, Staff Engineer, Solarflare
> Not speaking for my employer; that's the marketing department's job.
> They asked us to note that Solarflare product names are trademarked.
>

^ permalink raw reply

* Re: [PATCH 1/2] Added information about which firmware file is being requested.
From: Rafał Miłecki @ 2012-09-20  6:06 UTC (permalink / raw)
  To: Jarl Friis
  Cc: Michael Tokarev, Stefano Brivio, Gábor Stefanik,
	linux-wireless, b43-dev, netdev, John W. Linville
In-Reply-To: <CAOjsGA3BgzBn-YOEFYFs5HqLAh4_d46Juj1M-XYooqXeVtn9uw@mail.gmail.com>

2012/9/19 Jarl Friis <jarl@softace.dk>:
> 2012/9/19 Michael Tokarev <mjt@tls.msk.ru>:
>> On 19.09.2012 15:18, Jarl Friis wrote:
>>
>>> +     b43info(ctx->dev->wl, "Requesting firmware file '%s'\n", ctx->fwname);
>>>       err = request_firmware(&blob, ctx->fwname, ctx->dev->dev->dev);
>>
>> Hmm.  I wonder if this should be printed in request_firmware()
>> itself instead of in all callers?
>
> Now that you mention it, I also think that is a much better idea.
> However that would be a much more central place to do the change, so I
> would gladly see somebody else do that patch (in replacement of mine)

I agree, please submit a patch modifying request_firmware if you believe
it's important.

-- 
Rafał

^ permalink raw reply

* Re: [PATCH 2/2] Using LP firmware for taking advantage of the low-power capabilities.
From: Rafał Miłecki @ 2012-09-20  6:05 UTC (permalink / raw)
  To: Jarl Friis
  Cc: Stefano Brivio, Gábor Stefanik, linux-wireless, b43-dev,
	netdev, John W. Linville
In-Reply-To: <CAOjsGA2uvCxnX1PH85w2vO894kZT+af-Tgj7bV-3kw3ScxN10g@mail.gmail.com>

2012/9/19 Jarl Friis <jarl@softace.dk>:
> 2012/9/19 Jarl Friis <jarl@softace.dk>:
>> This is using the LP specific firmware to better take advantage of the
>> Low-Power capabilities.
>
> Gosh... I just realized that the code I introduced is completely
> untested. My hardware does not reach these pieces of code...
>
> Sorry...
>
> However the code seems natural (due to the firmware file name pattern)
> for a PHY-LP hardware, that is a bcm4313 chip (pciid 14e4:4315).

Well, your BCM4322 is not LP-PHY card... does it really work better
after that change, and wakes up faster? ;)

BCM4313 is LCN-PHY, so your change won't affect it, not to mention it
is not supported by b43.

14e4:4315 is really BCM4312 (LP-PHY) and maybe could be affected by
this patch... but just in case of core 16. AFAIK our 14e4:4315 are
usually devices with core rev 15.

-- 
Rafał

^ permalink raw reply

* Re: [PATCH net-next] core: adjust checks for calling skb_copy_bits in skb_try_coalesce
From: RongQing Li @ 2012-09-20  5:57 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, edumazet
In-Reply-To: <1348120438.31352.63.camel@edumazet-glaptop>

> Did you read skb_tailroom(to) definition by any chance ?
>
> static inline int skb_tailroom(const struct sk_buff *skb)
> {
>         return skb_is_nonlinear(skb) ? 0 : skb->end - skb->tail;
> }
>
> Current code is fine, because if @to is linear, it's ... linear.
>
>
>

I am wrong, thanks.

^ permalink raw reply

* Re: [PATCH net-next] core: adjust checks for calling skb_copy_bits in skb_try_coalesce
From: Eric Dumazet @ 2012-09-20  5:53 UTC (permalink / raw)
  To: RongQing Li; +Cc: netdev, edumazet
In-Reply-To: <CAJFZqHzTEwgJaKjvNLW3yYbMLUu0xm8tHwQ2e_TZJNdiTKOTFw@mail.gmail.com>

On Thu, 2012-09-20 at 13:40 +0800, RongQing Li wrote:
> >>               unsigned int offset;
> >
> > This is not needed at all.
> >
> >
> 
> I think the modification below may be needed:
> if (len <= skb_tailroom(to) && !skb_shinfo(to)->nr_frags) {
> ..
> }
> 
> First, skb A is added to the frags of skb TO because its len is larger
> than skb_tailroom(TO); then the len of a second skb B is less than
> skb_tailroom(TO), which will call skb_copy_bits.
> 
> Of course, this kind of case may only exist in my mind.
> 

Did you read skb_tailroom(to) definition by any chance ?

static inline int skb_tailroom(const struct sk_buff *skb)
{
        return skb_is_nonlinear(skb) ? 0 : skb->end - skb->tail;
}

Current code is fine, because if @to is linear, it's ... linear.
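
A toy userspace model of that argument. The struct and the toy_* names
are hypothetical and keep only the fields the check depends on. Once
any fragment has been attached, data_len is nonzero, the tailroom reads
as 0, and no positive len can pass the check, so an additional nr_frags
test adds nothing.

```c
#include <stdbool.h>

/* Toy stand-in for struct sk_buff with only the relevant fields. */
struct toy_skb {
	unsigned int data_len;	/* bytes held in page fragments */
	unsigned int tail;	/* end of data in the linear area */
	unsigned int end;	/* end of the linear buffer */
};

static bool toy_is_nonlinear(const struct toy_skb *skb)
{
	return skb->data_len != 0;
}

/* Same shape as skb_tailroom(): a nonlinear skb has no tailroom. */
static int toy_tailroom(const struct toy_skb *skb)
{
	return toy_is_nonlinear(skb) ? 0 : (int)(skb->end - skb->tail);
}

/* The copy path is taken only when len fits into the linear tailroom. */
static bool toy_can_copy(const struct toy_skb *to, unsigned int len)
{
	return len <= (unsigned int)toy_tailroom(to);
}
```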

^ permalink raw reply

* Re: [PATCH net-next] core: adjust checks for calling skb_copy_bits in skb_try_coalesce
From: RongQing Li @ 2012-09-20  5:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, edumazet
In-Reply-To: <1348118417.31352.52.camel@edumazet-glaptop>

>>               unsigned int offset;
>
> This is not needed at all.
>
>

I think the modification below may be needed:
if (len <= skb_tailroom(to) && !skb_shinfo(to)->nr_frags) {
..
}

First, skb A is added to the frags of skb TO because its len is larger
than skb_tailroom(TO); then the len of a second skb B is less than
skb_tailroom(TO), which will call skb_copy_bits.

Of course, this kind of case may only exist in my mind.

-Roy

^ permalink raw reply

* Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
From: Eric Dumazet @ 2012-09-20  5:37 UTC (permalink / raw)
  To: Vijay Subramanian; +Cc: David Miller, netdev
In-Reply-To: <CAGK4HS_ZKKrzFh+c3WDxbkMFpDTg7ehw9ydp39k6JPFA_j1YOw@mail.gmail.com>

On Wed, 2012-09-19 at 15:20 -0700, Vijay Subramanian wrote:
> > I did some tests and got no problem so far, even using splice() [ this
> > one was tricky because it only deals with order-0 pages at this moment ]
> >
> > NIC tested : ixgbe, igb, bnx2x, tg3, mellanox mlx4
> 
> 
> I applied this patch to net-next and tested with e1000e driver.
> With iperf I got around 8 % improvement on loopback.
> 
> Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
> 
> 
> Vijay

If you keep the producer and consumer on separate cpus, and use large
enough send() (64KB or 128KB), gain is more like 15 or 20%

iperf uses 8KB writes, while netperf uses a 16KB default.

TCP stack has a problem because /proc/sys/net/ipv4/tcp_reordering
default value (3) is too small for loopback, since a packet contains 4
MSS

A single reorder and some packets are retransmitted.

The following setting is better:

echo 16 >/proc/sys/net/ipv4/tcp_reordering

loopback is lossless, so it's always surprising we can have TCP
retransmits on this medium ;)

Thanks

^ permalink raw reply

* [PATCH net-next v2] ipv6: remove unnecessary call rt6_clean_expires
From: roy.qing.li @ 2012-09-20  5:31 UTC (permalink / raw)
  To: netdev, gaofeng

From: Li RongQing <roy.qing.li@gmail.com>

The "from" field of dst_entry and the rt6i_flags field of rt6_info have
already been zeroed in ip6_blackhole_route or after calling
ip6_dst_alloc, so it is unnecessary to call rt6_clean_expires again.

Cc: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
---
 net/ipv6/route.c |    9 ++-------
 1 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 0607ee3..fd5dabf 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -996,8 +996,7 @@ struct dst_entry *ip6_blackhole_route(struct net *net, struct dst_entry *dst_ori
 			in6_dev_hold(rt->rt6i_idev);
 
 		rt->rt6i_gateway = ort->rt6i_gateway;
-		rt->rt6i_flags = ort->rt6i_flags;
-		rt6_clean_expires(rt);
+		rt->rt6i_flags = ort->rt6i_flags & ~RTF_EXPIRES;
 		rt->rt6i_metric = 0;
 
 		memcpy(&rt->rt6i_dst, &ort->rt6i_dst, sizeof(struct rt6key));
@@ -1393,8 +1392,6 @@ int ip6_route_add(struct fib6_config *cfg)
 	if (cfg->fc_flags & RTF_EXPIRES)
 		rt6_set_expires(rt, jiffies +
 				clock_t_to_jiffies(cfg->fc_expires));
-	else
-		rt6_clean_expires(rt);
 
 	if (cfg->fc_protocol == RTPROT_UNSPEC)
 		cfg->fc_protocol = RTPROT_BOOT;
@@ -1803,12 +1800,10 @@ static struct rt6_info *ip6_rt_copy(struct rt6_info *ort,
 		rt->dst.lastuse = jiffies;
 
 		rt->rt6i_gateway = ort->rt6i_gateway;
-		rt->rt6i_flags = ort->rt6i_flags;
+		rt->rt6i_flags = ort->rt6i_flags & ~RTF_EXPIRES;
 		if ((ort->rt6i_flags & (RTF_DEFAULT | RTF_ADDRCONF)) ==
 		    (RTF_DEFAULT | RTF_ADDRCONF))
 			rt6_set_from(rt, ort);
-		else
-			rt6_clean_expires(rt);
 		rt->rt6i_metric = 0;
 
 #ifdef CONFIG_IPV6_SUBTREES
-- 
1.7.4.1

^ permalink raw reply related
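
The flag handling in the patch above reduces to masking one bit while
copying. A minimal userspace sketch, assuming the flag values from the
kernel's ipv6_route.h; the helper name is hypothetical:

```c
#include <stdint.h>

/* Values as defined in the kernel's ipv6_route.h header. */
#define RTF_ADDRCONF	0x00040000
#define RTF_EXPIRES	0x00400000

/*
 * Copy route flags from the original route, dropping RTF_EXPIRES in the
 * same step, so that no separate clearing call is needed afterwards.
 */
static uint32_t copy_flags_without_expires(uint32_t ort_flags)
{
	return ort_flags & ~(uint32_t)RTF_EXPIRES;
}
```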
