Netdev List

Netdev List
 help / color / mirror / Atom feed

* RE: [PATCH 19/26] drivers: isdn: Move prototype declaration to header file platform.h from diva_didd.c
From: David Laight @ 2014-02-07 13:33 UTC (permalink / raw)
  To: 'Rashika Kheria', linux-kernel@vger.kernel.org
  Cc: Armin Schindler, Karsten Keil, netdev@vger.kernel.org,
	josh@joshtriplett.org
In-Reply-To: <93d7188538418423685ecefaf3d2bb311b37be6e.1390408518.git.rashika.kheria@gmail.com>

From: Rashika Kheria
> Move prototype declarations of function to header file
> hardware/eicon/platform.h because they are used by more than one file.
> 
> This eliminates the following warnings in hardware/eicon/diddfunc.c:
> drivers/isdn/hardware/eicon/diddfunc.c:95:12: warning: no previous prototype for diddfunc_init [-
> Wmissing-prototypes]
> drivers/isdn/hardware/eicon/diddfunc.c:110:13: warning: no previous prototype for diddfunc_finit [-
> Wmissing-prototypes]
...
> diff --git a/drivers/isdn/hardware/eicon/diva_didd.c b/drivers/isdn/hardware/eicon/diva_didd.c
> index fab6ccf..56d32a7 100644
> --- a/drivers/isdn/hardware/eicon/diva_didd.c
> +++ b/drivers/isdn/hardware/eicon/diva_didd.c
> @@ -39,9 +39,6 @@ MODULE_LICENSE("GPL");
>  #define DBG_MINIMUM  (DL_LOG + DL_FTL + DL_ERR)
>  #define DBG_DEFAULT  (DBG_MINIMUM + DL_XLOG + DL_REG)
> 
> -extern int diddfunc_init(void);
> -extern void diddfunc_finit(void);
> -
>  extern void DIVA_DIDD_Read(void *, int);

You should move that one as well.
There really shouldn't be 'extern' definitions for any function in
any C files since you want the compiler to check they are correct
when the function itself is compiled.

	David


^ permalink raw reply

* Re: [PATCH] net: asix: fix bad header length bug
From: Emil Goode @ 2014-02-07 13:53 UTC (permalink / raw)
  To: Bjørn Mork
  Cc: David Laight, 'Igor Gnatenko', David S. Miller, Ming Lei,
	Mark Brown, Jeff Kirsher, Glen Turner, linux-usb@vger.kernel.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <87wqh7i7pf.fsf@nemi.mork.no>

On Fri, Feb 07, 2014 at 10:38:04AM +0100, Bjørn Mork wrote:
> Emil Goode <emilgoode@gmail.com> writes:
> > On Thu, Feb 06, 2014 at 03:28:13PM +0000, David Laight wrote:
> >> From: Igor Gnatenko
> >> > On Thu, 2014-02-06 at 13:56 +0100, Emil Goode wrote:
> >> > > The AX88772B occasionally send rx packets that cross urb boundaries
> >> > > and the remaining partial packet is sent with no header.
> >> > > When the buffer with a partial packet is of less number of octets
> >> > > than the value of hard_header_len the buffer is discarded by the
> >> > > usbnet module. This is causing dropped packages and error messages
> >> > > in dmesg.
> 
> > I will do some more digging in the code, but the test of skb->len
> > against hard_header_len is done already in the completion callback
> > function passed to usb_fill_bulk_urb so it seems that buffers of less
> > than hard_header_len number of octets will be dropped regardless.
> 
> I am pretty sure you are right about this bug. And the exact same
> solution is already used by the cx82310_eth minidriver, so I don't see
> the problem.  Your fix is fine IMHO.  But you should apply it to all the
> devices using asix_rx_fixup_common(), not just the ax88772 ones.
> 
> You could maybe make this a usbnet flag instead and create a generic
> solution in usbnet, but frankly I believe the number of flags and their
> meaning have exceeded drivers authors capabilities a long time ago.  At
> least mine, which are quite limited ;-)
> 
> An example of that problem is another bloody obvious bug I noticed while
> looking at this driver: The 'struct driver_info ax88178_info' points to
> asix_rx_fixup_common without setting the FLAG_MULTI_PACKET.  This will
> result in usbnet rx_process() calling usbnet_skb_return() on skbs which
> are already consumed by the minidriver.  Not a big problem, but will
> give some odd results.  But if you allow skbs shorter than ETH_HLEN to
> slip through then it might go boom, so you should probably fix that as
> well.
> 
> 
> Bjørn

Yes I believe the patch is necessary, but maybe it would be nice with
a prettier solution rather than setting hard_header_len to 0 for all
devices with this behaviour. Perhaps it would be better to let each
driver that uses the usbnet module decide what skbs to discard?

What David describes seems to be another bug, but I don't think it is
related to this patch as I'm able to reproduce the bug without the patch
beeing applied by setting the mtu to pretty much any value other than 
1500 and using ping with a larger packet size than that mtu value.

Best regards,

Emil Goode

^ permalink raw reply

* RE: [PATCH] net: asix: fix bad header length bug
From: David Laight @ 2014-02-07 14:40 UTC (permalink / raw)
  To: 'Emil Goode', Bjørn Mork
  Cc: 'Igor Gnatenko', David S. Miller, Ming Lei, Mark Brown,
	Jeff Kirsher, Glen Turner, linux-usb@vger.kernel.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20140207135320.GA4252@lianli>

From: Emil Goode 
> On Fri, Feb 07, 2014 at 10:38:04AM +0100, Bjørn Mork wrote:
> > Emil Goode <emilgoode@gmail.com> writes:
> > > On Thu, Feb 06, 2014 at 03:28:13PM +0000, David Laight wrote:
> > >> From: Igor Gnatenko
> > >> > On Thu, 2014-02-06 at 13:56 +0100, Emil Goode wrote:
> > >> > > The AX88772B occasionally send rx packets that cross urb boundaries
> > >> > > and the remaining partial packet is sent with no header.
> > >> > > When the buffer with a partial packet is of less number of octets
> > >> > > than the value of hard_header_len the buffer is discarded by the
> > >> > > usbnet module. This is causing dropped packages and error messages
> > >> > > in dmesg.
> >
> > > I will do some more digging in the code, but the test of skb->len
> > > against hard_header_len is done already in the completion callback
> > > function passed to usb_fill_bulk_urb so it seems that buffers of less
> > > than hard_header_len number of octets will be dropped regardless.
> >
> > I am pretty sure you are right about this bug. And the exact same
> > solution is already used by the cx82310_eth minidriver, so I don't see
> > the problem.  Your fix is fine IMHO.  But you should apply it to all the
> > devices using asix_rx_fixup_common(), not just the ax88772 ones.
> >
> > You could maybe make this a usbnet flag instead and create a generic
> > solution in usbnet, but frankly I believe the number of flags and their
> > meaning have exceeded drivers authors capabilities a long time ago.  At
> > least mine, which are quite limited ;-)
> >
> > An example of that problem is another bloody obvious bug I noticed while
> > looking at this driver: The 'struct driver_info ax88178_info' points to
> > asix_rx_fixup_common without setting the FLAG_MULTI_PACKET.  This will
> > result in usbnet rx_process() calling usbnet_skb_return() on skbs which
> > are already consumed by the minidriver.  Not a big problem, but will
> > give some odd results.  But if you allow skbs shorter than ETH_HLEN to
> > slip through then it might go boom, so you should probably fix that as
> > well.
> >
> >
> > Bjørn
> 
> Yes I believe the patch is necessary, but maybe it would be nice with
> a prettier solution rather than setting hard_header_len to 0 for all
> devices with this behaviour. Perhaps it would be better to let each
> driver that uses the usbnet module decide what skbs to discard?
> 
> What David describes seems to be another bug, but I don't think it is
> related to this patch as I'm able to reproduce the bug without the patch
> beeing applied by setting the mtu to pretty much any value other than
> 1500 and using ping with a larger packet size than that mtu value.

Yes - plenty of bugs if you just look for them!

I did a quick scan through the sub-drivers and although the usbnet code
seems to treat the 'hard_header_len' as a constant to add to the mtu when
allocating rx urb (when the driver doesn't set rx_urb_len), some of the
sub-drivers seem to have three length, the rx header, tx header and hard_header,
and set them separately (I've not just rechecked) - which may not exactly
match what the usbnet code does is the lengths are different.

The ax88772b driver seems to support several different bits of silicon.
Only some put multiple ethernet frames in a single urb, and only for these
does the driver set the rx_urb_length to 2048.
For the other silicon it relies on usbnet setting the rx urb size - so
the hard_header_len better not be set to zero.

Someone with some time to spare needs to modify usbnet to support page
aligned rx buffers (probably 4k urb) and then build correctly formatted
skb from them.

At the moment the ax179_178a driver allocates 20kB urb which end up
with an 0x40 byte offset into the page (so are probably 24k) and
then cause alignment issues in the xhci driver which currently doesn't
correctly handle non-aligned 64k address boundaries when the cross the
ring end.

	David


^ permalink raw reply

* nonagle flags for TSQ
From: John Ogness @ 2014-02-07 15:08 UTC (permalink / raw)
  To: netdev; +Cc: Eric Dumazet

Hi,

This email is referring to your Linux patch
46d3ceabd8d98ed0ad10f20c595ca784e34786c5.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=46d3ceabd8d98ed0ad10f20c595ca784e34786c5

I have a question about the use of tcp_write_xmit() in
net/ipv4/tcp_output.c

When tcp_write_xmit() is called, the nonagle flag of the tcp socket is
ignored and instead 0 is passed. This causes the Nagle-algorithm to be
used even if it should not be, which in some cases causes a large delay.

Was there a reason that 0 was hard-coded?

Although current mainline code has been refactored, 0 is still
hard-coded for TSQ cases.

John Ogness

^ permalink raw reply

* Re: nonagle flags for TSQ
From: Eric Dumazet @ 2014-02-07 15:34 UTC (permalink / raw)
  To: John Ogness; +Cc: netdev
In-Reply-To: <87fvnv2c5k.fsf@linutronix.de>

On Fri, 2014-02-07 at 16:08 +0100, John Ogness wrote:
> Hi,
> 
> This email is referring to your Linux patch
> 46d3ceabd8d98ed0ad10f20c595ca784e34786c5.
> 
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=46d3ceabd8d98ed0ad10f20c595ca784e34786c5
> 
> I have a question about the use of tcp_write_xmit() in
> net/ipv4/tcp_output.c
> 
> When tcp_write_xmit() is called, the nonagle flag of the tcp socket is
> ignored and instead 0 is passed. This causes the Nagle-algorithm to be
> used even if it should not be, which in some cases causes a large delay.
> 
> Was there a reason that 0 was hard-coded?
> 
> Although current mainline code has been refactored, 0 is still
> hard-coded for TSQ cases.

Hi John

Do you have any data, like exact kernel version you use, tcpdump or
things like that ?

When the TCP writes are throttled, its only up to the point next packet
is TX completed, and only if you have at least 128KB worth of bytes
consumed in the QDISC/NIC layers for this socket.

We had some issues at very high speeds, not related to Nagle at all.

98e09386c0ef tcp: tsq: restore minimal amount of queueing
c9eeec26e32e tcp: TSQ can use a dynamic limit
d6a4a1041176 tcp: GSO should be TSQ friendly
d01cb20711e3 tcp: add LAST_ACK as a valid state for TSQ

I am not aware of TSQ being a problem for Nagle.

Also take a look at recent TCP autocork patches, as they are more
related to Nagle

a181ceb501b3 tcp: autocork should not hold first packet in write queue
f54b311142a9 tcp: auto corking

Thanks

^ permalink raw reply

* Re: nonagle flags for TSQ
From: Eric Dumazet @ 2014-02-07 15:58 UTC (permalink / raw)
  To: John Ogness; +Cc: netdev
In-Reply-To: <1391787297.10160.50.camel@edumazet-glaptop2.roam.corp.google.com>

On Fri, 2014-02-07 at 07:34 -0800, Eric Dumazet wrote:
> On Fri, 2014-02-07 at 16:08 +0100, John Ogness wrote:
> > Hi,
> > 
> > This email is referring to your Linux patch
> > 46d3ceabd8d98ed0ad10f20c595ca784e34786c5.
> > 
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=46d3ceabd8d98ed0ad10f20c595ca784e34786c5
> > 
> > I have a question about the use of tcp_write_xmit() in
> > net/ipv4/tcp_output.c
> > 
> > When tcp_write_xmit() is called, the nonagle flag of the tcp socket is
> > ignored and instead 0 is passed. This causes the Nagle-algorithm to be
> > used even if it should not be, which in some cases causes a large delay.
> > 
> > Was there a reason that 0 was hard-coded?
> > 
> > Although current mainline code has been refactored, 0 is still
> > hard-coded for TSQ cases.
> 
> Hi John
> 
> Do you have any data, like exact kernel version you use, tcpdump or
> things like that ?
> 
> When the TCP writes are throttled, its only up to the point next packet
> is TX completed, and only if you have at least 128KB worth of bytes
> consumed in the QDISC/NIC layers for this socket.
> 
> We had some issues at very high speeds, not related to Nagle at all.
> 
> 98e09386c0ef tcp: tsq: restore minimal amount of queueing
> c9eeec26e32e tcp: TSQ can use a dynamic limit
> d6a4a1041176 tcp: GSO should be TSQ friendly
> d01cb20711e3 tcp: add LAST_ACK as a valid state for TSQ
> 
> I am not aware of TSQ being a problem for Nagle.
> 
> Also take a look at recent TCP autocork patches, as they are more
> related to Nagle
> 
> a181ceb501b3 tcp: autocork should not hold first packet in write queue
> f54b311142a9 tcp: auto corking
> 
> Thanks

I think I mentioned this once, but the "a181ceb501b3" fix
included this bit :

 Also, as TX completion is lockless, it's safer to perform sk_wmem_alloc
 test after setting TSQ_THROTTLED.

So its possible you hit the same race, its only a guess...

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 03d26b85eab8..c99a63c6e91a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1904,7 +1904,12 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 
 		if (atomic_read(&sk->sk_wmem_alloc) > limit) {
 			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
-			break;
+			/* It is possible TX completion already happened
+			 * before we set TSQ_THROTTLED, so we must
+			 * test again the condition.
+			 */
+			if (atomic_read(&sk->sk_wmem_alloc) > limit)
+				break;
 		}
 
 		limit = mss_now;

^ permalink raw reply related

* Re: [PATCH v2.53] datapath: Add basic MPLS support to kernel
From: Ben Pfaff @ 2014-02-07 16:15 UTC (permalink / raw)
  To: Jesse Gross, Simon Horman
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA, Ravi K
In-Reply-To: <1391590479-15567-1-git-send-email-horms-/R6kz+dDXgpPR4JQBCEnsQ@public.gmane.org>

On Wed, Feb 05, 2014 at 05:54:38PM +0900, Simon Horman wrote:
> * Ben has explained to me that there has been a change of direction
>   at the ONF with regards to MPLS and VLAN tag ordering. As per
>   his changelog for "Always insert MPLS labels after VLAN tags",
>   the situation is now as follows:
> 
>   * OpenFlow 1.1 and 1.2 always inserted MPLS labels after VLAN tags.
>   * OpenFlow 1.3 and 1.4 insert MPLS labels before VLAN tags.
>   * OpenFlow 1.3.4 and 1.5, both in preparation, recognize that the change
>     in 1.3 was an error and revert it.
> 
>   With this in mind only OF1.4 specifies a requirement for inserting MPLS
>   labels before VLAN tags. This appears to be an abbertation at this
>   point.
> 
>   Ben's suggestion as per his patch "Always insert MPLS labels after VLAN
>   tags" is that Open vSwtich should only support inserting MPLS labels
>   after VLAN tags. I agree with this. And to this end I have
>   updated the MPLS code for the kernel datapath (this patch).

I spoke to Jesse briefly about MPLS versus VLANs a few days ago.  He
expressed a concern about the larger problem of figuring out how all the
various forms of tags should interact: VLAN, MPLS, PBB, and others that
I can't think of at the moment.  It would be more consistent from this
higher point of view if "push" and "pop" actions always acted in a
predictable place.

I agree with that point, but it still stands that I don't know of a
well-defined way for VLAN-inside-MPLS to function.  So here is my
proposal for you and Jesse to consider:

        - The datapath push_mpls action is well-defined when no VLAN is
          present.  Great.

        - Declare the behavior of the datapath push_mpls action to be
          undefined in the presence of a VLAN for now, since we don't
          really know what should happen.  For now, make the datapaths
          reject any flow that does push_mpls in the presence of a VLAN.
          Later, we can redefine that case as we please since userspace
          should not depend on the previous behavior.

        - Modify userspace to pop_vlan, push_mpls, push_vlan if
          necessary to avoid the undefined behavior.

Jesse, is that reasonable?

^ permalink raw reply

* Re: [PATCH v3 net 2/9] bridge: Fix the way to insert new local fdb entries in br_fdb_changeaddr
From: Stephen Hemminger @ 2014-02-07 16:31 UTC (permalink / raw)
  To: Toshiaki Makita; +Cc: David S . Miller, Vlad Yasevich, netdev
In-Reply-To: <1391759306-24956-3-git-send-email-makita.toshiaki@lab.ntt.co.jp>

On Fri,  7 Feb 2014 16:48:19 +0900
Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:

> Since commit bc9a25d21ef8 ("bridge: Add vlan support for local fdb entries"),
> br_fdb_changeaddr() has inserted a new local fdb entry only if it can
> find old one. But if we have two ports where they have the same address
> or user has deleted a local entry, there will be no entry for one of the
> ports.
> 
> Example of problematic case:
>   ip link set eth0 address aa:bb:cc:dd:ee:ff
>   ip link set eth1 address aa:bb:cc:dd:ee:ff
>   brctl addif br0 eth0
>   brctl addif br0 eth1 # eth1 will not have a local entry due to dup.

I think the second addif should fail, it doesn't seem valid to have
two interfaces on same bridge with same address. Most hardware switches
would disable the port in that case.

^ permalink raw reply

* Re: nonagle flags for TSQ
From: John Ogness @ 2014-02-07 16:34 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1391787297.10160.50.camel@edumazet-glaptop2.roam.corp.google.com>

Hi Eric,

On 2014-02-07, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> I have a question about the use of tcp_write_xmit() in
>> net/ipv4/tcp_output.c
>> 
>> When tcp_write_xmit() is called, the nonagle flag of the tcp socket
>> is ignored and instead 0 is passed. This causes the Nagle-algorithm
>> to be used even if it should not be, which in some cases causes a
>> large delay.
>> 
>> Was there a reason that 0 was hard-coded?
>> 
>> Although current mainline code has been refactored, 0 is still
>> hard-coded for TSQ cases.
>
> Do you have any data, like exact kernel version you use, tcpdump or
> things like that ?

Yes. Provided below.

> I am not aware of TSQ being a problem for Nagle.

We are talking about a scenario where Nagle is not supposed to be used
(i.e. TCP_NODELAY is set on the socket).

Here are TCP dumps taken when initiating a gdb debug session from host
to clnt. One from a 2.6 kernel, where the delay is not present. One from
a 3.10 kernel, where a 39ms(!) delay is present. The packets are the
same except the 3.10 kernel shows an extra ACK packet (14th packet from
3.10 dump).

On 2.6 Kernel:
00:00:00.000000 IP host.51922 > clnt.10123: Flags [S], seq 2979238116, win 5840, options [mss 1460,sackOK,TS val 1988280883 ecr 0,nop,wscale 6], length 0
00:00:00.000587 IP clnt.10123 > host.51922: Flags [S.], seq 3460146061, ack 2979238117, win 5792, options [mss 1460,sackOK,TS val 4294688455 ecr 1988280883,nop,wscale 6], length 0
00:00:00.000022 IP host.51922 > clnt.10123: Flags [.], ack 1, win 92, options [nop,nop,TS val 1988280883 ecr 4294688455], length 0
00:00:00.000064 IP host.51922 > clnt.10123: Flags [P.], seq 1:15, ack 1, win 92, options [nop,nop,TS val 1988280883 ecr 4294688455], length 14
00:00:00.000507 IP clnt.10123 > host.51922: Flags [.], ack 15, win 91, options [nop,nop,TS val 4294688456 ecr 1988280883], length 0
00:00:00.000010 IP clnt.10123 > host.51922: Flags [P.], seq 1:2, ack 15, win 91, options [nop,nop,TS val 4294688456 ecr 1988280883], length 1
00:00:00.000010 IP host.51922 > clnt.10123: Flags [.], ack 2, win 92, options [nop,nop,TS val 1988280884 ecr 4294688456], length 0
00:00:00.000005 IP clnt.10123 > host.51922: Flags [P.], seq 2:94, ack 15, win 91, options [nop,nop,TS val 4294688456 ecr 1988280883], length 92
00:00:00.000006 IP host.51922 > clnt.10123: Flags [.], ack 94, win 92, options [nop,nop,TS val 1988280884 ecr 4294688456], length 0
00:00:00.000027 IP host.51922 > clnt.10123: Flags [P.], seq 15:16, ack 94, win 92, options [nop,nop,TS val 1988280884 ecr 4294688456], length 1
00:00:00.000030 IP host.51922 > clnt.10123: Flags [P.], seq 16:56, ack 94, win 92, options [nop,nop,TS val 1988280884 ecr 4294688456], length 40
00:00:00.000450 IP clnt.10123 > host.51922: Flags [.], ack 56, win 91, options [nop,nop,TS val 4294688456 ecr 1988280884], length 0
00:00:00.000010 IP clnt.10123 > host.51922: Flags [P.], seq 94:95, ack 56, win 91, options [nop,nop,TS val 4294688456 ecr 1988280884], length 1
00:00:00.000006 IP clnt.10123 > host.51922: Flags [P.], seq 95:150, ack 56, win 91, options [nop,nop,TS val 4294688456 ecr 1988280884], length 55
00:00:00.000053 IP host.51922 > clnt.10123: Flags [.], ack 150, win 92, options [nop,nop,TS val 1988280885 ecr 4294688456], length 0

On 3.10 Kernel:
00:00:00.000000 IP host.56618 > clnt.10123: Flags [S], seq 315694752, win 5840, options [mss 1460,sackOK,TS val 1987836230 ecr 0,nop,wscale 6], length 0
00:00:00.000267 IP clnt.10123 > host.56618: Flags [S.], seq 469364385, ack 315694753, win 14480, options [mss 1460,sackOK,TS val 6070753 ecr 1987836230,nop,wscale 7], length 0
00:00:00.000022 IP host.56618 > clnt.10123: Flags [.], ack 1, win 92, options [nop,nop,TS val 1987836231 ecr 6070753], length 0
00:00:00.000071 IP host.56618 > clnt.10123: Flags [P.], seq 1:15, ack 1, win 92, options [nop,nop,TS val 1987836231 ecr 6070753], length 14
00:00:00.000501 IP clnt.10123 > host.56618: Flags [.], ack 15, win 114, options [nop,nop,TS val 6070753 ecr 1987836231], length 0
00:00:00.000011 IP clnt.10123 > host.56618: Flags [P.], seq 1:2, ack 15, win 114, options [nop,nop,TS val 6070753 ecr 1987836231], length 1
00:00:00.000007 IP host.56618 > clnt.10123: Flags [.], ack 2, win 92, options [nop,nop,TS val 1987836231 ecr 6070753], length 0
00:00:00.000502 IP clnt.10123 > host.56618: Flags [P.], seq 2:94, ack 15, win 114, options [nop,nop,TS val 6070753 ecr 1987836231], length 92
00:00:00.000008 IP host.56618 > clnt.10123: Flags [.], ack 94, win 92, options [nop,nop,TS val 1987836232 ecr 6070753], length 0
00:00:00.000022 IP host.56618 > clnt.10123: Flags [P.], seq 15:16, ack 94, win 92, options [nop,nop,TS val 1987836232 ecr 6070753], length 1
00:00:00.000030 IP host.56618 > clnt.10123: Flags [P.], seq 16:56, ack 94, win 92, options [nop,nop,TS val 1987836232 ecr 6070753], length 40
00:00:00.000507 IP clnt.10123 > host.56618: Flags [.], ack 56, win 114, options [nop,nop,TS val 6070753 ecr 1987836232], length 0
00:00:00.000009 IP clnt.10123 > host.56618: Flags [P.], seq 94:95, ack 56, win 114, options [nop,nop,TS val 6070753 ecr 1987836232], length 1
00:00:00.039314 IP host.56618 > clnt.10123: Flags [.], ack 95, win 92, options [nop,nop,TS val 1987836272 ecr 6070753], length 0
00:00:00.000502 IP clnt.10123 > host.56618: Flags [P.], seq 95:150, ack 56, win 114, options [nop,nop,TS val 6070757 ecr 1987836272], length 55
00:00:00.000031 IP host.56618 > clnt.10123: Flags [.], ack 150, win 92, options [nop,nop,TS val 1987836272 ecr 6070757], length 0


Using the following patch fixed the problem, but maybe there was a
reason why it wasn't implemented like this in the first place?


diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 03d26b8..9408a0d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -697,8 +697,11 @@ static void tcp_tsq_handler(struct sock *sk)
 {
 	if ((1 << sk->sk_state) &
 	    (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_CLOSING |
-	     TCPF_CLOSE_WAIT  | TCPF_LAST_ACK))
-		tcp_write_xmit(sk, tcp_current_mss(sk), 0, 0, GFP_ATOMIC);
+	     TCPF_CLOSE_WAIT  | TCPF_LAST_ACK)) {
+		struct tcp_sock *tp = tcp_sk(sk);
+		tcp_write_xmit(sk, tcp_current_mss(sk), tp->nonagle, 0,
+			       GFP_ATOMIC);
+	}
 }
 /*
  * One tasklet per cpu tries to send more skbs.


John Ogness

^ permalink raw reply related

* Re: nonagle flags for TSQ
From: Eric Dumazet @ 2014-02-07 17:05 UTC (permalink / raw)
  To: John Ogness; +Cc: netdev
In-Reply-To: <878utm3mq9.fsf@linutronix.de>

On Fri, 2014-02-07 at 17:34 +0100, John Ogness wrote:
> Hi Eric,

> 
> We are talking about a scenario where Nagle is not supposed to be used
> (i.e. TCP_NODELAY is set on the socket).
> 
> Here are TCP dumps taken when initiating a gdb debug session from host
> to clnt. One from a 2.6 kernel, where the delay is not present. One from
> a 3.10 kernel, where a 39ms(!) delay is present. The packets are the
> same except the 3.10 kernel shows an extra ACK packet (14th packet from
> 3.10 dump).
> 
> On 2.6 Kernel:
> 00:00:00.000000 IP host.51922 > clnt.10123: Flags [S], seq 2979238116, win 5840, options [mss 1460,sackOK,TS val 1988280883 ecr 0,nop,wscale 6], length 0
> 00:00:00.000587 IP clnt.10123 > host.51922: Flags [S.], seq 3460146061, ack 2979238117, win 5792, options [mss 1460,sackOK,TS val 4294688455 ecr 1988280883,nop,wscale 6], length 0
> 00:00:00.000022 IP host.51922 > clnt.10123: Flags [.], ack 1, win 92, options [nop,nop,TS val 1988280883 ecr 4294688455], length 0
> 00:00:00.000064 IP host.51922 > clnt.10123: Flags [P.], seq 1:15, ack 1, win 92, options [nop,nop,TS val 1988280883 ecr 4294688455], length 14
> 00:00:00.000507 IP clnt.10123 > host.51922: Flags [.], ack 15, win 91, options [nop,nop,TS val 4294688456 ecr 1988280883], length 0
> 00:00:00.000010 IP clnt.10123 > host.51922: Flags [P.], seq 1:2, ack 15, win 91, options [nop,nop,TS val 4294688456 ecr 1988280883], length 1
> 00:00:00.000010 IP host.51922 > clnt.10123: Flags [.], ack 2, win 92, options [nop,nop,TS val 1988280884 ecr 4294688456], length 0
> 00:00:00.000005 IP clnt.10123 > host.51922: Flags [P.], seq 2:94, ack 15, win 91, options [nop,nop,TS val 4294688456 ecr 1988280883], length 92
> 00:00:00.000006 IP host.51922 > clnt.10123: Flags [.], ack 94, win 92, options [nop,nop,TS val 1988280884 ecr 4294688456], length 0
> 00:00:00.000027 IP host.51922 > clnt.10123: Flags [P.], seq 15:16, ack 94, win 92, options [nop,nop,TS val 1988280884 ecr 4294688456], length 1
> 00:00:00.000030 IP host.51922 > clnt.10123: Flags [P.], seq 16:56, ack 94, win 92, options [nop,nop,TS val 1988280884 ecr 4294688456], length 40
> 00:00:00.000450 IP clnt.10123 > host.51922: Flags [.], ack 56, win 91, options [nop,nop,TS val 4294688456 ecr 1988280884], length 0
> 00:00:00.000010 IP clnt.10123 > host.51922: Flags [P.], seq 94:95, ack 56, win 91, options [nop,nop,TS val 4294688456 ecr 1988280884], length 1
> 00:00:00.000006 IP clnt.10123 > host.51922: Flags [P.], seq 95:150, ack 56, win 91, options [nop,nop,TS val 4294688456 ecr 1988280884], length 55
> 00:00:00.000053 IP host.51922 > clnt.10123: Flags [.], ack 150, win 92, options [nop,nop,TS val 1988280885 ecr 4294688456], length 0
> 
> On 3.10 Kernel:
> 00:00:00.000000 IP host.56618 > clnt.10123: Flags [S], seq 315694752, win 5840, options [mss 1460,sackOK,TS val 1987836230 ecr 0,nop,wscale 6], length 0
> 00:00:00.000267 IP clnt.10123 > host.56618: Flags [S.], seq 469364385, ack 315694753, win 14480, options [mss 1460,sackOK,TS val 6070753 ecr 1987836230,nop,wscale 7], length 0
> 00:00:00.000022 IP host.56618 > clnt.10123: Flags [.], ack 1, win 92, options [nop,nop,TS val 1987836231 ecr 6070753], length 0
> 00:00:00.000071 IP host.56618 > clnt.10123: Flags [P.], seq 1:15, ack 1, win 92, options [nop,nop,TS val 1987836231 ecr 6070753], length 14
> 00:00:00.000501 IP clnt.10123 > host.56618: Flags [.], ack 15, win 114, options [nop,nop,TS val 6070753 ecr 1987836231], length 0
> 00:00:00.000011 IP clnt.10123 > host.56618: Flags [P.], seq 1:2, ack 15, win 114, options [nop,nop,TS val 6070753 ecr 1987836231], length 1
> 00:00:00.000007 IP host.56618 > clnt.10123: Flags [.], ack 2, win 92, options [nop,nop,TS val 1987836231 ecr 6070753], length 0
> 00:00:00.000502 IP clnt.10123 > host.56618: Flags [P.], seq 2:94, ack 15, win 114, options [nop,nop,TS val 6070753 ecr 1987836231], length 92
> 00:00:00.000008 IP host.56618 > clnt.10123: Flags [.], ack 94, win 92, options [nop,nop,TS val 1987836232 ecr 6070753], length 0
> 00:00:00.000022 IP host.56618 > clnt.10123: Flags [P.], seq 15:16, ack 94, win 92, options [nop,nop,TS val 1987836232 ecr 6070753], length 1
> 00:00:00.000030 IP host.56618 > clnt.10123: Flags [P.], seq 16:56, ack 94, win 92, options [nop,nop,TS val 1987836232 ecr 6070753], length 40
> 00:00:00.000507 IP clnt.10123 > host.56618: Flags [.], ack 56, win 114, options [nop,nop,TS val 6070753 ecr 1987836232], length 0
> 00:00:00.000009 IP clnt.10123 > host.56618: Flags [P.], seq 94:95, ack 56, win 114, options [nop,nop,TS val 6070753 ecr 1987836232], length 1
> 00:00:00.039314 IP host.56618 > clnt.10123: Flags [.], ack 95, win 92, options [nop,nop,TS val 1987836272 ecr 6070753], length 0
> 00:00:00.000502 IP clnt.10123 > host.56618: Flags [P.], seq 95:150, ack 56, win 114, options [nop,nop,TS val 6070757 ecr 1987836272], length 55
> 00:00:00.000031 IP host.56618 > clnt.10123: Flags [.], ack 150, win 92, options [nop,nop,TS val 1987836272 ecr 6070757], length 0
> 
> 
> Using the following patch fixed the problem, but maybe there was a
> reason why it wasn't implemented like this in the first place?
> 

I cant see how TSQ would even trigger with this test.

What value do you have on /proc/sys/net/ipv4/tcp_limit_output_bytes ?

Normal value is 128KB, meaning you have to queue about 56 small frames
before hitting the TSQ condition. (skb->truesize of small frames being
2304 bytes)

I think your patch might makes sense if we remembered (at the time we
decide to throttle the write) the value of nonagle parameter passed to
tcp_write_xmit() : Not sure it really should the current tp->nonagle
value...

Thanks !

^ permalink raw reply

* Re: [PATCH 19/26] drivers: isdn: Move prototype declaration to header file platform.h from diva_didd.c
From: Josh Triplett @ 2014-02-07 17:15 UTC (permalink / raw)
  To: David Laight
  Cc: 'Rashika Kheria', linux-kernel@vger.kernel.org,
	Armin Schindler, Karsten Keil, netdev@vger.kernel.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D0F6BA76D@AcuExch.aculab.com>

On Fri, Feb 07, 2014 at 01:33:46PM +0000, David Laight wrote:
> From: Rashika Kheria
> > Move prototype declarations of function to header file
> > hardware/eicon/platform.h because they are used by more than one file.
> > 
> > This eliminates the following warnings in hardware/eicon/diddfunc.c:
> > drivers/isdn/hardware/eicon/diddfunc.c:95:12: warning: no previous prototype for diddfunc_init [-
> > Wmissing-prototypes]
> > drivers/isdn/hardware/eicon/diddfunc.c:110:13: warning: no previous prototype for diddfunc_finit [-
> > Wmissing-prototypes]
> ...
> > diff --git a/drivers/isdn/hardware/eicon/diva_didd.c b/drivers/isdn/hardware/eicon/diva_didd.c
> > index fab6ccf..56d32a7 100644
> > --- a/drivers/isdn/hardware/eicon/diva_didd.c
> > +++ b/drivers/isdn/hardware/eicon/diva_didd.c
> > @@ -39,9 +39,6 @@ MODULE_LICENSE("GPL");
> >  #define DBG_MINIMUM  (DL_LOG + DL_FTL + DL_ERR)
> >  #define DBG_DEFAULT  (DBG_MINIMUM + DL_XLOG + DL_REG)
> > 
> > -extern int diddfunc_init(void);
> > -extern void diddfunc_finit(void);
> > -
> >  extern void DIVA_DIDD_Read(void *, int);
> 
> You should move that one as well.
> There really shouldn't be 'extern' definitions for any function in
> any C files since you want the compiler to check they are correct
> when the function itself is compiled.

Absolutely, but as far as I can tell Rashika is doing this
incrementally, organized more by header than by source file, so I'd
expect a few externs in a source file to disappear at a time rather than
all in one patch.

- Josh Triplett

^ permalink raw reply

* RE: [PATCH 19/26] drivers: isdn: Move prototype declaration to header file platform.h from diva_didd.c
From: David Laight @ 2014-02-07 17:22 UTC (permalink / raw)
  To: 'Josh Triplett'
  Cc: 'Rashika Kheria', linux-kernel@vger.kernel.org,
	Armin Schindler, Karsten Keil, netdev@vger.kernel.org
In-Reply-To: <20140207171516.GA7298@jtriplet-mobl1>

From: Josh Triplett
> On Fri, Feb 07, 2014 at 01:33:46PM +0000, David Laight wrote:
> > From: Rashika Kheria
> > > Move prototype declarations of function to header file
> > > hardware/eicon/platform.h because they are used by more than one file.
> > >
> > > This eliminates the following warnings in hardware/eicon/diddfunc.c:
> > > drivers/isdn/hardware/eicon/diddfunc.c:95:12: warning: no previous prototype for diddfunc_init [-
> > > Wmissing-prototypes]
> > > drivers/isdn/hardware/eicon/diddfunc.c:110:13: warning: no previous prototype for diddfunc_finit
> [-
> > > Wmissing-prototypes]
> > ...
> > > diff --git a/drivers/isdn/hardware/eicon/diva_didd.c b/drivers/isdn/hardware/eicon/diva_didd.c
> > > index fab6ccf..56d32a7 100644
> > > --- a/drivers/isdn/hardware/eicon/diva_didd.c
> > > +++ b/drivers/isdn/hardware/eicon/diva_didd.c
> > > @@ -39,9 +39,6 @@ MODULE_LICENSE("GPL");
> > >  #define DBG_MINIMUM  (DL_LOG + DL_FTL + DL_ERR)
> > >  #define DBG_DEFAULT  (DBG_MINIMUM + DL_XLOG + DL_REG)
> > >
> > > -extern int diddfunc_init(void);
> > > -extern void diddfunc_finit(void);
> > > -
> > >  extern void DIVA_DIDD_Read(void *, int);
> >
> > You should move that one as well.
> > There really shouldn't be 'extern' definitions for any function in
> > any C files since you want the compiler to check they are correct
> > when the function itself is compiled.
> 
> Absolutely, but as far as I can tell Rashika is doing this
> incrementally, organized more by header than by source file, so I'd
> expect a few externs in a source file to disappear at a time rather than
> all in one patch.

Unless any actual bugs are found, I'd have thought a single patch for
each driver would be enough, maybe even one for the whole lot - depending
on how they are maintained.
The 26 patches already posted are a little excessive.

	David

^ permalink raw reply

* [PATCH 01/18] ipvs: fix AF assignment in ip_vs_conn_new()
From: Pablo Neira Ayuso @ 2014-02-07 17:41 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1391794904-4017-1-git-send-email-pablo@netfilter.org>

From: Michal Kubecek <mkubecek@suse.cz>

If a fwmark is passed to ip_vs_conn_new(), it is passed in
vaddr, not daddr. Therefore we should set AF to AF_UNSPEC in
vaddr assignment (like we do in ip_vs_ct_in_get()), otherwise we
may copy only first 4 bytes of an IPv6 address into cp->daddr.

Signed-off-by: Bogdano Arendartchuk <barendartchuk@suse.com>
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
 net/netfilter/ipvs/ip_vs_conn.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index 59a1a85..a8eb0a8 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -871,11 +871,11 @@ ip_vs_conn_new(const struct ip_vs_conn_param *p,
 	cp->protocol	   = p->protocol;
 	ip_vs_addr_set(p->af, &cp->caddr, p->caddr);
 	cp->cport	   = p->cport;
-	ip_vs_addr_set(p->af, &cp->vaddr, p->vaddr);
-	cp->vport	   = p->vport;
-	/* proto should only be IPPROTO_IP if d_addr is a fwmark */
+	/* proto should only be IPPROTO_IP if p->vaddr is a fwmark */
 	ip_vs_addr_set(p->protocol == IPPROTO_IP ? AF_UNSPEC : p->af,
-		       &cp->daddr, daddr);
+		       &cp->vaddr, p->vaddr);
+	cp->vport	   = p->vport;
+	ip_vs_addr_set(p->af, &cp->daddr, daddr);
 	cp->dport          = dport;
 	cp->flags	   = flags;
 	cp->fwmark         = fwmark;
-- 
1.7.10.4


^ permalink raw reply related

* [PATCH 02/18] netfilter: nft_ct: fix unconditional dump of 'dir' attr
From: Pablo Neira Ayuso @ 2014-02-07 17:41 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1391794904-4017-1-git-send-email-pablo@netfilter.org>

From: Arturo Borrero <arturo.borrero.glez@gmail.com>

We want to make sure that the information that we get from the kernel can
be reinjected without troubles. The kernel shouldn't return an attribute
that is not required, or even prohibited.

Dumping unconditionally NFTA_CT_DIRECTION could lead an application in
userspace to interpret that the attribute was originally set, while it
was not.

Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nft_ct.c |   15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c
index 917052e..feaf0f3 100644
--- a/net/netfilter/nft_ct.c
+++ b/net/netfilter/nft_ct.c
@@ -311,8 +311,19 @@ static int nft_ct_get_dump(struct sk_buff *skb, const struct nft_expr *expr)
 		goto nla_put_failure;
 	if (nla_put_be32(skb, NFTA_CT_KEY, htonl(priv->key)))
 		goto nla_put_failure;
-	if (nla_put_u8(skb, NFTA_CT_DIRECTION, priv->dir))
-		goto nla_put_failure;
+
+	switch (priv->key) {
+	case NFT_CT_PROTOCOL:
+	case NFT_CT_SRC:
+	case NFT_CT_DST:
+	case NFT_CT_PROTO_SRC:
+	case NFT_CT_PROTO_DST:
+		if (nla_put_u8(skb, NFTA_CT_DIRECTION, priv->dir))
+			goto nla_put_failure;
+	default:
+		break;
+	}
+
 	return 0;
 
 nla_put_failure:
-- 
1.7.10.4


^ permalink raw reply related

* [PATCH 03/18] netfilter: nf_tables: fix oops when deleting a chain with references
From: Pablo Neira Ayuso @ 2014-02-07 17:41 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1391794904-4017-1-git-send-email-pablo@netfilter.org>

From: Patrick McHardy <kaber@trash.net>

The following commands trigger an oops:

 # nft -i
 nft> add table filter
 nft> add chain filter input { type filter hook input priority 0; }
 nft> add chain filter test
 nft> add rule filter input jump test
 nft> delete chain filter test

We need to check the chain use counter before allowing destruction since
we might have references from sets or jump rules.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=69341
Reported-by: Matthew Ife <deleriux1@gmail.com>
Tested-by: Matthew Ife <deleriux1@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_tables_api.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 117bbaa..9ce3053 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -1045,7 +1045,7 @@ static int nf_tables_delchain(struct sock *nlsk, struct sk_buff *skb,
 	if (IS_ERR(chain))
 		return PTR_ERR(chain);
 
-	if (!list_empty(&chain->rules))
+	if (!list_empty(&chain->rules) || chain->use > 0)
 		return -EBUSY;
 
 	list_del(&chain->list);
-- 
1.7.10.4


^ permalink raw reply related

* [PATCH 04/18] netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
From: Pablo Neira Ayuso @ 2014-02-07 17:41 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1391794904-4017-1-git-send-email-pablo@netfilter.org>

From: Andrey Vagin <avagin@openvz.org>

Lets look at destroy_conntrack:

hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
...
nf_conntrack_free(ct)
	kmem_cache_free(net->ct.nf_conntrack_cachep, ct);

net->ct.nf_conntrack_cachep is created with SLAB_DESTROY_BY_RCU.

The hash is protected by rcu, so readers look up conntracks without
locks.
A conntrack is removed from the hash, but in this moment a few readers
still can use the conntrack. Then this conntrack is released and another
thread creates conntrack with the same address and the equal tuple.
After this a reader starts to validate the conntrack:
* It's not dying, because a new conntrack was created
* nf_ct_tuple_equal() returns true.

But this conntrack is not initialized yet, so it can not be used by two
threads concurrently. In this case BUG_ON may be triggered from
nf_nat_setup_info().

Florian Westphal suggested to check the confirm bit too. I think it's
right.

task 1			task 2			task 3
			nf_conntrack_find_get
			 ____nf_conntrack_find
destroy_conntrack
 hlist_nulls_del_rcu
 nf_conntrack_free
 kmem_cache_free
						__nf_conntrack_alloc
						 kmem_cache_alloc
						 memset(&ct->tuplehash[IP_CT_DIR_MAX],
			 if (nf_ct_is_dying(ct))
			 if (!nf_ct_tuple_equal()

I'm not sure, that I have ever seen this race condition in a real life.
Currently we are investigating a bug, which is reproduced on a few nodes.
In our case one conntrack is initialized from a few tasks concurrently,
we don't have any other explanation for this.

<2>[46267.083061] kernel BUG at net/ipv4/netfilter/nf_nat_core.c:322!
...
<4>[46267.083951] RIP: 0010:[<ffffffffa01e00a4>]  [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590 [nf_nat]
...
<4>[46267.085549] Call Trace:
<4>[46267.085622]  [<ffffffffa023421b>] alloc_null_binding+0x5b/0xa0 [iptable_nat]
<4>[46267.085697]  [<ffffffffa02342bc>] nf_nat_rule_find+0x5c/0x80 [iptable_nat]
<4>[46267.085770]  [<ffffffffa0234521>] nf_nat_fn+0x111/0x260 [iptable_nat]
<4>[46267.085843]  [<ffffffffa0234798>] nf_nat_out+0x48/0xd0 [iptable_nat]
<4>[46267.085919]  [<ffffffff814841b9>] nf_iterate+0x69/0xb0
<4>[46267.085991]  [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
<4>[46267.086063]  [<ffffffff81484374>] nf_hook_slow+0x74/0x110
<4>[46267.086133]  [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
<4>[46267.086207]  [<ffffffff814b5890>] ? dst_output+0x0/0x20
<4>[46267.086277]  [<ffffffff81495204>] ip_output+0xa4/0xc0
<4>[46267.086346]  [<ffffffff814b65a4>] raw_sendmsg+0x8b4/0x910
<4>[46267.086419]  [<ffffffff814c10fa>] inet_sendmsg+0x4a/0xb0
<4>[46267.086491]  [<ffffffff814459aa>] ? sock_update_classid+0x3a/0x50
<4>[46267.086562]  [<ffffffff81444d67>] sock_sendmsg+0x117/0x140
<4>[46267.086638]  [<ffffffff8151997b>] ? _spin_unlock_bh+0x1b/0x20
<4>[46267.086712]  [<ffffffff8109d370>] ? autoremove_wake_function+0x0/0x40
<4>[46267.086785]  [<ffffffff81495e80>] ? do_ip_setsockopt+0x90/0xd80
<4>[46267.086858]  [<ffffffff8100be0e>] ? call_function_interrupt+0xe/0x20
<4>[46267.086936]  [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
<4>[46267.087006]  [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
<4>[46267.087081]  [<ffffffff8118f2e8>] ? kmem_cache_alloc+0xd8/0x1e0
<4>[46267.087151]  [<ffffffff81445599>] sys_sendto+0x139/0x190
<4>[46267.087229]  [<ffffffff81448c0d>] ? sock_setsockopt+0x16d/0x6f0
<4>[46267.087303]  [<ffffffff810efa47>] ? audit_syscall_entry+0x1d7/0x200
<4>[46267.087378]  [<ffffffff810ef795>] ? __audit_syscall_exit+0x265/0x290
<4>[46267.087454]  [<ffffffff81474885>] ? compat_sys_setsockopt+0x75/0x210
<4>[46267.087531]  [<ffffffff81474b5f>] compat_sys_socketcall+0x13f/0x210
<4>[46267.087607]  [<ffffffff8104dea3>] ia32_sysret+0x0/0x5
<4>[46267.087676] Code: 91 20 e2 01 75 29 48 89 de 4c 89 f7 e8 56 fa ff ff 85 c0 0f 84 68 fc ff ff 0f b6 4d c6 41 8b 45 00 e9 4d fb ff ff e8 7c 19 e9 e0 <0f> 0b eb fe f6 05 17 91 20 e2 80 74 ce 80 3d 5f 2e 00 00 00 74
<1>[46267.088023] RIP  [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_conntrack_core.c |   21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 8824ed0..4d1fb5d 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -312,6 +312,21 @@ static void death_by_timeout(unsigned long ul_conntrack)
 	nf_ct_delete((struct nf_conn *)ul_conntrack, 0, 0);
 }
 
+static inline bool
+nf_ct_key_equal(struct nf_conntrack_tuple_hash *h,
+			const struct nf_conntrack_tuple *tuple,
+			u16 zone)
+{
+	struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
+
+	/* A conntrack can be recreated with the equal tuple,
+	 * so we need to check that the conntrack is confirmed
+	 */
+	return nf_ct_tuple_equal(tuple, &h->tuple) &&
+		nf_ct_zone(ct) == zone &&
+		nf_ct_is_confirmed(ct);
+}
+
 /*
  * Warning :
  * - Caller must take a reference on returned object
@@ -333,8 +348,7 @@ ____nf_conntrack_find(struct net *net, u16 zone,
 	local_bh_disable();
 begin:
 	hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[bucket], hnnode) {
-		if (nf_ct_tuple_equal(tuple, &h->tuple) &&
-		    nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)) == zone) {
+		if (nf_ct_key_equal(h, tuple, zone)) {
 			NF_CT_STAT_INC(net, found);
 			local_bh_enable();
 			return h;
@@ -372,8 +386,7 @@ begin:
 			     !atomic_inc_not_zero(&ct->ct_general.use)))
 			h = NULL;
 		else {
-			if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple) ||
-				     nf_ct_zone(ct) != zone)) {
+			if (unlikely(!nf_ct_key_equal(h, tuple, zone))) {
 				nf_ct_put(ct);
 				goto begin;
 			}
-- 
1.7.10.4


^ permalink raw reply related

* [PATCH 09/18] netfilter: nft_ct: fix missing NFT_CT_L3PROTOCOL key in validity checks
From: Pablo Neira Ayuso @ 2014-02-07 17:41 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1391794904-4017-1-git-send-email-pablo@netfilter.org>

From: Patrick McHardy <kaber@trash.net>

The key was missing in the list of valid keys, add it.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nft_ct.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c
index feaf0f3..46e2754 100644
--- a/net/netfilter/nft_ct.c
+++ b/net/netfilter/nft_ct.c
@@ -226,6 +226,7 @@ static int nft_ct_init_validate_get(const struct nft_expr *expr,
 		if (tb[NFTA_CT_DIRECTION] != NULL)
 			return -EINVAL;
 		break;
+	case NFT_CT_L3PROTOCOL:
 	case NFT_CT_PROTOCOL:
 	case NFT_CT_SRC:
 	case NFT_CT_DST:
-- 
1.7.10.4


^ permalink raw reply related

* [PATCH 10/18] netfilter: nf_tables: add AF specific expression support
From: Pablo Neira Ayuso @ 2014-02-07 17:41 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1391794904-4017-1-git-send-email-pablo@netfilter.org>

From: Patrick McHardy <kaber@trash.net>

For the reject module, we need to add AF-specific implementations to
get rid of incorrect module dependencies. Try to load an AF-specific
module first and fall back to generic modules.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_tables.h |    5 +++++
 net/netfilter/nf_tables_api.c     |   22 ++++++++++++++++------
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 57c8ff7..0f68e47 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -252,6 +252,7 @@ void nf_tables_unbind_set(const struct nft_ctx *ctx, struct nft_set *set,
  *	@owner: module reference
  *	@policy: netlink attribute policy
  *	@maxattr: highest netlink attribute number
+ *	@family: address family for AF-specific types
  */
 struct nft_expr_type {
 	const struct nft_expr_ops	*(*select_ops)(const struct nft_ctx *,
@@ -262,6 +263,7 @@ struct nft_expr_type {
 	struct module			*owner;
 	const struct nla_policy		*policy;
 	unsigned int			maxattr;
+	u8				family;
 };
 
 /**
@@ -529,6 +531,9 @@ void nft_unregister_expr(struct nft_expr_type *);
 #define MODULE_ALIAS_NFT_CHAIN(family, name) \
 	MODULE_ALIAS("nft-chain-" __stringify(family) "-" name)
 
+#define MODULE_ALIAS_NFT_AF_EXPR(family, name) \
+	MODULE_ALIAS("nft-expr-" __stringify(family) "-" name)
+
 #define MODULE_ALIAS_NFT_EXPR(name) \
 	MODULE_ALIAS("nft-expr-" name)
 
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 3c5a219..113c469 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -1114,35 +1114,45 @@ void nft_unregister_expr(struct nft_expr_type *type)
 }
 EXPORT_SYMBOL_GPL(nft_unregister_expr);
 
-static const struct nft_expr_type *__nft_expr_type_get(struct nlattr *nla)
+static const struct nft_expr_type *__nft_expr_type_get(u8 family,
+						       struct nlattr *nla)
 {
 	const struct nft_expr_type *type;
 
 	list_for_each_entry(type, &nf_tables_expressions, list) {
-		if (!nla_strcmp(nla, type->name))
+		if (!nla_strcmp(nla, type->name) &&
+		    (!type->family || type->family == family))
 			return type;
 	}
 	return NULL;
 }
 
-static const struct nft_expr_type *nft_expr_type_get(struct nlattr *nla)
+static const struct nft_expr_type *nft_expr_type_get(u8 family,
+						     struct nlattr *nla)
 {
 	const struct nft_expr_type *type;
 
 	if (nla == NULL)
 		return ERR_PTR(-EINVAL);
 
-	type = __nft_expr_type_get(nla);
+	type = __nft_expr_type_get(family, nla);
 	if (type != NULL && try_module_get(type->owner))
 		return type;
 
 #ifdef CONFIG_MODULES
 	if (type == NULL) {
 		nfnl_unlock(NFNL_SUBSYS_NFTABLES);
+		request_module("nft-expr-%u-%.*s", family,
+			       nla_len(nla), (char *)nla_data(nla));
+		nfnl_lock(NFNL_SUBSYS_NFTABLES);
+		if (__nft_expr_type_get(family, nla))
+			return ERR_PTR(-EAGAIN);
+
+		nfnl_unlock(NFNL_SUBSYS_NFTABLES);
 		request_module("nft-expr-%.*s",
 			       nla_len(nla), (char *)nla_data(nla));
 		nfnl_lock(NFNL_SUBSYS_NFTABLES);
-		if (__nft_expr_type_get(nla))
+		if (__nft_expr_type_get(family, nla))
 			return ERR_PTR(-EAGAIN);
 	}
 #endif
@@ -1193,7 +1203,7 @@ static int nf_tables_expr_parse(const struct nft_ctx *ctx,
 	if (err < 0)
 		return err;
 
-	type = nft_expr_type_get(tb[NFTA_EXPR_NAME]);
+	type = nft_expr_type_get(ctx->afi->family, tb[NFTA_EXPR_NAME]);
 	if (IS_ERR(type))
 		return PTR_ERR(type);
 
-- 
1.7.10.4


^ permalink raw reply related

* [PATCH 11/18] netfilter: nft_reject: split up reject module into IPv4 and IPv6 specifc parts
From: Pablo Neira Ayuso @ 2014-02-07 17:41 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1391794904-4017-1-git-send-email-pablo@netfilter.org>

From: Patrick McHardy <kaber@trash.net>

Currently the nft_reject module depends on symbols from ipv6. This is
wrong since no generic module should force IPv6 support to be loaded.
Split up the module into AF-specific and a generic part.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nft_reject.h   |   17 +++++++
 net/ipv4/netfilter/Kconfig           |    5 ++
 net/ipv4/netfilter/Makefile          |    1 +
 net/ipv4/netfilter/nft_reject_ipv4.c |   74 ++++++++++++++++++++++++++++
 net/ipv6/netfilter/Kconfig           |    5 ++
 net/ipv6/netfilter/Makefile          |    1 +
 net/ipv6/netfilter/nft_reject_ipv6.c |   75 ++++++++++++++++++++++++++++
 net/netfilter/Kconfig                |    1 -
 net/netfilter/nft_reject.c           |   89 ++++------------------------------
 9 files changed, 187 insertions(+), 81 deletions(-)
 create mode 100644 include/net/netfilter/nft_reject.h
 create mode 100644 net/ipv4/netfilter/nft_reject_ipv4.c
 create mode 100644 net/ipv6/netfilter/nft_reject_ipv6.c

diff --git a/include/net/netfilter/nft_reject.h b/include/net/netfilter/nft_reject.h
new file mode 100644
index 0000000..ecda759
--- /dev/null
+++ b/include/net/netfilter/nft_reject.h
@@ -0,0 +1,17 @@
+#ifndef _NFT_REJECT_H_
+#define _NFT_REJECT_H_
+
+struct nft_reject {
+	enum nft_reject_types	type:8;
+	u8			icmp_code;
+};
+
+extern const struct nla_policy nft_reject_policy[];
+
+int nft_reject_init(const struct nft_ctx *ctx,
+		    const struct nft_expr *expr,
+		    const struct nlattr * const tb[]);
+
+int nft_reject_dump(struct sk_buff *skb, const struct nft_expr *expr);
+
+#endif
diff --git a/net/ipv4/netfilter/Kconfig b/net/ipv4/netfilter/Kconfig
index 81c6910..a26ce03 100644
--- a/net/ipv4/netfilter/Kconfig
+++ b/net/ipv4/netfilter/Kconfig
@@ -61,6 +61,11 @@ config NFT_CHAIN_NAT_IPV4
 	  packet transformations such as the source, destination address and
 	  source and destination ports.
 
+config NFT_REJECT_IPV4
+	depends on NF_TABLES_IPV4
+	default NFT_REJECT
+	tristate
+
 config NF_TABLES_ARP
 	depends on NF_TABLES
 	tristate "ARP nf_tables support"
diff --git a/net/ipv4/netfilter/Makefile b/net/ipv4/netfilter/Makefile
index c16be9d..90b8240 100644
--- a/net/ipv4/netfilter/Makefile
+++ b/net/ipv4/netfilter/Makefile
@@ -30,6 +30,7 @@ obj-$(CONFIG_NF_NAT_PROTO_GRE) += nf_nat_proto_gre.o
 obj-$(CONFIG_NF_TABLES_IPV4) += nf_tables_ipv4.o
 obj-$(CONFIG_NFT_CHAIN_ROUTE_IPV4) += nft_chain_route_ipv4.o
 obj-$(CONFIG_NFT_CHAIN_NAT_IPV4) += nft_chain_nat_ipv4.o
+obj-$(CONFIG_NFT_REJECT_IPV4) += nft_reject_ipv4.o
 obj-$(CONFIG_NF_TABLES_ARP) += nf_tables_arp.o
 
 # generic IP tables 
diff --git a/net/ipv4/netfilter/nft_reject_ipv4.c b/net/ipv4/netfilter/nft_reject_ipv4.c
new file mode 100644
index 0000000..e935d8d
--- /dev/null
+++ b/net/ipv4/netfilter/nft_reject_ipv4.c
@@ -0,0 +1,74 @@
+/*
+ * Copyright (c) 2008-2009 Patrick McHardy <kaber@trash.net>
+ * Copyright (c) 2013 Eric Leblond <eric@regit.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Development of this code funded by Astaro AG (http://www.astaro.com/)
+ */
+
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/netlink.h>
+#include <linux/netfilter.h>
+#include <linux/netfilter/nf_tables.h>
+#include <net/netfilter/nf_tables.h>
+#include <net/icmp.h>
+#include <net/netfilter/ipv4/nf_reject.h>
+#include <net/netfilter/nft_reject.h>
+
+static void nft_reject_ipv4_eval(const struct nft_expr *expr,
+				 struct nft_data data[NFT_REG_MAX + 1],
+				 const struct nft_pktinfo *pkt)
+{
+	struct nft_reject *priv = nft_expr_priv(expr);
+
+	switch (priv->type) {
+	case NFT_REJECT_ICMP_UNREACH:
+		nf_send_unreach(pkt->skb, priv->icmp_code);
+		break;
+	case NFT_REJECT_TCP_RST:
+		nf_send_reset(pkt->skb, pkt->ops->hooknum);
+		break;
+	}
+
+	data[NFT_REG_VERDICT].verdict = NF_DROP;
+}
+
+static struct nft_expr_type nft_reject_ipv4_type;
+static const struct nft_expr_ops nft_reject_ipv4_ops = {
+	.type		= &nft_reject_ipv4_type,
+	.size		= NFT_EXPR_SIZE(sizeof(struct nft_reject)),
+	.eval		= nft_reject_ipv4_eval,
+	.init		= nft_reject_init,
+	.dump		= nft_reject_dump,
+};
+
+static struct nft_expr_type nft_reject_ipv4_type __read_mostly = {
+	.family		= NFPROTO_IPV4,
+	.name		= "reject",
+	.ops		= &nft_reject_ipv4_ops,
+	.policy		= nft_reject_policy,
+	.maxattr	= NFTA_REJECT_MAX,
+	.owner		= THIS_MODULE,
+};
+
+static int __init nft_reject_ipv4_module_init(void)
+{
+	return nft_register_expr(&nft_reject_ipv4_type);
+}
+
+static void __exit nft_reject_ipv4_module_exit(void)
+{
+	nft_unregister_expr(&nft_reject_ipv4_type);
+}
+
+module_init(nft_reject_ipv4_module_init);
+module_exit(nft_reject_ipv4_module_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Patrick McHardy <kaber@trash.net>");
+MODULE_ALIAS_NFT_AF_EXPR(AF_INET, "reject");
diff --git a/net/ipv6/netfilter/Kconfig b/net/ipv6/netfilter/Kconfig
index 35750df..4bff1f2 100644
--- a/net/ipv6/netfilter/Kconfig
+++ b/net/ipv6/netfilter/Kconfig
@@ -50,6 +50,11 @@ config NFT_CHAIN_NAT_IPV6
 	  packet transformations such as the source, destination address and
 	  source and destination ports.
 
+config NFT_REJECT_IPV6
+	depends on NF_TABLES_IPV6
+	default NFT_REJECT
+	tristate
+
 config IP6_NF_IPTABLES
 	tristate "IP6 tables support (required for filtering)"
 	depends on INET && IPV6
diff --git a/net/ipv6/netfilter/Makefile b/net/ipv6/netfilter/Makefile
index d1b4928..70d3dd6 100644
--- a/net/ipv6/netfilter/Makefile
+++ b/net/ipv6/netfilter/Makefile
@@ -27,6 +27,7 @@ obj-$(CONFIG_NF_DEFRAG_IPV6) += nf_defrag_ipv6.o
 obj-$(CONFIG_NF_TABLES_IPV6) += nf_tables_ipv6.o
 obj-$(CONFIG_NFT_CHAIN_ROUTE_IPV6) += nft_chain_route_ipv6.o
 obj-$(CONFIG_NFT_CHAIN_NAT_IPV6) += nft_chain_nat_ipv6.o
+obj-$(CONFIG_NFT_REJECT_IPV6) += nft_reject_ipv6.o
 
 # matches
 obj-$(CONFIG_IP6_NF_MATCH_AH) += ip6t_ah.o
diff --git a/net/ipv6/netfilter/nft_reject_ipv6.c b/net/ipv6/netfilter/nft_reject_ipv6.c
new file mode 100644
index 0000000..f732859
--- /dev/null
+++ b/net/ipv6/netfilter/nft_reject_ipv6.c
@@ -0,0 +1,75 @@
+/*
+ * Copyright (c) 2008-2009 Patrick McHardy <kaber@trash.net>
+ * Copyright (c) 2013 Eric Leblond <eric@regit.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Development of this code funded by Astaro AG (http://www.astaro.com/)
+ */
+
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/netlink.h>
+#include <linux/netfilter.h>
+#include <linux/netfilter/nf_tables.h>
+#include <net/netfilter/nf_tables.h>
+#include <net/netfilter/nft_reject.h>
+#include <net/netfilter/ipv6/nf_reject.h>
+
+static void nft_reject_ipv6_eval(const struct nft_expr *expr,
+				 struct nft_data data[NFT_REG_MAX + 1],
+				 const struct nft_pktinfo *pkt)
+{
+	struct nft_reject *priv = nft_expr_priv(expr);
+	struct net *net = dev_net((pkt->in != NULL) ? pkt->in : pkt->out);
+
+	switch (priv->type) {
+	case NFT_REJECT_ICMP_UNREACH:
+		nf_send_unreach6(net, pkt->skb, priv->icmp_code,
+				 pkt->ops->hooknum);
+		break;
+	case NFT_REJECT_TCP_RST:
+		nf_send_reset6(net, pkt->skb, pkt->ops->hooknum);
+		break;
+	}
+
+	data[NFT_REG_VERDICT].verdict = NF_DROP;
+}
+
+static struct nft_expr_type nft_reject_ipv6_type;
+static const struct nft_expr_ops nft_reject_ipv6_ops = {
+	.type		= &nft_reject_ipv6_type,
+	.size		= NFT_EXPR_SIZE(sizeof(struct nft_reject)),
+	.eval		= nft_reject_ipv6_eval,
+	.init		= nft_reject_init,
+	.dump		= nft_reject_dump,
+};
+
+static struct nft_expr_type nft_reject_ipv6_type __read_mostly = {
+	.family		= NFPROTO_IPV6,
+	.name		= "reject",
+	.ops		= &nft_reject_ipv6_ops,
+	.policy		= nft_reject_policy,
+	.maxattr	= NFTA_REJECT_MAX,
+	.owner		= THIS_MODULE,
+};
+
+static int __init nft_reject_ipv6_module_init(void)
+{
+	return nft_register_expr(&nft_reject_ipv6_type);
+}
+
+static void __exit nft_reject_ipv6_module_exit(void)
+{
+	nft_unregister_expr(&nft_reject_ipv6_type);
+}
+
+module_init(nft_reject_ipv6_module_init);
+module_exit(nft_reject_ipv6_module_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Patrick McHardy <kaber@trash.net>");
+MODULE_ALIAS_NFT_AF_EXPR(AF_INET6, "reject");
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index c374675..ed8b50e 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -513,7 +513,6 @@ config NFT_QUEUE
 
 config NFT_REJECT
 	depends on NF_TABLES
-	depends on NF_TABLES_IPV6 || !NF_TABLES_IPV6
 	default m if NETFILTER_ADVANCED=n
 	tristate "Netfilter nf_tables reject support"
 	help
diff --git a/net/netfilter/nft_reject.c b/net/netfilter/nft_reject.c
index 5e204711..f3448c2 100644
--- a/net/netfilter/nft_reject.c
+++ b/net/netfilter/nft_reject.c
@@ -16,65 +16,23 @@
 #include <linux/netfilter.h>
 #include <linux/netfilter/nf_tables.h>
 #include <net/netfilter/nf_tables.h>
-#include <net/icmp.h>
-#include <net/netfilter/ipv4/nf_reject.h>
+#include <net/netfilter/nft_reject.h>
 
-#if IS_ENABLED(CONFIG_NF_TABLES_IPV6)
-#include <net/netfilter/ipv6/nf_reject.h>
-#endif
-
-struct nft_reject {
-	enum nft_reject_types	type:8;
-	u8			icmp_code;
-	u8			family;
-};
-
-static void nft_reject_eval(const struct nft_expr *expr,
-			      struct nft_data data[NFT_REG_MAX + 1],
-			      const struct nft_pktinfo *pkt)
-{
-	struct nft_reject *priv = nft_expr_priv(expr);
-#if IS_ENABLED(CONFIG_NF_TABLES_IPV6)
-	struct net *net = dev_net((pkt->in != NULL) ? pkt->in : pkt->out);
-#endif
-	switch (priv->type) {
-	case NFT_REJECT_ICMP_UNREACH:
-		if (priv->family == NFPROTO_IPV4)
-			nf_send_unreach(pkt->skb, priv->icmp_code);
-#if IS_ENABLED(CONFIG_NF_TABLES_IPV6)
-		else if (priv->family == NFPROTO_IPV6)
-			nf_send_unreach6(net, pkt->skb, priv->icmp_code,
-				      pkt->ops->hooknum);
-#endif
-		break;
-	case NFT_REJECT_TCP_RST:
-		if (priv->family == NFPROTO_IPV4)
-			nf_send_reset(pkt->skb, pkt->ops->hooknum);
-#if IS_ENABLED(CONFIG_NF_TABLES_IPV6)
-		else if (priv->family == NFPROTO_IPV6)
-			nf_send_reset6(net, pkt->skb, pkt->ops->hooknum);
-#endif
-		break;
-	}
-
-	data[NFT_REG_VERDICT].verdict = NF_DROP;
-}
-
-static const struct nla_policy nft_reject_policy[NFTA_REJECT_MAX + 1] = {
+const struct nla_policy nft_reject_policy[NFTA_REJECT_MAX + 1] = {
 	[NFTA_REJECT_TYPE]		= { .type = NLA_U32 },
 	[NFTA_REJECT_ICMP_CODE]		= { .type = NLA_U8 },
 };
+EXPORT_SYMBOL_GPL(nft_reject_policy);
 
-static int nft_reject_init(const struct nft_ctx *ctx,
-			   const struct nft_expr *expr,
-			   const struct nlattr * const tb[])
+int nft_reject_init(const struct nft_ctx *ctx,
+		    const struct nft_expr *expr,
+		    const struct nlattr * const tb[])
 {
 	struct nft_reject *priv = nft_expr_priv(expr);
 
 	if (tb[NFTA_REJECT_TYPE] == NULL)
 		return -EINVAL;
 
-	priv->family = ctx->afi->family;
 	priv->type = ntohl(nla_get_be32(tb[NFTA_REJECT_TYPE]));
 	switch (priv->type) {
 	case NFT_REJECT_ICMP_UNREACH:
@@ -89,8 +47,9 @@ static int nft_reject_init(const struct nft_ctx *ctx,
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(nft_reject_init);
 
-static int nft_reject_dump(struct sk_buff *skb, const struct nft_expr *expr)
+int nft_reject_dump(struct sk_buff *skb, const struct nft_expr *expr)
 {
 	const struct nft_reject *priv = nft_expr_priv(expr);
 
@@ -109,37 +68,7 @@ static int nft_reject_dump(struct sk_buff *skb, const struct nft_expr *expr)
 nla_put_failure:
 	return -1;
 }
-
-static struct nft_expr_type nft_reject_type;
-static const struct nft_expr_ops nft_reject_ops = {
-	.type		= &nft_reject_type,
-	.size		= NFT_EXPR_SIZE(sizeof(struct nft_reject)),
-	.eval		= nft_reject_eval,
-	.init		= nft_reject_init,
-	.dump		= nft_reject_dump,
-};
-
-static struct nft_expr_type nft_reject_type __read_mostly = {
-	.name		= "reject",
-	.ops		= &nft_reject_ops,
-	.policy		= nft_reject_policy,
-	.maxattr	= NFTA_REJECT_MAX,
-	.owner		= THIS_MODULE,
-};
-
-static int __init nft_reject_module_init(void)
-{
-	return nft_register_expr(&nft_reject_type);
-}
-
-static void __exit nft_reject_module_exit(void)
-{
-	nft_unregister_expr(&nft_reject_type);
-}
-
-module_init(nft_reject_module_init);
-module_exit(nft_reject_module_exit);
+EXPORT_SYMBOL_GPL(nft_reject_dump);
 
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Patrick McHardy <kaber@trash.net>");
-MODULE_ALIAS_NFT_EXPR("reject");
-- 
1.7.10.4


^ permalink raw reply related

* [PATCH 15/18] netfilter: nf_tables: do not allow NFT_SET_ELEM_INTERVAL_END flag and data
From: Pablo Neira Ayuso @ 2014-02-07 17:41 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1391794904-4017-1-git-send-email-pablo@netfilter.org>

This combination is not allowed since end interval elements cannot
contain data.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Patrick McHardy <kaber@trash.net>
---
 net/netfilter/nf_tables_api.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 3a2e480..d0c790e3e 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -2741,6 +2741,9 @@ static int nft_add_set_elem(const struct nft_ctx *ctx, struct nft_set *set,
 		if (nla[NFTA_SET_ELEM_DATA] == NULL &&
 		    !(elem.flags & NFT_SET_ELEM_INTERVAL_END))
 			return -EINVAL;
+		if (nla[NFTA_SET_ELEM_DATA] != NULL &&
+		    elem.flags & NFT_SET_ELEM_INTERVAL_END)
+			return -EINVAL;
 	} else {
 		if (nla[NFTA_SET_ELEM_DATA] != NULL)
 			return -EINVAL;
-- 
1.7.10.4


^ permalink raw reply related

* [PATCH 17/18] netfilter: nf_tables: fix loop checking with end interval elements
From: Pablo Neira Ayuso @ 2014-02-07 17:41 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1391794904-4017-1-git-send-email-pablo@netfilter.org>

Fix access to uninitialized data for end interval elements. The
element data part is uninitialized in interval end elements.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_tables_api.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index d0c790e3e..adce01e 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -2998,6 +2998,9 @@ static int nf_tables_loop_check_setelem(const struct nft_ctx *ctx,
 					const struct nft_set_iter *iter,
 					const struct nft_set_elem *elem)
 {
+	if (elem->flags & NFT_SET_ELEM_INTERVAL_END)
+		return 0;
+
 	switch (elem->data.verdict) {
 	case NFT_JUMP:
 	case NFT_GOTO:
-- 
1.7.10.4


^ permalink raw reply related

* [PATCH 18/18] netfilter: nf_tables: unininline nft_trace_packet()
From: Pablo Neira Ayuso @ 2014-02-07 17:41 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1391794904-4017-1-git-send-email-pablo@netfilter.org>

From: Patrick McHardy <kaber@trash.net>

It makes no sense to inline a rarely used function meant for debugging
only that is called a total of five times in the main evaluation loop.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_tables_core.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/netfilter/nf_tables_core.c b/net/netfilter/nf_tables_core.c
index 0d879fc..90998a6 100644
--- a/net/netfilter/nf_tables_core.c
+++ b/net/netfilter/nf_tables_core.c
@@ -103,9 +103,9 @@ static struct nf_loginfo trace_loginfo = {
 	},
 };
 
-static inline void nft_trace_packet(const struct nft_pktinfo *pkt,
-				    const struct nft_chain *chain,
-				    int rulenum, enum nft_trace type)
+static void nft_trace_packet(const struct nft_pktinfo *pkt,
+			     const struct nft_chain *chain,
+			     int rulenum, enum nft_trace type)
 {
 	struct net *net = dev_net(pkt->in ? pkt->in : pkt->out);
 
-- 
1.7.10.4


^ permalink raw reply related

* Re: [PATCH 19/26] drivers: isdn: Move prototype declaration to header file platform.h from diva_didd.c
From: Josh Triplett @ 2014-02-07 17:41 UTC (permalink / raw)
  To: David Laight
  Cc: 'Rashika Kheria', linux-kernel@vger.kernel.org,
	Armin Schindler, Karsten Keil, netdev@vger.kernel.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D0F6BAC1C@AcuExch.aculab.com>

On Fri, Feb 07, 2014 at 05:22:58PM +0000, David Laight wrote:
> From: Josh Triplett
> > On Fri, Feb 07, 2014 at 01:33:46PM +0000, David Laight wrote:
> > > From: Rashika Kheria
> > > > Move prototype declarations of function to header file
> > > > hardware/eicon/platform.h because they are used by more than one file.
> > > >
> > > > This eliminates the following warnings in hardware/eicon/diddfunc.c:
> > > > drivers/isdn/hardware/eicon/diddfunc.c:95:12: warning: no previous prototype for diddfunc_init [-
> > > > Wmissing-prototypes]
> > > > drivers/isdn/hardware/eicon/diddfunc.c:110:13: warning: no previous prototype for diddfunc_finit
> > [-
> > > > Wmissing-prototypes]
> > > ...
> > > > diff --git a/drivers/isdn/hardware/eicon/diva_didd.c b/drivers/isdn/hardware/eicon/diva_didd.c
> > > > index fab6ccf..56d32a7 100644
> > > > --- a/drivers/isdn/hardware/eicon/diva_didd.c
> > > > +++ b/drivers/isdn/hardware/eicon/diva_didd.c
> > > > @@ -39,9 +39,6 @@ MODULE_LICENSE("GPL");
> > > >  #define DBG_MINIMUM  (DL_LOG + DL_FTL + DL_ERR)
> > > >  #define DBG_DEFAULT  (DBG_MINIMUM + DL_XLOG + DL_REG)
> > > >
> > > > -extern int diddfunc_init(void);
> > > > -extern void diddfunc_finit(void);
> > > > -
> > > >  extern void DIVA_DIDD_Read(void *, int);
> > >
> > > You should move that one as well.
> > > There really shouldn't be 'extern' definitions for any function in
> > > any C files since you want the compiler to check they are correct
> > > when the function itself is compiled.
> > 
> > Absolutely, but as far as I can tell Rashika is doing this
> > incrementally, organized more by header than by source file, so I'd
> > expect a few externs in a source file to disappear at a time rather than
> > all in one patch.
> 
> Unless any actual bugs are found, I'd have thought a single patch for
> each driver would be enough, maybe even one for the whole lot - depending
> on how they are maintained.
> The 26 patches already posted are a little excessive.

These types of patches often seem to generate a non-trivial amount of
feedback (for instance, due to driver-specific organizational issues),
and breaking them up by groups of warnings has tended to avoid excessive
churn and review difficulty on a larger patch.  Certainly as a reviewer
unfamiliar with isdn, I found this patch series far easier to review
than a larger patch would have been.

- Josh Triplett

^ permalink raw reply

* [PATCH 00/18] Netfilter/nftables/IPVS fixes for net
From: Pablo Neira Ayuso @ 2014-02-07 17:41 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

Hi David,

The following patchset contains Netfilter/IPVS fixes, mostly nftables
fixes, most relevantly they are:

* Fix a crash in the h323 conntrack NAT helper due to expectation list
  corruption, from Alexey Dobriyan.

* A couple of RCU race fixes for conntrack, one manifests by hitting BUG_ON
  in nf_nat_setup_info() and the destroy path, patches from Andrey Vagin and
  me.

* Dump direction attribute in nft_ct only if it is set, from Arturo
  Borrero.

* Fix IPVS bug in its own connection tracking system that may lead to
  copying only 4 bytes of the IPv6 address when initializing the
  ip_vs_conn object, from Michal Kubecek.

* Fix -EBUSY errors in nftables when deleting the rules, chain and tables
  in a row due mixture of asynchronous and synchronous object releasing,
  from me.

* Three fixes for the nf_tables set infrastructure when using intervals and
  mappings, from me.

* Four patches to fixing the nf_tables log, reject and ct expressions from
  the new inet table, from Patrick McHardy.

* Fix memory overrun in the map that is used to dynamically allocate names
  from anonymous sets, also from Patrick.

* Fix a potential oops if you dump a set with NFPROTO_UNSPEC and a table
  name, from Patrick McHardy.

You can pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git master

Thanks!

----------------------------------------------------------------

The following changes since commit d922e1cb1ea17ac7f0a5c3c2be98d4bd80d055b8:

  net: Document promote_secondaries (2014-01-27 20:39:21 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git master

for you to fetch changes up to 6d8c00d58e9e484fdc41aaaf62e5d8364efe375a:

  netfilter: nf_tables: unininline nft_trace_packet() (2014-02-07 17:50:27 +0100)

----------------------------------------------------------------
Alexey Dobriyan (1):
      netfilter: nf_nat_h323: fix crash in nf_ct_unlink_expect_report()

Andrey Vagin (1):
      netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get

Arturo Borrero (1):
      netfilter: nft_ct: fix unconditional dump of 'dir' attr

Michal Kubecek (1):
      ipvs: fix AF assignment in ip_vs_conn_new()

Pablo Neira Ayuso (5):
      netfilter: nf_conntrack: don't release a conntrack with non-zero refcnt
      netfilter: nf_tables: fix racy rule deletion
      netfilter: nf_tables: do not allow NFT_SET_ELEM_INTERVAL_END flag and data
      netfilter: nft_rbtree: fix data handling of end interval elements
      netfilter: nf_tables: fix loop checking with end interval elements

Patrick McHardy (9):
      netfilter: nf_tables: fix oops when deleting a chain with references
      netfilter: nf_tables: fix overrun in nf_tables_set_alloc_name()
      netfilter: nf_tables: fix potential oops when dumping sets
      netfilter: nft_ct: fix missing NFT_CT_L3PROTOCOL key in validity checks
      netfilter: nf_tables: add AF specific expression support
      netfilter: nft_reject: split up reject module into IPv4 and IPv6 specifc parts
      netfilter: nf_tables: add reject module for NFPROTO_INET
      netfilter: nf_tables: fix log/queue expressions for NFPROTO_INET
      netfilter: nf_tables: unininline nft_trace_packet()

 include/net/netfilter/nf_conntrack.h |    2 +
 include/net/netfilter/nf_tables.h    |    9 ++--
 include/net/netfilter/nft_reject.h   |   25 ++++++++++
 net/ipv4/netfilter/Kconfig           |    5 ++
 net/ipv4/netfilter/Makefile          |    1 +
 net/ipv4/netfilter/nf_nat_h323.c     |    5 +-
 net/ipv4/netfilter/nft_reject_ipv4.c |   75 ++++++++++++++++++++++++++++
 net/ipv6/netfilter/Kconfig           |    5 ++
 net/ipv6/netfilter/Makefile          |    1 +
 net/ipv6/netfilter/nft_reject_ipv6.c |   76 +++++++++++++++++++++++++++++
 net/netfilter/Kconfig                |    6 ++-
 net/netfilter/Makefile               |    1 +
 net/netfilter/ipvs/ip_vs_conn.c      |    8 +--
 net/netfilter/nf_conntrack_core.c    |   55 +++++++++++++++++----
 net/netfilter/nf_synproxy_core.c     |    5 +-
 net/netfilter/nf_tables_api.c        |   82 ++++++++++++++++++++-----------
 net/netfilter/nf_tables_core.c       |    6 +--
 net/netfilter/nft_ct.c               |   16 +++++-
 net/netfilter/nft_log.c              |    5 +-
 net/netfilter/nft_queue.c            |    4 +-
 net/netfilter/nft_rbtree.c           |   16 ++++--
 net/netfilter/nft_reject.c           |   89 ++++------------------------------
 net/netfilter/nft_reject_inet.c      |   63 ++++++++++++++++++++++++
 net/netfilter/xt_CT.c                |    7 +--
 24 files changed, 413 insertions(+), 154 deletions(-)
 create mode 100644 include/net/netfilter/nft_reject.h
 create mode 100644 net/ipv4/netfilter/nft_reject_ipv4.c
 create mode 100644 net/ipv6/netfilter/nft_reject_ipv6.c
 create mode 100644 net/netfilter/nft_reject_inet.c

^ permalink raw reply

* [PATCH 07/18] netfilter: nf_tables: fix overrun in nf_tables_set_alloc_name()
From: Pablo Neira Ayuso @ 2014-02-07 17:41 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1391794904-4017-1-git-send-email-pablo@netfilter.org>

From: Patrick McHardy <kaber@trash.net>

The map that is used to allocate anonymous sets is indeed
BITS_PER_BYTE * PAGE_SIZE long.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_tables_api.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 9ce3053..2a22a18 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -1989,13 +1989,13 @@ static int nf_tables_set_alloc_name(struct nft_ctx *ctx, struct nft_set *set,
 
 			if (!sscanf(i->name, name, &tmp))
 				continue;
-			if (tmp < 0 || tmp > BITS_PER_LONG * PAGE_SIZE)
+			if (tmp < 0 || tmp >= BITS_PER_BYTE * PAGE_SIZE)
 				continue;
 
 			set_bit(tmp, inuse);
 		}
 
-		n = find_first_zero_bit(inuse, BITS_PER_LONG * PAGE_SIZE);
+		n = find_first_zero_bit(inuse, BITS_PER_BYTE * PAGE_SIZE);
 		free_page((unsigned long)inuse);
 	}
 
-- 
1.7.10.4

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox