Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next 02/13] phy: add the mvebu cp110 comphy driver
From: Antoine Tenart @ 2017-08-24 13:44 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Antoine Tenart, davem, kishon, jason, sebastian.hesselbarth,
	gregory.clement, thomas.petazzoni, nadavh, linux, linux-kernel,
	mw, stefanc, miquel.raynal, netdev
In-Reply-To: <20170824133922.GC8022@lunn.ch>

[-- Attachment #1: Type: text/plain, Size: 1685 bytes --]

Hi Andrew,

On Thu, Aug 24, 2017 at 03:39:22PM +0200, Andrew Lunn wrote:
> > +static const struct mvebu_comhy_conf mvebu_comphy_modes[] = {
> > +	/* lane 0 */
> > +	MVEBU_COMPHY_CONF(0, 1, PHY_MODE_SGMII, 0x1),
> > +	/* lane 1 */
> > +	MVEBU_COMPHY_CONF(1, 2, PHY_MODE_SGMII, 0x1),
> > +	/* lane 2 */
> > +	MVEBU_COMPHY_CONF(2, 0, PHY_MODE_SGMII, 0x1),
> > +	MVEBU_COMPHY_CONF(2, 0, PHY_MODE_10GKR, 0x1),
> > +	/* lane 3 */
> > +	MVEBU_COMPHY_CONF(3, 1, PHY_MODE_SGMII, 0x2),
> > +	/* lane 4 */
> > +	MVEBU_COMPHY_CONF(4, 0, PHY_MODE_SGMII, 0x2),
> > +	MVEBU_COMPHY_CONF(4, 0, PHY_MODE_10GKR, 0x2),
> > +	MVEBU_COMPHY_CONF(4, 1, PHY_MODE_SGMII, 0x1),
> > +	/* lane 5 */
> > +	MVEBU_COMPHY_CONF(5, 2, PHY_MODE_SGMII, 0x1),
> > +};
> 
> Do other Marvell SoCs re-use this IP? Maybe add cp110 to the name here
> to indicate what SoC this configuration belongs to? I guess at some
> point, the compatible string will be used to select the correct table
> for the hardware variant.

OK, I'll rename the variable mvebu_comphy_cp110_modes.

> > +static const struct of_device_id mvebu_comphy_of_match_table[] = {
> > +	{ .compatible = "marvell,comphy-cp110" },
> 
> Is that specific enough? It seems like this table is easy to change in
> the VHDL. Could there be another cp110 with a different configuration?

As far as I understand CP110 is the name of the CP, should there be
another one it should be named differently. But I can't be 100% sure,
you never know what comes next :)

How would you name it if not "comphy-cp110"?

Thanks!
Antoine

-- 
Antoine Ténart, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH net-next 02/13] phy: add the mvebu cp110 comphy driver
From: Andrew Lunn @ 2017-08-24 13:45 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: davem, kishon, jason, sebastian.hesselbarth, gregory.clement,
	thomas.petazzoni, nadavh, linux, linux-kernel, mw, stefanc,
	miquel.raynal, netdev
In-Reply-To: <20170824083823.16826-3-antoine.tenart@free-electrons.com>

> +	for_each_available_child_of_node(pdev->dev.of_node, child) {
> +		struct mvebu_comphy_lane *lane;
> +		struct phy *phy;
> +		int ret;
> +		u32 val;
> +
> +		ret = of_property_read_u32(child, "reg", &val);
> +		if (ret < 0) {
> +			dev_err(&pdev->dev, "missing 'reg' property (%d)\n",
> +				ret);
> +			continue;
> +		}

I'm just wondering why we need this. We know how many lanes there are
from the table. So just create a generic PHY for each lane?

     Andrew

^ permalink raw reply

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
From: Michael S. Tsirkin @ 2017-08-24 13:50 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Network Development, Koichiro Den, virtualization
In-Reply-To: <CAF=yD-KSek+LmZu0X0TXetmFEQ59iF3NpjZ4KugbwLo1BGfhaA@mail.gmail.com>

On Wed, Aug 23, 2017 at 11:28:24PM -0400, Willem de Bruijn wrote:
> >> > * as a generic solution, if we were to somehow overcome the safety issue, track
> >> > the delay and do copy if some threshold is reached could be an answer, but it's
> >> > hard for now.> * so things like the current vhost-net implementation of deciding whether or not
> >> > to do zerocopy beforehand referring the zerocopy tx error ratio is a point of
> >> > practical compromise.
> >>
> >> The fragility of this mechanism is another argument for switching to tx napi
> >> as default.
> >>
> >> Is there any more data about the windows guest issues when completions
> >> are not queued within a reasonable timeframe? What is this timescale and
> >> do we really need to work around this.
> >
> > I think it's pretty large, many milliseconds.
> >
> > But I wonder what do you mean by "work around". Using buffers within
> > limited time frame sounds like a reasonable requirement to me.
> 
> Vhost-net zerocopy delays completions until the skb is really
> sent.

This is fundamental in any solution. Guest/application can not
write over a memory buffer as long as hardware might be reading it.

> Traffic shaping can introduce msec timescale latencies.
> 
> The delay may actually be a useful signal. If the guest does not
> orphan skbs early, TSQ will throttle the socket causing host
> queue build up.
> 
> But, if completions are queued in-order, unrelated flows may be
> throttled as well. Allowing out of order completions would resolve
> this HoL blocking.

We can allow out of order, no guests that follow virtio spec
will break. But this won't help in all cases
- a single slow flow can occupy the whole ring, you will not
  be able to make any new buffers available for the fast flow
- what host considers a single flow can be multiple flows for guest

There are many other examples.

> > Neither
> > do I see why would using tx interrupts within guest be a work around -
> > AFAIK windows driver uses tx interrupts.
> 
> It does not address completion latency itself. What I meant was
> that in an interrupt-driver model, additional starvation issues,
> such as the potential deadlock raised at the start of this thread,
> or the timer delay observed before packets were orphaned in
> virtio-net in commit b0c39dbdc204, are mitigated.
> 
> Specifically, it breaks the potential deadlock where sockets are
> blocked waiting for completions (to free up budget in sndbuf, tsq, ..),
> yet completion handling is blocked waiting for a new packet to
> trigger free_old_xmit_skbs from start_xmit.

This talk of potential deadlock confuses me - I think you mean we would
deadlock if we did not orphan skbs in !use_napi - is that right?  If you
mean that you can drop skb orphan and this won't lead to a deadlock if
free skbs upon a tx interrupt, I agree, for sure.

> >> That is the only thing keeping us from removing the HoL blocking in vhost-net zerocopy.
> >
> > We don't enable network watchdog on virtio but we could and maybe
> > should.
> 
> Can you elaborate?

The issue is that holding onto buffers for very long times makes guests
think they are stuck. This is funamentally because from guest point of
view this is a NIC, so it is supposed to transmit things out in
a timely manner. If host backs the virtual NIC by something that is not
a NIC, with traffic shaping etc introducing unbounded latencies,
guest will be confused.

For example, we could set ndo_tx_timeout within guest. Then
if tx queue is stopped for too long, a watchdog would fire.

We worked around most of the issues by introducing guest/host
copy. This copy, done by vhost-net, allows us to pretend that
a not-nic backend (e.g. a qdisc) is a nic (virtio-net).
This way you can both do traffic shaping in host with
unbounded latencies and limit latency from guest point of view.

Cost is both data copies and loss of end to end credit accounting.

Changing Linux as a host to limit latencies while not doing copies will
not be an easy task but that's the only fix that comes to mind.

-- 
MST

^ permalink raw reply

* Re: [PATCH net-next 02/13] phy: add the mvebu cp110 comphy driver
From: Andrew Lunn @ 2017-08-24 13:51 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: davem, kishon, jason, sebastian.hesselbarth, gregory.clement,
	thomas.petazzoni, nadavh, linux, linux-kernel, mw, stefanc,
	miquel.raynal, netdev
In-Reply-To: <20170824134444.GA28443@kwain>

Hi Antione.

> How would you name it if not "comphy-cp110"?

Good question...

'7000-cpmphy-cp110'
'8000-cpmphy-cp110'

??

	Andrew

^ permalink raw reply

* Re: Question about ip_defrag
From: Jesper Dangaard Brouer @ 2017-08-24 13:53 UTC (permalink / raw)
  To: liujian (CE)
  Cc: davem@davemloft.net, kuznet@ms2.inr.ac.ru,
	yoshfuji@linux-ipv6.org, elena.reshetova@intel.com,
	edumazet@google.com, netdev@vger.kernel.org, brouer
In-Reply-To: <4F88C5DDA1E80143B232E89585ACE27D018F07E2@DGGEMA502-MBX.china.huawei.com>


On Thu, 24 Aug 2017 13:15:33 +0000 "liujian (CE)" <liujian56@huawei.com> wrote:
> Hello,
> 
> With below patch we met one issue.
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.13-rc6&id=6d7b857d541e
> 
> the issue:
> Ip_defrag fail caused by frag_mem_limit reached 4M(frags.high_thresh).
> At this moment,sum_frag_mem_limit is about 10K.
> and my test machine's cpu num is 64.
> 
> Can i only change frag_mem_limit to sum_ frag_mem_limit?
> 
> 
> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> index 96e95e8..f09c00b 100644
> --- a/net/ipv4/inet_fragment.c
> +++ b/net/ipv4/inet_fragment.c
> @@ -120,7 +120,7 @@ static void inet_frag_secret_rebuild(struct inet_frags *f)
>  static bool inet_fragq_should_evict(const struct inet_frag_queue *q)
>  {
>         return q->net->low_thresh == 0 ||
> -              frag_mem_limit(q->net) >= q->net->low_thresh;
> +              sum_frag_mem_limit(q->net) >= q->net->low_thresh;
>  }
> 
>  static unsigned int
> @@ -355,7 +355,7 @@ static struct inet_frag_queue *inet_frag_alloc(struct netns_frags *nf,
>  {
>         struct inet_frag_queue *q;
> 
> -       if (!nf->high_thresh || frag_mem_limit(nf) > nf->high_thresh) {
> +       if (!nf->high_thresh || sum_frag_mem_limit(nf) > nf->high_thresh) {
>                 inet_frag_schedule_worker(f);
>                 return NULL;
>         }
> @@ -396,7 +396,7 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
>         struct inet_frag_queue *q;
>         int depth = 0;
> 
> -       if (frag_mem_limit(nf) > nf->low_thresh)
> +       if (sum_frag_mem_limit(nf) > nf->low_thresh)
>                 inet_frag_schedule_worker(f);
> 
>         hash &= (INETFRAGS_HASHSZ - 1);
> --
> 
> Thank you for your time.

What kernel version have you seen this issue with?

As far as I remember, this issue have been fixed before...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* RE: [PATCH net-next 02/13] phy: add the mvebu cp110 comphy driver
From: Stefan Chulski @ 2017-08-24 13:57 UTC (permalink / raw)
  To: Andrew Lunn, Antoine Tenart
  Cc: davem@davemloft.net, kishon@ti.com, jason@lakedaemon.net,
	sebastian.hesselbarth@gmail.com,
	gregory.clement@free-electrons.com,
	thomas.petazzoni@free-electrons.com, Nadav Haklai,
	linux@armlinux.org.uk, linux-kernel@vger.kernel.org,
	mw@semihalf.com, miquel.raynal@free-electrons.com,
	netdev@vger.kernel.org
In-Reply-To: <20170824135127.GE8022@lunn.ch>



> -----Original Message-----
> From: Andrew Lunn [mailto:andrew@lunn.ch]
> Sent: Thursday, August 24, 2017 4:51 PM
> To: Antoine Tenart <antoine.tenart@free-electrons.com>
> Cc: davem@davemloft.net; kishon@ti.com; jason@lakedaemon.net;
> sebastian.hesselbarth@gmail.com; gregory.clement@free-electrons.com;
> thomas.petazzoni@free-electrons.com; Nadav Haklai <nadavh@marvell.com>;
> linux@armlinux.org.uk; linux-kernel@vger.kernel.org; mw@semihalf.com;
> Stefan Chulski <stefanc@marvell.com>; miquel.raynal@free-electrons.com;
> netdev@vger.kernel.org
> Subject: Re: [PATCH net-next 02/13] phy: add the mvebu cp110 comphy driver
> 
> Hi Antione.
> 
> > How would you name it if not "comphy-cp110"?
> 
> Good question...
> 
> '7000-cpmphy-cp110'
> '8000-cpmphy-cp110'
> 
> ??
> 
> 	Andrew

A8K Marvell SoC has two South Bridge communication controllers
and A7K only one communication controllers, but its identical
communication controllers 110. Next generation will has another number 1xx.

Stefan,
Regards.

^ permalink raw reply

* Re: [PATCH net-next 02/13] phy: add the mvebu cp110 comphy driver
From: Antoine Tenart @ 2017-08-24 13:58 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Antoine Tenart, davem, kishon, jason, sebastian.hesselbarth,
	gregory.clement, thomas.petazzoni, nadavh, linux, linux-kernel,
	mw, stefanc, miquel.raynal, netdev
In-Reply-To: <20170824134504.GD8022@lunn.ch>

[-- Attachment #1: Type: text/plain, Size: 1033 bytes --]

Hi Andrew,

On Thu, Aug 24, 2017 at 03:45:04PM +0200, Andrew Lunn wrote:
> > +	for_each_available_child_of_node(pdev->dev.of_node, child) {
> > +		struct mvebu_comphy_lane *lane;
> > +		struct phy *phy;
> > +		int ret;
> > +		u32 val;
> > +
> > +		ret = of_property_read_u32(child, "reg", &val);
> > +		if (ret < 0) {
> > +			dev_err(&pdev->dev, "missing 'reg' property (%d)\n",
> > +				ret);
> > +			continue;
> > +		}
> 
> I'm just wondering why we need this. We know how many lanes there are
> from the table. So just create a generic PHY for each lane?

At first I did this statically. I moved to this solution because:
1. It represents the h/w correctly (there are 6 lanes duplicated here).
2. It eases the dt representation, we would have something like:
   <&cpm_comphy 0 1>; otherwise.
3. If the number of lanes changes in future revisions it'll be quite
   easy to handle.

Thanks!
Antoine

-- 
Antoine Ténart, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH net-next 12/13] arm64: dts: marvell: mcbin: add comphy references to Ethernet ports
From: Andrew Lunn @ 2017-08-24 13:58 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: davem, kishon, jason, sebastian.hesselbarth, gregory.clement,
	thomas.petazzoni, nadavh, linux, linux-kernel, mw, stefanc,
	miquel.raynal, netdev
In-Reply-To: <20170824083823.16826-13-antoine.tenart@free-electrons.com>

> @@ -189,6 +191,7 @@
>  	status = "okay";
>  	phy = <&ge_phy>;
>  	phy-mode = "sgmii";
> +	phys = <&cps_comphy0 1>;

Does the binding document describe the meaning of the specifier?

     Andrew

^ permalink raw reply

* Re: [PATCH net-next 02/13] phy: add the mvebu cp110 comphy driver
From: Antoine Tenart @ 2017-08-24 14:01 UTC (permalink / raw)
  To: Stefan Chulski
  Cc: Andrew Lunn, Antoine Tenart, davem@davemloft.net, kishon@ti.com,
	jason@lakedaemon.net, sebastian.hesselbarth@gmail.com,
	gregory.clement@free-electrons.com,
	thomas.petazzoni@free-electrons.com, Nadav Haklai,
	linux@armlinux.org.uk, linux-kernel@vger.kernel.org,
	mw@semihalf.com, miquel.raynal@free-electrons.com,
	netdev@vger.kernel.org
In-Reply-To: <e83030e8499843e8af9004c2562380dc@IL-EXCH01.marvell.com>

[-- Attachment #1: Type: text/plain, Size: 642 bytes --]

Hi Stefan,

On Thu, Aug 24, 2017 at 01:57:04PM +0000, Stefan Chulski wrote:
> 
> > > How would you name it if not "comphy-cp110"?
> > 
> > Good question...
> > 
> > '7000-cpmphy-cp110'
> > '8000-cpmphy-cp110'
> > 
> > ??
> > 
> > 	Andrew
> 
> A8K Marvell SoC has two South Bridge communication controllers
> and A7K only one communication controllers, but its identical
> communication controllers 110. Next generation will has another number 1xx.

OK, so I guess we can keep 'comphy-cp110' then.

Thanks!
Antoine

-- 
Antoine Ténart, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH net-next 02/13] phy: add the mvebu cp110 comphy driver
From: Andrew Lunn @ 2017-08-24 14:02 UTC (permalink / raw)
  To: Stefan Chulski
  Cc: Antoine Tenart, davem@davemloft.net, kishon@ti.com,
	jason@lakedaemon.net, sebastian.hesselbarth@gmail.com,
	gregory.clement@free-electrons.com,
	thomas.petazzoni@free-electrons.com, Nadav Haklai,
	linux@armlinux.org.uk, linux-kernel@vger.kernel.org,
	mw@semihalf.com, miquel.raynal@free-electrons.com,
	netdev@vger.kernel.org
In-Reply-To: <e83030e8499843e8af9004c2562380dc@IL-EXCH01.marvell.com>

> > -----Original Message-----
> > From: Andrew Lunn [mailto:andrew@lunn.ch]
> > Sent: Thursday, August 24, 2017 4:51 PM
> > To: Antoine Tenart <antoine.tenart@free-electrons.com>
> > Cc: davem@davemloft.net; kishon@ti.com; jason@lakedaemon.net;
> > sebastian.hesselbarth@gmail.com; gregory.clement@free-electrons.com;
> > thomas.petazzoni@free-electrons.com; Nadav Haklai <nadavh@marvell.com>;
> > linux@armlinux.org.uk; linux-kernel@vger.kernel.org; mw@semihalf.com;
> > Stefan Chulski <stefanc@marvell.com>; miquel.raynal@free-electrons.com;
> > netdev@vger.kernel.org
> > Subject: Re: [PATCH net-next 02/13] phy: add the mvebu cp110 comphy driver
> > 
> > Hi Antione.
> > 
> > > How would you name it if not "comphy-cp110"?
> > 
> > Good question...
> > 
> > '7000-cpmphy-cp110'
> > '8000-cpmphy-cp110'
> > 
> > ??
> > 
> > 	Andrew
> 
> A8K Marvell SoC has two South Bridge communication controllers
> and A7K only one communication controllers, but its identical
> communication controllers 110. Next generation will has another number 1xx.

Save this email, so we know who to blame when Marvell does something
different :-)

	  Andrew

^ permalink raw reply

* Re: [PATCH net-next 12/13] arm64: dts: marvell: mcbin: add comphy references to Ethernet ports
From: Antoine Tenart @ 2017-08-24 14:03 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Antoine Tenart, davem, kishon, jason, sebastian.hesselbarth,
	gregory.clement, thomas.petazzoni, nadavh, linux, linux-kernel,
	mw, stefanc, miquel.raynal, netdev
In-Reply-To: <20170824135813.GF8022@lunn.ch>

[-- Attachment #1: Type: text/plain, Size: 576 bytes --]

Hi Andrew,

On Thu, Aug 24, 2017 at 03:58:13PM +0200, Andrew Lunn wrote:
> > @@ -189,6 +191,7 @@
> >  	status = "okay";
> >  	phy = <&ge_phy>;
> >  	phy-mode = "sgmii";
> > +	phys = <&cps_comphy0 1>;
> 
> Does the binding document describe the meaning of the specifier?

Ahhh no you're right! It's the port number i.e. there are multiple
inputs, each of which can support different modes. So say the input is
GoP#0, it can support 10G and SGMII.

Antoine

-- 
Antoine Ténart, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH net-next 07/13] net: mvpp2: improve the link management function
From: Andrew Lunn @ 2017-08-24 14:06 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: davem, kishon, jason, sebastian.hesselbarth, gregory.clement,
	thomas.petazzoni, nadavh, linux, linux-kernel, mw, stefanc,
	miquel.raynal, netdev
In-Reply-To: <20170824083823.16826-8-antoine.tenart@free-electrons.com>

On Thu, Aug 24, 2017 at 10:38:17AM +0200, Antoine Tenart wrote:
> When the link status changes, the phylib calls the link_event function
> in the mvpp2 driver. Before this patch only the egress/ingress transmit
> was enabled/disabled. This patch adds more functionality to the link
> status management code by enabling/disabling the port per-cpu
> interrupts, and the port itself. The queues are now stopped as well, and
> the netif carrier helpers are called.
> 
> Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
> ---
>  drivers/net/ethernet/marvell/mvpp2.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/net/ethernet/marvell/mvpp2.c b/drivers/net/ethernet/marvell/mvpp2.c
> index ebcc89b8f792..99847fec1c5a 100644
> --- a/drivers/net/ethernet/marvell/mvpp2.c
> +++ b/drivers/net/ethernet/marvell/mvpp2.c
> @@ -5753,14 +5753,24 @@ static void mvpp2_link_event(struct net_device *dev)
>  		port->link = phydev->link;
>  
>  		if (phydev->link) {
> +			mvpp2_interrupts_enable(port);
> +			mvpp2_port_enable(port);
> +
>  			mvpp2_egress_enable(port);
>  			mvpp2_ingress_enable(port);
> +			netif_carrier_on(dev);

Hi Antoine

Have you seen cases where it is required to change the carrier state?
The phy state machine should be doing this. e.g. when autoneg has
completed, force link configuration, the link goes down etc.

	   Andrew

^ permalink raw reply

* Re: [PATCH] netfilter: SYNPROXY: fix process non tcp packet bug in {ipv4,ipv6}_synproxy_hook
From: Pablo Neira Ayuso @ 2017-08-24 14:07 UTC (permalink / raw)
  To: Lin Zhang
  Cc: davem, kadlec, fw, kuznet, yoshfuji, netdev, netfilter-devel,
	coreteam
In-Reply-To: <1501221784-18226-1-git-send-email-xiaolou4617@gmail.com>

On Fri, Jul 28, 2017 at 02:03:04PM +0800, Lin Zhang wrote:
> In function {ipv4,ipv6}_synproxy_hook we expect a normal tcp packet,
> but the real server maybe reply an icmp error packet related to the 
> exist tcp conntrack, so we will access wrong tcp data.
> 
> For fix it, we simply pass IP_CT_RELATED_REPLY packets.
> 
> Signed-off-by: Lin Zhang <xiaolou4617@gmail.com>
> ---
>  net/ipv4/netfilter/ipt_SYNPROXY.c  | 2 +-
>  net/ipv6/netfilter/ip6t_SYNPROXY.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/netfilter/ipt_SYNPROXY.c b/net/ipv4/netfilter/ipt_SYNPROXY.c
> index f1528f7..3971fd9 100644
> --- a/net/ipv4/netfilter/ipt_SYNPROXY.c
> +++ b/net/ipv4/netfilter/ipt_SYNPROXY.c
> @@ -330,7 +330,7 @@ static unsigned int ipv4_synproxy_hook(void *priv,
>  	if (synproxy == NULL)
>  		return NF_ACCEPT;
>  
> -	if (nf_is_loopback_packet(skb))
> +	if (nf_is_loopback_packet(skb) || ctinfo == IP_CT_RELATED_REPLY)

If the intention is to inspect TCP traffic only, I would suggest you
just check for the protocol field here instead. So we are sure we only
deal with TCP traffic indeed.

Thanks.

^ permalink raw reply

* Re: [PATCH net-next v1] net: gso: Add GSO support for NSH (Network Service Header)
From: Jiri Benc @ 2017-08-24 14:13 UTC (permalink / raw)
  To: Yi Yang; +Cc: netdev, simon.horman
In-Reply-To: <1503567376-64933-1-git-send-email-yi.y.yang@intel.com>

On Thu, 24 Aug 2017 17:36:16 +0800, Yi Yang wrote:
>  include/net/nsh.h             | 307 ++++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/if_ether.h |   1 +
>  net/Kconfig                   |   1 +
>  net/Makefile                  |   1 +
>  net/nsh/Kconfig               |  10 ++
>  net/nsh/Makefile              |   4 +
>  net/nsh/nsh_gso.c             | 116 ++++++++++++++++

Please consider making this a patchset, with a patch adding the NSH
structures and helper functions, a patch adding GSO and a patch adding
OVS support. That way, everything can be reviewed together, yet the
patches are reasonably contained.

> +config NET_NSH_GSO
> +	bool "NSH GSO support"
> +	depends on INET
> +	default y
> +	---help---
> +	 This allows segmentation of GSO packet that have had NSH header pushed         onto them and thus become NSH GSO packets.

Seems you're missing a newline here.

> +static struct sk_buff *nsh_gso_segment(struct sk_buff *skb,
> +				       netdev_features_t features)
> +{
> +	struct sk_buff *segs = ERR_PTR(-EINVAL);
> +	__be16 protocol = skb->protocol;
> +	__be16 inner_proto;
> +	u16 mac_offset = skb->mac_header;
> +	u16 mac_len = skb->mac_len;
> +	struct nsh_hdr *nsh;
> +	unsigned int nsh_hlen;
> +	const struct net_offload *ops;
> +	struct sk_buff *(*gso_inner_segment)(struct sk_buff *skb,
> +					     netdev_features_t features);
> +
> +	skb_reset_network_header(skb);
> +	nsh = (struct nsh_hdr *)skb_network_header(skb);
> +	nsh_hlen = nsh_hdr_len(nsh);
> +	if (unlikely(!pskb_may_pull(skb, nsh_hlen)))
> +		goto out;

You have to reload nsh after this.

> +
> +	__skb_pull(skb, nsh_hlen);
> +
> +	skb_reset_mac_header(skb);
> +	skb_reset_mac_len(skb);
> +
> +	rcu_read_lock();
> +	switch (nsh->next_proto) {
> +	case NSH_P_ETHERNET:
> +		inner_proto = htons(ETH_P_TEB);
> +		gso_inner_segment = skb_mac_gso_segment;
> +		break;
> +	case NSH_P_IPV4:
> +		inner_proto = htons(ETH_P_IP);
> +		ops = rcu_dereference(inet_offloads[inner_proto]);
> +		if (!ops || !ops->callbacks.gso_segment)
> +			goto out;
> +		gso_inner_segment = ops->callbacks.gso_segment;
> +		break;
> +	case NSH_P_IPV6:
> +		inner_proto = htons(ETH_P_IPV6);
> +		ops = rcu_dereference(inet6_offloads[inner_proto]);
> +		if (!ops || !ops->callbacks.gso_segment)
> +			goto out;
> +		gso_inner_segment = ops->callbacks.gso_segment;
> +		break;
> +	case NSH_P_NSH:
> +		inner_proto = htons(ETH_P_NSH);
> +		gso_inner_segment = nsh_gso_segment;

This doesn't look correct. Recursive call to nsh_gso_segment will reset
mac header, causing skb_segment not to copy the previous NSH header(s)
to the new segments.

> +		break;
> +	default:
> +		skb_gso_error_unwind(skb, protocol, nsh_hlen, mac_offset,
> +				     mac_len);
> +		goto out;
> +	}
> +
> +	skb->protocol = inner_proto;
> +	segs = gso_inner_segment(skb, features);
> +	if (IS_ERR_OR_NULL(segs)) {
> +		skb_gso_error_unwind(skb, protocol, nsh_hlen, mac_offset,
> +				     mac_len);
> +		goto out;
> +	}
> +
> +	do {
> +		skb->mac_len = mac_len;
> +		skb->protocol = protocol;
> +
> +		skb_reset_inner_network_header(skb);

This looks superfluous.

> +
> +		__skb_push(skb, nsh_hlen);
> +
> +		skb_reset_mac_header(skb);
> +		skb_set_network_header(skb, mac_len);
> +	} while ((skb = skb->next));
> +
> +out:
> +	rcu_read_unlock();
> +	return segs;
> +}
> +
> +static struct packet_offload nsh_offload __read_mostly = {
> +	.type = cpu_to_be16(ETH_P_NSH),
> +	.priority = 15,
> +	.callbacks = {
> +		.gso_segment    =	nsh_gso_segment,
> +	},
> +};
> +
> +static int __init nsh_gso_init(void)
> +{
> +	dev_add_offload(&nsh_offload);
> +
> +	return 0;
> +}
> +
> +fs_initcall(nsh_gso_init);

device_initcall should be enough. I doubt we'll be doing NFS over
NSH ;-)

 Jiri

^ permalink raw reply

* Re: [PATCH net-next 07/13] net: mvpp2: improve the link management function
From: Antoine Tenart @ 2017-08-24 14:14 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Antoine Tenart, davem, kishon, jason, sebastian.hesselbarth,
	gregory.clement, thomas.petazzoni, nadavh, linux, linux-kernel,
	mw, stefanc, miquel.raynal, netdev
In-Reply-To: <20170824140625.GH8022@lunn.ch>

[-- Attachment #1: Type: text/plain, Size: 994 bytes --]

Hi Andrew,

On Thu, Aug 24, 2017 at 04:06:25PM +0200, Andrew Lunn wrote:
> On Thu, Aug 24, 2017 at 10:38:17AM +0200, Antoine Tenart wrote:
> > @@ -5753,14 +5753,24 @@ static void mvpp2_link_event(struct net_device *dev)
> >  		port->link = phydev->link;
> >  
> >  		if (phydev->link) {
> > +			mvpp2_interrupts_enable(port);
> > +			mvpp2_port_enable(port);
> > +
> >  			mvpp2_egress_enable(port);
> >  			mvpp2_ingress_enable(port);
> > +			netif_carrier_on(dev);
> 
> Have you seen cases where it is required to change the carrier state?
> The phy state machine should be doing this. e.g. when autoneg has
> completed, force link configuration, the link goes down etc.

I don't think I saw a case where this was needed. And if phylib already
handle this I think it should be removed from here as the
mvpp2_link_event only is called by phylib.

Thanks!
Antoine

-- 
Antoine Ténart, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* [PATCH net v1 0/3] tipc: buffer reassignment fixes
From: Parthasarathy Bhuvaragan @ 2017-08-24 14:31 UTC (permalink / raw)
  To: davem; +Cc: netdev, tipc-discussion, jon.maloy, maloy, ying.xue

This series contains fixes for buffer reassignments and a context imbalance.

Parthasarathy Bhuvaragan (3):
  tipc: perform skb_linearize() before parsing the inner header
  tipc: reassign pointers after skb reallocation / linearization
  tipc: context imbalance at node read unlock

 net/tipc/msg.c  | 7 +++++--
 net/tipc/node.c | 4 +++-
 2 files changed, 8 insertions(+), 3 deletions(-)

-- 
2.1.4

^ permalink raw reply

* [PATCH net v1 1/3] tipc: perform skb_linearize() before parsing the inner header
From: Parthasarathy Bhuvaragan @ 2017-08-24 14:31 UTC (permalink / raw)
  To: davem; +Cc: netdev, tipc-discussion, jon.maloy, maloy, ying.xue
In-Reply-To: <1503585084-14079-1-git-send-email-parthasarathy.bhuvaragan@ericsson.com>

In tipc_rcv(), we linearize only the header and usually the packets
are consumed as the nodes permit direct reception. However, if the
skb contains tunnelled message due to fail over or synchronization
we parse it in tipc_node_check_state() without performing
linearization. This will cause link disturbances if the skb was
non linear.

In this commit, we perform linearization for the above messages.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
---
 net/tipc/node.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/tipc/node.c b/net/tipc/node.c
index 9b4dcb6a16b5..b113a52f8914 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -1557,6 +1557,8 @@ void tipc_rcv(struct net *net, struct sk_buff *skb, struct tipc_bearer *b)
 
 	/* Check/update node state before receiving */
 	if (unlikely(skb)) {
+		if (unlikely(skb_linearize(skb)))
+			goto discard;
 		tipc_node_write_lock(n);
 		if (tipc_node_check_state(n, skb, bearer_id, &xmitq)) {
 			if (le->link) {
-- 
2.1.4

^ permalink raw reply related

* [PATCH net v1 2/3] tipc: reassign pointers after skb reallocation / linearization
From: Parthasarathy Bhuvaragan @ 2017-08-24 14:31 UTC (permalink / raw)
  To: davem; +Cc: netdev, tipc-discussion, jon.maloy, maloy, ying.xue
In-Reply-To: <1503585084-14079-1-git-send-email-parthasarathy.bhuvaragan@ericsson.com>

In tipc_msg_reverse(), we assign skb attributes to local pointers
in stack at startup. This is followed by skb_linearize() and for
cloned buffers we perform skb relocation using pskb_expand_head().
Both these methods may update the skb attributes and thus making
the pointers incorrect.

In this commit, we fix this error by ensuring that the pointers
are re-assigned after any of these skb operations.

Fixes: 29042e19f2c60 ("tipc: let function tipc_msg_reverse() expand header
when needed")
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
---
 net/tipc/msg.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/tipc/msg.c b/net/tipc/msg.c
index dcd90e6fa7c3..6ef379f004ac 100644
--- a/net/tipc/msg.c
+++ b/net/tipc/msg.c
@@ -479,13 +479,14 @@ bool tipc_msg_make_bundle(struct sk_buff **skb,  struct tipc_msg *msg,
 bool tipc_msg_reverse(u32 own_node,  struct sk_buff **skb, int err)
 {
 	struct sk_buff *_skb = *skb;
-	struct tipc_msg *hdr = buf_msg(_skb);
+	struct tipc_msg *hdr;
 	struct tipc_msg ohdr;
-	int dlen = min_t(uint, msg_data_sz(hdr), MAX_FORWARD_SIZE);
+	int dlen;
 
 	if (skb_linearize(_skb))
 		goto exit;
 	hdr = buf_msg(_skb);
+	dlen = min_t(uint, msg_data_sz(hdr), MAX_FORWARD_SIZE);
 	if (msg_dest_droppable(hdr))
 		goto exit;
 	if (msg_errcode(hdr))
@@ -511,6 +512,8 @@ bool tipc_msg_reverse(u32 own_node,  struct sk_buff **skb, int err)
 	    pskb_expand_head(_skb, BUF_HEADROOM, BUF_TAILROOM, GFP_ATOMIC))
 		goto exit;
 
+	/* reassign after skb header modifications */
+	hdr = buf_msg(_skb);
 	/* Now reverse the concerned fields */
 	msg_set_errcode(hdr, err);
 	msg_set_non_seq(hdr, 0);
-- 
2.1.4

^ permalink raw reply related

* [PATCH net v1 3/3] tipc: context imbalance at node read unlock
From: Parthasarathy Bhuvaragan @ 2017-08-24 14:31 UTC (permalink / raw)
  To: davem; +Cc: netdev, tipc-discussion, jon.maloy, maloy, ying.xue
In-Reply-To: <1503585084-14079-1-git-send-email-parthasarathy.bhuvaragan@ericsson.com>

If we fail to find a valid bearer in tipc_node_get_linkname(),
node_read_unlock() is called without holding the node read lock.

This commit fixes this error.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
---
 net/tipc/node.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tipc/node.c b/net/tipc/node.c
index b113a52f8914..7dd22330a6b4 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -1126,8 +1126,8 @@ int tipc_node_get_linkname(struct net *net, u32 bearer_id, u32 addr,
 		strncpy(linkname, tipc_link_name(link), len);
 		err = 0;
 	}
-exit:
 	tipc_node_read_unlock(node);
+exit:
 	tipc_node_put(node);
 	return err;
 }
-- 
2.1.4

^ permalink raw reply related

* [PATCH 1/5] netfilter: ipt_CLUSTERIP: fix use-after-free of proc entry
From: Pablo Neira Ayuso @ 2017-08-24 14:43 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1503585811-13447-1-git-send-email-pablo@netfilter.org>

From: Sabrina Dubroca <sd@queasysnail.net>

When we delete a netns with a CLUSTERIP rule, clusterip_net_exit() is
called first, removing /proc/net/ipt_CLUSTERIP.
Then clusterip_config_entry_put() is called from clusterip_tg_destroy(),
and tries to remove its entry under /proc/net/ipt_CLUSTERIP/.

Fix this by checking that the parent directory of the entry to remove
hasn't already been deleted.

The following triggers a KASAN splat (stealing the reproducer from
202f59afd441, thanks to Jianlin Shi and Xin Long):

    ip netns add test
    ip link add veth0_in type veth peer name veth0_out
    ip link set veth0_in netns test
    ip netns exec test ip link set lo up
    ip netns exec test ip link set veth0_in up
    ip netns exec test iptables -I INPUT -d 1.2.3.4 -i veth0_in -j     \
        CLUSTERIP --new --clustermac 89:d4:47:eb:9a:fa --total-nodes 3 \
        --local-node 1 --hashmode sourceip-sourceport
    ip netns del test

Fixes: ce4ff76c15a8 ("netfilter: ipt_CLUSTERIP: make proc directory per net namespace")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/ipv4/netfilter/ipt_CLUSTERIP.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/netfilter/ipt_CLUSTERIP.c b/net/ipv4/netfilter/ipt_CLUSTERIP.c
index 7d72decb80f9..efaa04dcc80e 100644
--- a/net/ipv4/netfilter/ipt_CLUSTERIP.c
+++ b/net/ipv4/netfilter/ipt_CLUSTERIP.c
@@ -117,7 +117,8 @@ clusterip_config_entry_put(struct net *net, struct clusterip_config *c)
 		 * functions are also incrementing the refcount on their own,
 		 * so it's safe to remove the entry even if it's in use. */
 #ifdef CONFIG_PROC_FS
-		proc_remove(c->pde);
+		if (cn->procdir)
+			proc_remove(c->pde);
 #endif
 		return;
 	}
@@ -815,6 +816,7 @@ static void clusterip_net_exit(struct net *net)
 #ifdef CONFIG_PROC_FS
 	struct clusterip_net *cn = net_generic(net, clusterip_net_id);
 	proc_remove(cn->procdir);
+	cn->procdir = NULL;
 #endif
 	nf_unregister_net_hook(net, &cip_arp_ops);
 }
-- 
2.1.4



^ permalink raw reply related

* [PATCH 2/5] netfilter: nft_compat: check extension hook mask only if set
From: Pablo Neira Ayuso @ 2017-08-24 14:43 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1503585811-13447-1-git-send-email-pablo@netfilter.org>

If the x_tables extension comes with no hook mask, skip this validation.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nft_compat.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nft_compat.c b/net/netfilter/nft_compat.c
index f5a7cb68694e..b89f4f65b2a0 100644
--- a/net/netfilter/nft_compat.c
+++ b/net/netfilter/nft_compat.c
@@ -305,7 +305,7 @@ static int nft_target_validate(const struct nft_ctx *ctx,
 		const struct nf_hook_ops *ops = &basechain->ops[0];
 
 		hook_mask = 1 << ops->hooknum;
-		if (!(hook_mask & target->hooks))
+		if (target->hooks && !(hook_mask & target->hooks))
 			return -EINVAL;
 
 		ret = nft_compat_chain_validate_dependency(target->table,
@@ -484,7 +484,7 @@ static int nft_match_validate(const struct nft_ctx *ctx,
 		const struct nf_hook_ops *ops = &basechain->ops[0];
 
 		hook_mask = 1 << ops->hooknum;
-		if (!(hook_mask & match->hooks))
+		if (match->hooks && !(hook_mask & match->hooks))
 			return -EINVAL;
 
 		ret = nft_compat_chain_validate_dependency(match->table,
-- 
2.1.4



^ permalink raw reply related

* [PATCH 3/5] netfilter: x_tables: Fix use-after-free in ipt_do_table.
From: Pablo Neira Ayuso @ 2017-08-24 14:43 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1503585811-13447-1-git-send-email-pablo@netfilter.org>

From: Taehee Yoo <ap420073@gmail.com>

If verdict is NF_STOLEN in the SYNPROXY target,
the skb is consumed.
However, ipt_do_table() always tries to get ip header from the skb.
So that, KASAN triggers the use-after-free message.

We can reproduce this message using below command.
  # iptables -I INPUT -p tcp -j SYNPROXY --mss 1460

[ 193.542265] BUG: KASAN: use-after-free in ipt_do_table+0x1405/0x1c10
[ ... ]
[ 193.578603] Call Trace:
[ 193.581590] <IRQ>
[ 193.584107] dump_stack+0x68/0xa0
[ 193.588168] print_address_description+0x78/0x290
[ 193.593828] ? ipt_do_table+0x1405/0x1c10
[ 193.598690] kasan_report+0x230/0x340
[ 193.603194] __asan_report_load2_noabort+0x19/0x20
[ 193.608950] ipt_do_table+0x1405/0x1c10
[ 193.613591] ? rcu_read_lock_held+0xae/0xd0
[ 193.618631] ? ip_route_input_rcu+0x27d7/0x4270
[ 193.624348] ? ipt_do_table+0xb68/0x1c10
[ 193.629124] ? do_add_counters+0x620/0x620
[ 193.634234] ? iptable_filter_net_init+0x60/0x60
[ ... ]

After this patch, only when verdict is XT_CONTINUE,
ipt_do_table() tries to get ip header.
Also arpt_do_table() is modified because it has same bug.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/ipv4/netfilter/arp_tables.c | 10 +++++-----
 net/ipv4/netfilter/ip_tables.c  |  9 +++++----
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index 0bc3c3d73e61..9e9d9afd18f7 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -268,14 +268,14 @@ unsigned int arpt_do_table(struct sk_buff *skb,
 		acpar.targinfo = t->data;
 		verdict = t->u.kernel.target->target(skb, &acpar);
 
-		/* Target might have changed stuff. */
-		arp = arp_hdr(skb);
-
-		if (verdict == XT_CONTINUE)
+		if (verdict == XT_CONTINUE) {
+			/* Target might have changed stuff. */
+			arp = arp_hdr(skb);
 			e = arpt_next_entry(e);
-		else
+		} else {
 			/* Verdict */
 			break;
+		}
 	} while (!acpar.hotdrop);
 	xt_write_recseq_end(addend);
 	local_bh_enable();
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 2a55a40211cb..622ed2887cd5 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -352,13 +352,14 @@ ipt_do_table(struct sk_buff *skb,
 		acpar.targinfo = t->data;
 
 		verdict = t->u.kernel.target->target(skb, &acpar);
-		/* Target might have changed stuff. */
-		ip = ip_hdr(skb);
-		if (verdict == XT_CONTINUE)
+		if (verdict == XT_CONTINUE) {
+			/* Target might have changed stuff. */
+			ip = ip_hdr(skb);
 			e = ipt_next_entry(e);
-		else
+		} else {
 			/* Verdict */
 			break;
+		}
 	} while (!acpar.hotdrop);
 
 	xt_write_recseq_end(addend);
-- 
2.1.4



^ permalink raw reply related

* [PATCH 0/5] Netfilter fixes for net
From: Pablo Neira Ayuso @ 2017-08-24 14:43 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

Hi David,

The following patchset contains Netfilter fixes for your net tree,
they are:

1) Fix use after free of struct proc_dir_entry in ipt_CLUSTERIP, patch
   from Sabrina Dubroca.

2) Fix spurious EINVAL errors from iptables over nft compatibility layer.

3) Reload pointer to ip header only if there is non-terminal verdict,
   ie. XT_CONTINUE, otherwise invalid memory access may happen, patch
   from Taehee Yoo.

4) Fix interaction between SYNPROXY and NAT, SYNPROXY adds sequence
   adjustment already, however from nf_nat_setup() assumes there's not.
   Patch from Xin Long.

5) Fix burst arithmetics in nft_limit as Joe Stringer mentioned during
   NFWS in Faro. Patch from Andy Zhou.

You can pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

Thanks a lot!

----------------------------------------------------------------

The following changes since commit 073dd5ad34b1d3aaadaa7e5e8cbe576d9545f163:

  netfilter: fix netfilter_net_init() return (2017-07-18 14:50:28 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git HEAD

for you to fetch changes up to c26844eda9d4fdbd266660e3b3de2d0270e3a1ed:

  netfilter: nf_tables: Fix nft limit burst handling (2017-08-24 16:23:17 +0200)

----------------------------------------------------------------
Pablo Neira Ayuso (1):
      netfilter: nft_compat: check extension hook mask only if set

Sabrina Dubroca (1):
      netfilter: ipt_CLUSTERIP: fix use-after-free of proc entry

Taehee Yoo (1):
      netfilter: x_tables: Fix use-after-free in ipt_do_table.

Xin Long (1):
      netfilter: check for seqadj ext existence before adding it in nf_nat_setup_info

andy zhou (1):
      netfilter: nf_tables: Fix nft limit burst handling

 net/ipv4/netfilter/arp_tables.c    | 10 +++++-----
 net/ipv4/netfilter/ip_tables.c     |  9 +++++----
 net/ipv4/netfilter/ipt_CLUSTERIP.c |  4 +++-
 net/netfilter/nf_nat_core.c        |  2 +-
 net/netfilter/nft_compat.c         |  4 ++--
 net/netfilter/nft_limit.c          | 25 ++++++++++++++-----------
 6 files changed, 30 insertions(+), 24 deletions(-)

^ permalink raw reply

* [PATCH 4/5] netfilter: check for seqadj ext existence before adding it in nf_nat_setup_info
From: Pablo Neira Ayuso @ 2017-08-24 14:43 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1503585811-13447-1-git-send-email-pablo@netfilter.org>

From: Xin Long <lucien.xin@gmail.com>

Commit 4440a2ab3b9f ("netfilter: synproxy: Check oom when adding synproxy
and seqadj ct extensions") wanted to drop the packet when it fails to add
seqadj ext due to no memory by checking if nfct_seqadj_ext_add returns
NULL.

But that nfct_seqadj_ext_add returns NULL can also happen when seqadj ext
already exists in a nf_conn. It will cause that userspace protocol doesn't
work when both dnat and snat are configured.

Li Shuang found this issue in the case:

Topo:
   ftp client                   router                  ftp server
  10.167.131.2  <-> 10.167.131.254  10.167.141.254 <-> 10.167.141.1

Rules:
  # iptables -t nat -A PREROUTING -i eth1 -p tcp -m tcp --dport 21 -j \
    DNAT --to-destination 10.167.141.1
  # iptables -t nat -A POSTROUTING -o eth2 -p tcp -m tcp --dport 21 -j \
    SNAT --to-source 10.167.141.254

In router, when both dnat and snat are added, nf_nat_setup_info will be
called twice. The packet can be dropped at the 2nd time for DNAT due to
seqadj ext is already added at the 1st time for SNAT.

This patch is to fix it by checking for seqadj ext existence before adding
it, so that the packet will not be dropped if seqadj ext already exists.

Note that as Florian mentioned, as a long term, we should review ext_add()
behaviour, it's better to return a pointer to the existing ext instead.

Fixes: 4440a2ab3b9f ("netfilter: synproxy: Check oom when adding synproxy and seqadj ct extensions")
Reported-by: Li Shuang <shuali@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_nat_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c
index eb541786ccb7..b1d3740ae36a 100644
--- a/net/netfilter/nf_nat_core.c
+++ b/net/netfilter/nf_nat_core.c
@@ -441,7 +441,7 @@ nf_nat_setup_info(struct nf_conn *ct,
 		else
 			ct->status |= IPS_DST_NAT;
 
-		if (nfct_help(ct))
+		if (nfct_help(ct) && !nfct_seqadj(ct))
 			if (!nfct_seqadj_ext_add(ct))
 				return NF_DROP;
 	}
-- 
2.1.4

^ permalink raw reply related

* [PATCH 5/5] netfilter: nf_tables: Fix nft limit burst handling
From: Pablo Neira Ayuso @ 2017-08-24 14:43 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1503585811-13447-1-git-send-email-pablo@netfilter.org>

From: andy zhou <azhou@ovn.org>

Current implementation treats the burst configuration the same as
rate configuration. This can cause the per packet cost to be lower
than configured. In effect, this bug causes the token bucket to be
refilled at a higher rate than what user has specified.

This patch changes the implementation so that the token bucket size
is controlled by "rate + burst", while maintain the token bucket
refill rate the same as user specified.

Fixes: 96518518cc41 ("netfilter: add nftables")
Signed-off-by: Andy Zhou <azhou@ovn.org>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nft_limit.c | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/net/netfilter/nft_limit.c b/net/netfilter/nft_limit.c
index 18dd57a52651..14538b1d4d11 100644
--- a/net/netfilter/nft_limit.c
+++ b/net/netfilter/nft_limit.c
@@ -65,19 +65,23 @@ static int nft_limit_init(struct nft_limit *limit,
 	limit->nsecs = unit * NSEC_PER_SEC;
 	if (limit->rate == 0 || limit->nsecs < unit)
 		return -EOVERFLOW;
-	limit->tokens = limit->tokens_max = limit->nsecs;
-
-	if (tb[NFTA_LIMIT_BURST]) {
-		u64 rate;
 
+	if (tb[NFTA_LIMIT_BURST])
 		limit->burst = ntohl(nla_get_be32(tb[NFTA_LIMIT_BURST]));
+	else
+		limit->burst = 0;
+
+	if (limit->rate + limit->burst < limit->rate)
+		return -EOVERFLOW;
 
-		rate = limit->rate + limit->burst;
-		if (rate < limit->rate)
-			return -EOVERFLOW;
+	/* The token bucket size limits the number of tokens can be
+	 * accumulated. tokens_max specifies the bucket size.
+	 * tokens_max = unit * (rate + burst) / rate.
+	 */
+	limit->tokens = div_u64(limit->nsecs * (limit->rate + limit->burst),
+				limit->rate);
+	limit->tokens_max = limit->tokens;
 
-		limit->rate = rate;
-	}
 	if (tb[NFTA_LIMIT_FLAGS]) {
 		u32 flags = ntohl(nla_get_be32(tb[NFTA_LIMIT_FLAGS]));
 
@@ -95,9 +99,8 @@ static int nft_limit_dump(struct sk_buff *skb, const struct nft_limit *limit,
 {
 	u32 flags = limit->invert ? NFT_LIMIT_F_INV : 0;
 	u64 secs = div_u64(limit->nsecs, NSEC_PER_SEC);
-	u64 rate = limit->rate - limit->burst;
 
-	if (nla_put_be64(skb, NFTA_LIMIT_RATE, cpu_to_be64(rate),
+	if (nla_put_be64(skb, NFTA_LIMIT_RATE, cpu_to_be64(limit->rate),
 			 NFTA_LIMIT_PAD) ||
 	    nla_put_be64(skb, NFTA_LIMIT_UNIT, cpu_to_be64(secs),
 			 NFTA_LIMIT_PAD) ||
-- 
2.1.4

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox