* Re: [PATCH] bonding: rejoin multicast groups on VLANs
From: Andy Gospodarek @ 2010-09-29 19:54 UTC (permalink / raw)
To: Flavio Leitner; +Cc: netdev
In-Reply-To: <20100929193539.GC2864@redhat.com>
On Wed, Sep 29, 2010 at 04:35:39PM -0300, Flavio Leitner wrote:
> On Wed, Sep 29, 2010 at 02:44:13PM -0400, Andy Gospodarek wrote:
> > On Wed, Sep 29, 2010 at 04:12:24AM -0300, Flavio Leitner wrote:
> > > It fixes bonding to rejoin multicast groups added
> > > to VLAN devices on top of bonding when a failover
> > > happens.
> > >
> > > The first packet may be discarded, so the timer
> > > ensures that at least 3 Reports are sent.
> > >
> >
> > Good find, Flavio. Clearly the fact that multicast membership is broken
> > needs to be fixed, but I would rather not see timers used at all. We
> > worked hard in the past to eliminate timers for several reasons, so I
> > would rather see a workqueue used.
>
> I noticed that the code is using workqueues now, just thought a
> simple thing which may run couple times would fit perfectly with
> a simple timer.
>
Timers run in softirq context, so I'd rather not add code that takes
locks and runs in softirq context.
>
> > I also don't like retransmitting the membership report 3 times when it
> > may not be needed. Though many switches can handle it, the cost of
> > receiving and processing what might be a large list of multicast
> > addresses every 200ms for 600ms doesn't seem ideal. It also feels like
> > a hack. :)
>
> A parameter is definitely much better, but I wasn't sure about
> the patch approach, so I was expecting a review like this before
> doing the refinements needed. Better to post early, right? :)
>
> I see your point about changing the default to one membership report,
> but during a failover we can't be sure that everything has been
> received. Also, the link isn't supposed to keep failing and flooding
> the network, so I would rather have a couple of membership reports
> sent than watch an important multicast application fail.
>
> Perhaps 3 is too many, but one sounds too few to me.
>
> What do you think?
>
Adding a tunable parameter allows the administrator to decide how many
is enough. I would rather keep the default at one and add the tunable
parameter (which needs to be added to bond_sysfs.c to be effective).

I have not heard loud complaints about only sending one since the code
to send retransmits of membership reports was added a few years ago, so
I'm inclined to think it is working well for most users (or no one is
using bonding).

Maybe it would be best to break this into 2 patches. One that simply
fixes the failover code so it works with VLANs (that could be done
easily today) and another patch that can add the code to send multiple
retransmits. Would you be willing to do that?
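
As a rough illustration of the tunable discussed above, the sketch below models the range check and the retransmit countdown from the proposed diff in plain userspace C. The helper names are hypothetical, the workqueue and the HZ/5 delay are not modelled, and a plain int stands in for the s8 counter used in the diff (an s8 cannot represent values above 127).

```c
#include <assert.h>

/* Mirrors BOND_DEFAULT_RESEND_IGMP from the proposed diff. */
#define RESEND_IGMP_DEFAULT 1
#define RESEND_IGMP_MAX     255

/* Hypothetical helper: out-of-range values fall back to the default. */
static int check_resend_igmp(int value)
{
	if (value < 0 || value > RESEND_IGMP_MAX)
		return RESEND_IGMP_DEFAULT;
	return value;
}

/*
 * Hypothetical helper: count the membership reports one failover triggers.
 * The first round is always queued; another round is scheduled only while
 * the decremented counter stays above zero.
 */
static int reports_per_failover(int resend_igmp)
{
	int retrans = resend_igmp;
	int sent = 0;

	assert(resend_igmp >= 0 && resend_igmp <= RESEND_IGMP_MAX);
	do {
		sent++;
	} while (--retrans > 0);

	return sent;
}
```

Under these rules a setting of 1 (the proposed default) gives a single report, 3 gives three, and 0 still gives one, because the first round is queued unconditionally.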
> Flavio
>
> >
> > If retransmission of the membership reports is a requirement for some
> > users, I would rather see it as a configuration option.
> >
> > Maybe something like this?
> >
> > diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> > index 3b16f62..b7b4a74 100644
> > --- a/drivers/net/bonding/bond_main.c
> > +++ b/drivers/net/bonding/bond_main.c
> > @@ -109,6 +109,7 @@ static char *arp_validate;
> > static char *fail_over_mac;
> > static int all_slaves_active = 0;
> > static struct bond_params bonding_defaults;
> > +static int resend_igmp = BOND_DEFAULT_RESEND_IGMP;
> >
> > module_param(max_bonds, int, 0);
> > MODULE_PARM_DESC(max_bonds, "Max number of bonded devices");
> > @@ -163,6 +164,8 @@ module_param(all_slaves_active, int, 0);
> > MODULE_PARM_DESC(all_slaves_active, "Keep all frames received on an interface"
> > "by setting active flag for all slaves. "
> > "0 for never (default), 1 for always.");
> > +module_param(resend_igmp, int, 0);
> > +MODULE_PARM_DESC(resend_igmp, "Number of IGMP membership reports to send on link failure");
> >
> > /*----------------------------- Global variables ----------------------------*/
> >
> > @@ -865,18 +868,14 @@ static void bond_mc_del(struct bonding *bond, void *addr)
> > }
> >
> >
> > -/*
> > - * Retrieve the list of registered multicast addresses for the bonding
> > - * device and retransmit an IGMP JOIN request to the current active
> > - * slave.
> > - */
> > -static void bond_resend_igmp_join_requests(struct bonding *bond)
> > +static void __bond_resend_igmp_join_requests(struct net_device *dev)
> > {
> > struct in_device *in_dev;
> > struct ip_mc_list *im;
> >
> > rcu_read_lock();
> > - in_dev = __in_dev_get_rcu(bond->dev);
> > +
> > + in_dev = __in_dev_get_rcu(dev);
> > if (in_dev) {
> > for (im = in_dev->mc_list; im; im = im->next)
> > ip_mc_rejoin_group(im);
> > @@ -885,6 +884,48 @@ static void bond_resend_igmp_join_requests(struct bonding *bond)
> > rcu_read_unlock();
> > }
> >
> > +
> > +/*
> > + * Retrieve the list of registered multicast addresses for the bonding
> > + * device and retransmit an IGMP JOIN request to the current active
> > + * slave.
> > + */
> > +static void bond_resend_igmp_join_requests(struct bonding *bond)
> > +{
> > + struct net_device *vlan_dev;
> > + struct vlan_entry *vlan;
> > +
> > + read_lock(&bond->lock);
> > + if (bond->kill_timers)
> > + goto out;
> > +
> > + /* rejoin all groups on bond device */
> > + __bond_resend_igmp_join_requests(bond->dev);
> > +
> > + /* rejoin all groups on vlan devices */
> > + if (bond->vlgrp) {
> > + list_for_each_entry(vlan, &bond->vlan_list, vlan_list) {
> > + vlan_dev = vlan_group_get_device(bond->vlgrp,
> > + vlan->vlan_id);
> > + if (vlan_dev)
> > + __bond_resend_igmp_join_requests(vlan_dev);
> > + }
> > + }
> > +
> > + if (--bond->igmp_retrans > 0)
> > + queue_delayed_work(bond->wq, &bond->mcast_work, HZ/5);
> > +
> > +out:
> > + read_unlock(&bond->lock);
> > +}
> > +
> > +void bond_resend_igmp_join_requests_delayed(struct work_struct *work)
> > +{
> > + struct bonding *bond = container_of(work, struct bonding,
> > + mcast_work.work);
> > + bond_resend_igmp_join_requests(bond);
> > +}
> > +
> > /*
> > * flush all members of flush->mc_list from device dev->mc_list
> > */
> > @@ -944,7 +985,10 @@ static void bond_mc_swap(struct bonding *bond, struct slave *new_active,
> >
> > netdev_for_each_mc_addr(ha, bond->dev)
> > dev_mc_add(new_active->dev, ha->addr);
> > - bond_resend_igmp_join_requests(bond);
> > +
> > + /* rejoin multicast groups */
> > + bond->igmp_retrans = bond->params.resend_igmp;
> > + queue_delayed_work(bond->wq, &bond->mcast_work, 0);
> > }
> > }
> >
> > @@ -3744,6 +3788,9 @@ static int bond_open(struct net_device *bond_dev)
> >
> > bond->kill_timers = 0;
> >
> > + /* multicast retrans */
> > + INIT_DELAYED_WORK(&bond->mcast_work, bond_resend_igmp_join_requests_delayed);
> > +
> > if (bond_is_lb(bond)) {
> > /* bond_alb_initialize must be called before the timer
> > * is started.
> > @@ -3828,6 +3875,8 @@ static int bond_close(struct net_device *bond_dev)
> > break;
> > }
> >
> > + if (delayed_work_pending(&bond->mcast_work))
> > + cancel_delayed_work(&bond->mcast_work);
> >
> > if (bond_is_lb(bond)) {
> > /* Must be called only after all
> > @@ -4699,6 +4748,9 @@ static void bond_work_cancel_all(struct bonding *bond)
> > if (bond->params.mode == BOND_MODE_8023AD &&
> > delayed_work_pending(&bond->ad_work))
> > cancel_delayed_work(&bond->ad_work);
> > +
> > + if (delayed_work_pending(&bond->mcast_work))
> > + cancel_delayed_work(&bond->mcast_work);
> > }
> >
> > /*
> > @@ -4891,6 +4943,12 @@ static int bond_check_params(struct bond_params *params)
> > all_slaves_active = 0;
> > }
> >
> > + if (resend_igmp < 0 || resend_igmp > 255) {
> > + pr_warning("Warning: resend_igmp (%d) should be between "
> > + "0 and 255, resetting to %d\n",
> > + resend_igmp, BOND_DEFAULT_RESEND_IGMP);
> > + resend_igmp = BOND_DEFAULT_RESEND_IGMP;
> > + }
> > /* reset values for TLB/ALB */
> > if ((bond_mode == BOND_MODE_TLB) ||
> > (bond_mode == BOND_MODE_ALB)) {
> > @@ -5063,6 +5121,7 @@ static int bond_check_params(struct bond_params *params)
> > params->fail_over_mac = fail_over_mac_value;
> > params->tx_queues = tx_queues;
> > params->all_slaves_active = all_slaves_active;
> > + params->resend_igmp = resend_igmp;
> >
> > if (primary) {
> > strncpy(params->primary, primary, IFNAMSIZ);
> > diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
> > index c6fdd85..c49bdb0 100644
> > --- a/drivers/net/bonding/bonding.h
> > +++ b/drivers/net/bonding/bonding.h
> > @@ -136,6 +136,7 @@ struct bond_params {
> > __be32 arp_targets[BOND_MAX_ARP_TARGETS];
> > int tx_queues;
> > int all_slaves_active;
> > + int resend_igmp;
> > };
> >
> > struct bond_parm_tbl {
> > @@ -198,6 +199,7 @@ struct bonding {
> > s32 slave_cnt; /* never change this value outside the attach/detach wrappers */
> > rwlock_t lock;
> > rwlock_t curr_slave_lock;
> > + s8 igmp_retrans;
> > s8 kill_timers;
> > s8 send_grat_arp;
> > s8 send_unsol_na;
> > @@ -223,6 +225,7 @@ struct bonding {
> > struct delayed_work arp_work;
> > struct delayed_work alb_work;
> > struct delayed_work ad_work;
> > + struct delayed_work mcast_work;
> > #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
> > struct in6_addr master_ipv6;
> > #endif
> > diff --git a/include/linux/if_bonding.h b/include/linux/if_bonding.h
> > index 2c79943..d2f78b7 100644
> > --- a/include/linux/if_bonding.h
> > +++ b/include/linux/if_bonding.h
> > @@ -84,6 +84,9 @@
> > #define BOND_DEFAULT_MAX_BONDS 1 /* Default maximum number of devices to support */
> >
> > #define BOND_DEFAULT_TX_QUEUES 16 /* Default number of tx queues per device */
> > +
> > +#define BOND_DEFAULT_RESEND_IGMP 1 /* Default number of IGMP membership reports
> > + to resend on link failure. */
> > /* hashing types */
> > #define BOND_XMIT_POLICY_LAYER2 0 /* layer 2 (MAC only), default */
> > #define BOND_XMIT_POLICY_LAYER34 1 /* layer 3+4 (IP ^ (TCP || UDP)) */
>
> --
> Flavio
^ permalink raw reply
* Re: [PATCH net-next-2.6] net: add a recursion limit in xmit path
From: David Miller @ 2010-09-29 20:23 UTC (permalink / raw)
To: eric.dumazet; +Cc: jesse, netdev
In-Reply-To: <1285787072.2813.333.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 29 Sep 2010 21:04:32 +0200
> [PATCH net-next-2.6] net: add a recursion limit in xmit path
>
> As tunnel devices are going to be lockless, we need to make sure a
> misconfigured machine won't enter an infinite loop.
>
> Add a percpu variable, and limit to three the number of stacked xmits.
>
> Reported-by: Jesse Gross <jesse@nicira.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Applied.
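
The idea in the commit message can be sketched in userspace C as below. This is only an analogy under stated assumptions: a thread-local counter stands in for the per-CPU variable, stacked_xmit() is a hypothetical stand-in for a device whose transmit re-enters the transmit path of the device below it, and the exact comparison used in the kernel patch is not reproduced here.

```c
#include <stdbool.h>

/* Limit taken from the commit message ("limit to three"). */
#define XMIT_RECURSION_LIMIT 3

/* Thread-local stand-in for the per-CPU recursion counter. */
static _Thread_local int xmit_depth;

/*
 * Hypothetical transmit routine for a stack of virtual devices:
 * devices_below is the number of devices still underneath this one.
 * Returns true if the packet reached the bottom of the stack.
 */
static bool stacked_xmit(int devices_below)
{
	bool ok;

	if (xmit_depth >= XMIT_RECURSION_LIMIT)
		return false;	/* too deep: drop instead of looping */

	xmit_depth++;
	ok = (devices_below == 0) ? true : stacked_xmit(devices_below - 1);
	xmit_depth--;

	return ok;
}
```

A misconfigured loop then behaves like a very deep stack: the packet is dropped once the limit is hit, and the counter is back at zero when the outermost call returns.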
^ permalink raw reply
* Re: [PATCH net-next-2.6] dummy: percpu stats and lockless xmit
From: David Miller @ 2010-09-29 20:26 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
In-Reply-To: <1285656633.10438.35.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 28 Sep 2010 08:50:33 +0200
> Converts dummy network device driver to :
>
> - percpu stats
>
> - 64bit stats
>
> - lockless xmit (NETIF_F_LLTX)
>
> - performance features added (NETIF_F_SG | NETIF_F_FRAGLIST |
> NETIF_F_TSO | NETIF_F_NO_CSUM | NETIF_F_HIGHDMA)
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Applied.
^ permalink raw reply
* Re: [PATCH net-next-2.6] ipip: fix percpu stats accounting
From: David Miller @ 2010-09-29 20:26 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
In-Reply-To: <1285667806.3154.8.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 28 Sep 2010 11:56:46 +0200
> commit 3c97af99a5aa1 (ipip: percpu stats accounting) forgot the fallback
> tunnel case (tunl0), and can crash pretty fast.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Applied.
^ permalink raw reply
* Re: [PATCH net-next-2.6] ipip: enable lockless xmits
From: David Miller @ 2010-09-29 20:27 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
In-Reply-To: <1285669037.3154.14.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 28 Sep 2010 12:17:17 +0200
> IPIP tunnels can benefit from lockless xmits, using NETIF_F_LLTX
...
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Applied.
^ permalink raw reply
* Re: [PATCH net-next-2.6] sit: fix percpu stats accounting
From: David Miller @ 2010-09-29 20:27 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
In-Reply-To: <1285676278.3154.21.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 28 Sep 2010 14:17:58 +0200
> commit 15fc1f7056ebd (sit: percpu stats accounting) forgot the fallback
> tunnel case (sit0), and can crash pretty fast.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Applied.
^ permalink raw reply
* Re: [PATCH net-next-2.6] sit: enable lockless xmits
From: David Miller @ 2010-09-29 20:27 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
In-Reply-To: <1285678421.3154.56.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 28 Sep 2010 14:53:41 +0200
> SIT tunnels can benefit from lockless xmits, using NETIF_F_LLTX
...
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Applied.
^ permalink raw reply
* Re: [PATCH net-next-2.6] ip6tnl: percpu stats accounting
From: David Miller @ 2010-09-29 20:27 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
In-Reply-To: <1285680214.3154.65.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 28 Sep 2010 15:23:34 +0200
> Maintain per_cpu tx_bytes, tx_packets, rx_bytes, rx_packets.
>
> Other seldom used fields are kept in netdev->stats structure, possibly
> unsafe.
>
> This is a preliminary work to support lockless transmit path, and
> correct RX stats, that are already unsafe.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Applied.
^ permalink raw reply
* Re: [PATCH net-next2.6] net: rename netdev rx_queue to ingress_queue
From: David Miller @ 2010-09-29 20:27 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev, jarkao2
In-Reply-To: <1285689517.3154.76.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 28 Sep 2010 17:58:37 +0200
> There is some confusion with rx_queue name after RPS, and net drivers
> private rx_queue fields.
>
> I suggest renaming "struct net_device"->rx_queue to ingress_queue.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Applied.
^ permalink raw reply
* Re: [PATCH] bonding: rejoin multicast groups on VLANs
From: Flavio Leitner @ 2010-09-29 20:38 UTC (permalink / raw)
To: Andy Gospodarek; +Cc: netdev
In-Reply-To: <20100929195411.GZ7497@gospo.rdu.redhat.com>
On Wed, Sep 29, 2010 at 03:54:11PM -0400, Andy Gospodarek wrote:
> On Wed, Sep 29, 2010 at 04:35:39PM -0300, Flavio Leitner wrote:
> > On Wed, Sep 29, 2010 at 02:44:13PM -0400, Andy Gospodarek wrote:
> > > On Wed, Sep 29, 2010 at 04:12:24AM -0300, Flavio Leitner wrote:
> > > > It fixes bonding to rejoin multicast groups added
> > > > to VLAN devices on top of bonding when a failover
> > > > happens.
> > > >
> > > > The first packet may be discarded, so the timer
> > > > ensures that at least 3 Reports are sent.
> > > >
> > >
> > > Good find, Flavio. Clearly the fact that multicast membership is broken
> > > needs to be fixed, but I would rather not see timers used at all. We
> > > worked hard in the past to eliminate timers for several reasons, so I
> > > would rather see a workqueue used.
> >
> > I noticed that the code is using workqueues now, just thought a
> > simple thing which may run couple times would fit perfectly with
> > a simple timer.
> >
>
> Timers run in softirq context, so I'd rather not add code that takes
> locks and runs in softirq context.
>
> >
> > > I also don't like retransmitting the membership report 3 times when it
> > > may not be needed. Though many switches can handle it, the cost of
> > > receiving and processing what might be a large list of multicast
> > > addresses every 200ms for 600ms doesn't seem ideal. It also feels like
> > > a hack. :)
> >
> > A parameter is definitely much better, but I wasn't sure about
> > the patch approach, so I was expecting a review like this before
> > doing the refinements needed. Better to post early, right? :)
> >
> > I see your point about changing the default to one membership report,
> > but during a failover we can't be sure that everything has been
> > received. Also, the link isn't supposed to keep failing and flooding
> > the network, so I would rather have a couple of membership reports
> > sent than watch an important multicast application fail.
> >
> > Perhaps 3 is too many, but one sounds too few to me.
> >
> > What do you think?
> >
>
> Adding a tunable parameter allows the administrator to decide how many
> is enough. I would rather keep the default at one and add the tunable
> parameter (which needs to be added to bond_sysfs.c to be effective).
>
> I have not heard loud complaints about only sending one since the code
> to send retransmits of membership reports was added a few years ago, so
> I'm inclined to think it is working well for most users (or no one is
> using bonding).
>
> Maybe it would be best to break this into 2 patches. One that simply
> fixes the failover code so it works with VLANs (that could be done
> easily today) and another patch that can add the code to send multiple
> retransmits. Would you be willing to do that?
Sure, I can do it and then start another testing session here.
--
Flavio
^ permalink raw reply
* Re: [PATCH] [XFRM]: Don't dereference error pointer dst1
From: Eric Dumazet @ 2010-09-29 20:45 UTC (permalink / raw)
To: Roel Kluin; +Cc: David S. Miller, netdev, Andrew Morton, LKML
In-Reply-To: <4CA3BFA3.1060204@gmail.com>
On Thursday, 30 September 2010 at 00:37 +0200, Roel Kluin wrote:
> Don't dereference dst1 when it's an error pointer.
>
> Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
> ---
> net/xfrm/xfrm_policy.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> I just noticed this by code analysis. It wasn't tested in any way.
>
> diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
> index cbab6e1..b186c3d 100644
> --- a/net/xfrm/xfrm_policy.c
> +++ b/net/xfrm/xfrm_policy.c
> @@ -1414,13 +1414,14 @@ static struct dst_entry *xfrm_bundle_create(struct xfrm_policy *policy,
>
> for (; i < nx; i++) {
> struct xfrm_dst *xdst = xfrm_alloc_dst(net, family);
> - struct dst_entry *dst1 = &xdst->u.dst;
> + struct dst_entry *dst1;
>
> err = PTR_ERR(xdst);
> if (IS_ERR(xdst)) {
> dst_release(dst);
> goto put_states;
> }
> + dst1 = &xdst->u.dst;
>
> if (!dst_prev)
> dst0 = dst1;
>
This is not a dereference, but a cast from "struct xfrm_dst *" to
"struct dst_entry *"
^ permalink raw reply
* Re: [PATCH] [XFRM]: Don't dereference error pointer dst1
From: David Miller @ 2010-09-29 20:49 UTC (permalink / raw)
To: eric.dumazet; +Cc: roel.kluin, netdev, akpm, linux-kernel
In-Reply-To: <1285793113.5211.8.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 29 Sep 2010 22:45:13 +0200
> This is not a dereference, but a cast from "struct xfrm_dst *" to
> "struct dst_entry *"
Right, please teach whatever tool caught this that taking an address
of a member of an object referenced by pointer is not an error like a
true dereference is.
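
The point can be sketched with simplified stand-in structures. The layout below is an assumption made for illustration (the embedded entry is the first member of its container), the *_like names are hypothetical, and the example uses a valid object because ISO C does not define forming a member address through an invalid pointer, even though no load takes place.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; only the layout matters. */
struct dst_entry_like {
	int refcnt;
};

struct xfrm_dst_like {
	union {
		struct dst_entry_like dst;
	} u;		/* first member, as assumed for this sketch */
	int extra;
};

/* Offset of u.dst inside the container; zero when it is the first member. */
static size_t dst_offset(void)
{
	return offsetof(struct xfrm_dst_like, u.dst);
}

/* Taking the member's address is pointer arithmetic; nothing is loaded. */
static struct dst_entry_like *to_dst(struct xfrm_dst_like *xdst)
{
	return &xdst->u.dst;
}
```

With an offset of zero, to_dst() returns the same address it was given, which is why the expression behaves like a cast rather than a memory access.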
^ permalink raw reply
* Re: Packet time delays on multi-core systems
From: Eric Dumazet @ 2010-09-29 21:45 UTC (permalink / raw)
To: Alexey Vlasov; +Cc: Linux Kernel Mailing List, netdev
In-Reply-To: <20100929191851.GC86786@beaver.vrungel.ru>
On Wednesday, 29 September 2010 at 23:18 +0400, Alexey Vlasov wrote:
> Hi.
>
> I'm not sure whether I should write here; maybe I should ask on the
> netfilter mailing list instead. If something is wrong, please correct me.
>
CC netdev
> I've got a rather large Linux shared hosting setup, and on my new servers I
> noticed something strange. This simple rule:
>
> # iptables -A OUTPUT -p tcp -m tcp --dport 80 --tcp-flags
> FIN,SYN,RST,ACK SYN -j LOG --log-prefix "ipsec:SYN-OUTPUT "
> --log-uid
>
> causes significant delays even for a plain ping from an adjacent server
> on the local area network. I don't know exactly what is wrong: whether the
> cause is poor kernel support for the new hardware, or the new kernel in
> general. In practice it means that even during simple DDoS attacks on
> client sites it becomes difficult to do anything, and overall everything
> just works worse.
>
> It seems to me that it only gets worse as the number of CPU cores
> increases, and that iptables apparently uses the resources of only one
> processor, which for some reason are not enough for it.
>
It's not true. iptables can run on all CPUs in parallel.
> newbox # iptables -F
> otherbox # ping -c 100 newbox
> ...
> 100 packets transmitted, 100 received, 0% packet loss, time 100044ms
> rtt min/avg/max/mdev = 0.133/2.637/17.172/3.736 ms
>
> OK.
>
> newbox # iptables -A OUTPUT -p tcp -m tcp --dport 80 --tcp-flags FIN,SYN,RST,ACK SYN
> -j LOG --log-prefix "ipsec:SYN-OUTPUT " --log-uid
> otherbox # ping -c 100 newbox
> ...
> 64 bytes from (newbox): icmp_seq=3 ttl=64 time=1.58 ms
> 64 bytes from (newbox): icmp_seq=4 ttl=64 time=98.7 ms
> 64 bytes from (newbox): icmp_seq=5 ttl=64 time=18.2 ms
> 64 bytes from (newbox): icmp_seq=6 ttl=64 time=6.13 ms
> 64 bytes from (newbox): icmp_seq=7 ttl=64 time=108 ms
> ...
> 64 bytes from (newbox): icmp_seq=55 ttl=64 time=2.30 ms
> 64 bytes from (newbox): icmp_seq=56 ttl=64 time=59.9 ms
> 64 bytes from (newbox): icmp_seq=57 ttl=64 time=0.155 ms
> ...
> 64 bytes from (newbox): icmp_seq=61 ttl=64 time=13.4 ms
> 64 bytes from (newbox): icmp_seq=62 ttl=64 time=55.0 ms
> 64 bytes from (newbox): icmp_seq=63 ttl=64 time=0.233 ms
> ...
> 100 packets transmitted, 100 received, 0% packet loss, time 99957ms
> rtt min/avg/max/mdev = 0.111/7.519/108.061/18.478 ms
>
> newbox # iptables -L -v -n
> Chain INPUT (policy ACCEPT 346K packets, 213M bytes)
> pkts bytes target prot opt in out source destination
>
> Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
> pkts bytes target prot opt in out source destination
>
> Chain OUTPUT (policy ACCEPT 296K packets, 290M bytes)
> pkts bytes target prot opt in out source destination
> 234 14040 LOG tcp -- * * 0.0.0.0/0 0.0.0.0/0
> tcp dpt:80 flags:0x17/0x02 LOG flags 8 level 4 prefix `ipsec:SYN-OUTPUT- '
>
> My old server: Intel SR1500, Xeon 5430, kernel 2.6.24 - 2.6.28
> Newbox: SR1620UR, 5650, kernel 2.6.32
>
> Thanks in advance.
>
Seems strange indeed, since the LOG rule you added should not slow down ICMP
traffic that much.
But if you send SYN packets at the same time (which are logged), this might
slow down the reception of (and answers to) ICMP frames. The LOG target can
be quite expensive...
Does using other rules give the same problem?
iptables -A OUTPUT -p tcp -m tcp --dport 80 --tcp-flags FIN,SYN,RST,ACK SYN
iptables -A OUTPUT -p tcp -m tcp --dport 80 --tcp-flags FIN,SYN,RST,ACK SYN
iptables -A OUTPUT -p tcp -m tcp --dport 80 --tcp-flags FIN,SYN,RST,ACK SYN
iptables -A OUTPUT -p tcp -m tcp --dport 80 --tcp-flags FIN,SYN,RST,ACK SYN
^ permalink raw reply
* [PATCH net-next] ipv4: __mkroute_output() speedup
From: Eric Dumazet @ 2010-09-29 21:53 UTC (permalink / raw)
To: David Miller; +Cc: netdev
While doing stress tests with a disabled IP route cache, I found that
__mkroute_output() was touching the in_device atomic refcount three times.
Use RCU to touch it once and reduce cache line ping-pong.
Before patch
time to perform the test
real 1m42.009s
user 0m12.545s
sys 25m0.726s
Profile :
16109.00 26.4% ip_route_output_slow vmlinux
7434.00 12.2% dst_destroy vmlinux
3280.00 5.4% fib_rules_lookup vmlinux
3252.00 5.3% fib_semantic_match vmlinux
2622.00 4.3% fib_table_lookup vmlinux
2535.00 4.1% dst_alloc vmlinux
1750.00 2.9% _raw_read_lock vmlinux
1532.00 2.5% rt_set_nexthop vmlinux
After patch
real 1m36.503s
user 0m12.977s
sys 23m25.608s
14234.00 22.4% ip_route_output_slow vmlinux
8717.00 13.7% dst_destroy vmlinux
4052.00 6.4% fib_rules_lookup vmlinux
3951.00 6.2% fib_semantic_match vmlinux
3191.00 5.0% dst_alloc vmlinux
1764.00 2.8% fib_table_lookup vmlinux
1692.00 2.7% _raw_read_lock vmlinux
1605.00 2.5% rt_set_nexthop vmlinux
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
net/ipv4/route.c | 33 +++++++++++++++------------------
1 file changed, 15 insertions(+), 18 deletions(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 98beda4..ea89500 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2358,9 +2358,8 @@ static int __mkroute_output(struct rtable **result,
struct rtable *rth;
struct in_device *in_dev;
u32 tos = RT_FL_TOS(oldflp);
- int err = 0;
- if (ipv4_is_loopback(fl->fl4_src) && !(dev_out->flags&IFF_LOOPBACK))
+ if (ipv4_is_loopback(fl->fl4_src) && !(dev_out->flags & IFF_LOOPBACK))
return -EINVAL;
if (fl->fl4_dst == htonl(0xFFFFFFFF))
@@ -2373,11 +2372,12 @@ static int __mkroute_output(struct rtable **result,
if (dev_out->flags & IFF_LOOPBACK)
flags |= RTCF_LOCAL;
- /* get work reference to inet device */
- in_dev = in_dev_get(dev_out);
- if (!in_dev)
+ rcu_read_lock();
+ in_dev = __in_dev_get_rcu(dev_out);
+ if (!in_dev) {
+ rcu_read_unlock();
return -EINVAL;
-
+ }
if (res->type == RTN_BROADCAST) {
flags |= RTCF_BROADCAST | RTCF_LOCAL;
if (res->fi) {
@@ -2385,13 +2385,13 @@ static int __mkroute_output(struct rtable **result,
res->fi = NULL;
}
} else if (res->type == RTN_MULTICAST) {
- flags |= RTCF_MULTICAST|RTCF_LOCAL;
+ flags |= RTCF_MULTICAST | RTCF_LOCAL;
if (!ip_check_mc(in_dev, oldflp->fl4_dst, oldflp->fl4_src,
oldflp->proto))
flags &= ~RTCF_LOCAL;
/* If multicast route do not exist use
- default one, but do not gateway in this case.
- Yes, it is hack.
+ * default one, but do not gateway in this case.
+ * Yes, it is hack.
*/
if (res->fi && res->prefixlen < 4) {
fib_info_put(res->fi);
@@ -2402,9 +2402,12 @@ static int __mkroute_output(struct rtable **result,
rth = dst_alloc(&ipv4_dst_ops);
if (!rth) {
- err = -ENOBUFS;
- goto cleanup;
+ rcu_read_unlock();
+ return -ENOBUFS;
}
+ in_dev_hold(in_dev);
+ rcu_read_unlock();
+ rth->idev = in_dev;
atomic_set(&rth->dst.__refcnt, 1);
rth->dst.flags= DST_HOST;
@@ -2425,7 +2428,6 @@ static int __mkroute_output(struct rtable **result,
cache entry */
rth->dst.dev = dev_out;
dev_hold(dev_out);
- rth->idev = in_dev_get(dev_out);
rth->rt_gateway = fl->fl4_dst;
rth->rt_spec_dst= fl->fl4_src;
@@ -2460,13 +2462,8 @@ static int __mkroute_output(struct rtable **result,
rt_set_nexthop(rth, res, 0);
rth->rt_flags = flags;
-
*result = rth;
- cleanup:
- /* release work reference to inet device */
- in_dev_put(in_dev);
-
- return err;
+ return 0;
}
static int ip_mkroute_output(struct rtable **rp,
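
The saving described in the changelog can be pictured with a small userspace model. It is a sketch under assumptions, not kernel code: C11 atomics stand in for the kernel's atomic_t, the RCU read-side section is not modelled at all, and the *_like and mkroute_* names are hypothetical. It only counts how often the shared counter is written in each flow.

```c
#include <stdatomic.h>

struct in_device_like {
	atomic_int refcnt;
	int atomic_ops;	/* how many times the shared counter was touched */
};

static void idev_hold(struct in_device_like *idev)
{
	atomic_fetch_add(&idev->refcnt, 1);
	idev->atomic_ops++;
}

static void idev_put(struct in_device_like *idev)
{
	atomic_fetch_sub(&idev->refcnt, 1);
	idev->atomic_ops++;
}

/* Old flow: working reference, long-lived reference, drop working reference. */
static void mkroute_before(struct in_device_like *idev)
{
	idev_hold(idev);	/* in_dev_get() for the working reference */
	idev_hold(idev);	/* in_dev_get() stored in rth->idev */
	idev_put(idev);		/* in_dev_put() at cleanup */
}

/* New flow: the lookup is covered by RCU, so only one hold remains. */
static void mkroute_after(struct in_device_like *idev)
{
	idev_hold(idev);	/* in_dev_hold() stored in rth->idev */
}
```

Both flows leave exactly one reference behind for the new route entry; the difference is three writes to a contended cache line versus one.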
^ permalink raw reply related
* [PATCH] [XFRM]: Don't dereference error pointer dst1
From: Roel Kluin @ 2010-09-29 22:37 UTC (permalink / raw)
To: David S. Miller, netdev, Andrew Morton, LKML
Don't dereference dst1 when it's an error pointer.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
---
net/xfrm/xfrm_policy.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
I just noticed this by code analysis. It wasn't tested in any way.
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index cbab6e1..b186c3d 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1414,13 +1414,14 @@ static struct dst_entry *xfrm_bundle_create(struct xfrm_policy *policy,
for (; i < nx; i++) {
struct xfrm_dst *xdst = xfrm_alloc_dst(net, family);
- struct dst_entry *dst1 = &xdst->u.dst;
+ struct dst_entry *dst1;
err = PTR_ERR(xdst);
if (IS_ERR(xdst)) {
dst_release(dst);
goto put_states;
}
+ dst1 = &xdst->u.dst;
if (!dst_prev)
dst0 = dst1;
^ permalink raw reply related
* Re: [patch v2] ipvs: IPv6 tunnel mode
From: Simon Horman @ 2010-09-30 1:22 UTC (permalink / raw)
To: lvs-devel, netfilter-devel, netdev
Cc: Hans Schillstrom, Julian Anastasov, Julius Volz, Wensong Zhang,
Patrick McHardy
In-Reply-To: <20100927135913.GA21785@verge.net.au>
On Mon, Sep 27, 2010 at 10:59:14PM +0900, Simon Horman wrote:
> From: Julian Anastasov <ja@ssi.bg>
>
> Tunnel mode for IPv6 doesn't work.
Patrick, can you please drop this patch for now.
Hans has found some problems with it.
http://archive.linuxvirtualserver.org/html/lvs-devel/2010-09/msg00073.html
^ permalink raw reply
* [PATCH] net: code cleanups
From: Changli Gao @ 2010-09-30 2:24 UTC (permalink / raw)
To: David S. Miller
Cc: Alexey Kuznetsov, Pekka Savola (ipv6), James Morris,
Hideaki YOSHIFUJI, Patrick McHardy, netdev, Changli Gao
Compare operations are more readable, and compilers generate the same code
for both.
Use the macros fl4_* to shrink the length of the lines.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
---
net/ipv4/af_inet.c | 7 +++----
net/ipv4/route.c | 27 ++++++++++++---------------
2 files changed, 15 insertions(+), 19 deletions(-)
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index f581f77..ef26640 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1338,10 +1338,9 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,
iph2 = ip_hdr(p);
- if ((iph->protocol ^ iph2->protocol) |
- (iph->tos ^ iph2->tos) |
- ((__force u32)iph->saddr ^ (__force u32)iph2->saddr) |
- ((__force u32)iph->daddr ^ (__force u32)iph2->daddr)) {
+ if (iph->protocol != iph2->protocol || iph->tos != iph2->tos ||
+ (__force u32)iph->saddr != (__force u32)iph2->saddr ||
+ (__force u32)iph->daddr != (__force u32)iph2->daddr) {
NAPI_GRO_CB(p)->same_flow = 0;
continue;
}
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 98beda4..6b00fde 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -683,19 +683,18 @@ static inline bool rt_caching(const struct net *net)
static inline bool compare_hash_inputs(const struct flowi *fl1,
const struct flowi *fl2)
{
- return ((((__force u32)fl1->nl_u.ip4_u.daddr ^ (__force u32)fl2->nl_u.ip4_u.daddr) |
- ((__force u32)fl1->nl_u.ip4_u.saddr ^ (__force u32)fl2->nl_u.ip4_u.saddr) |
- (fl1->iif ^ fl2->iif)) == 0);
+ return (__force u32)fl1->fl4_dst == (__force u32)fl2->fl4_dst &&
+ (__force u32)fl1->fl4_src == (__force u32)fl2->fl4_src &&
+ fl1->iif == fl2->iif;
}
static inline int compare_keys(struct flowi *fl1, struct flowi *fl2)
{
- return (((__force u32)fl1->nl_u.ip4_u.daddr ^ (__force u32)fl2->nl_u.ip4_u.daddr) |
- ((__force u32)fl1->nl_u.ip4_u.saddr ^ (__force u32)fl2->nl_u.ip4_u.saddr) |
- (fl1->mark ^ fl2->mark) |
- (*(u16 *)&fl1->nl_u.ip4_u.tos ^ *(u16 *)&fl2->nl_u.ip4_u.tos) |
- (fl1->oif ^ fl2->oif) |
- (fl1->iif ^ fl2->iif)) == 0;
+ return (__force u32)fl1->fl4_dst == (__force u32)fl2->fl4_dst &&
+ (__force u32)fl1->fl4_src == (__force u32)fl2->fl4_src &&
+ fl1->mark == fl2->mark &&
+ *(u16 *)&fl1->fl4_tos == *(u16 *)&fl2->fl4_tos &&
+ fl1->oif == fl2->oif && fl1->iif == fl2->iif;
}
static inline int compare_netns(struct rtable *rt1, struct rtable *rt2)
@@ -2286,12 +2285,10 @@ int ip_route_input_common(struct sk_buff *skb, __be32 daddr, __be32 saddr,
for (rth = rcu_dereference(rt_hash_table[hash].chain); rth;
rth = rcu_dereference(rth->dst.rt_next)) {
- if ((((__force u32)rth->fl.fl4_dst ^ (__force u32)daddr) |
- ((__force u32)rth->fl.fl4_src ^ (__force u32)saddr) |
- (rth->fl.iif ^ iif) |
- rth->fl.oif |
- (rth->fl.fl4_tos ^ tos)) == 0 &&
- rth->fl.mark == skb->mark &&
+ if ((__force u32)rth->fl.fl4_dst == (__force u32)daddr &&
+ (__force u32)rth->fl.fl4_src == (__force u32)saddr &&
+ rth->fl.iif == iif && rth->fl.oif == 0 &&
+ rth->fl.fl4_tos == tos && rth->fl.mark == skb->mark &&
net_eq(dev_net(rth->dst.dev), net) &&
!rt_is_expired(rth)) {
if (noref) {
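
The two styles touched by this cleanup can be compared side by side in a small standalone sketch. The struct and function names are hypothetical and the field set is reduced to three unsigned fields. The sketch only shows that both forms give the same answer; whether a given compiler emits identical code for them is the changelog's claim and is not checked here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Reduced, hypothetical flow key with three fields. */
struct flow_key {
	uint32_t daddr;
	uint32_t saddr;
	uint32_t iif;
};

/* Old style: XOR each pair, OR the results, test once against zero. */
static bool keys_equal_xor(const struct flow_key *a, const struct flow_key *b)
{
	return ((a->daddr ^ b->daddr) |
		(a->saddr ^ b->saddr) |
		(a->iif ^ b->iif)) == 0;
}

/* New style: plain comparisons joined with logical AND. */
static bool keys_equal_cmp(const struct flow_key *a, const struct flow_key *b)
{
	return a->daddr == b->daddr &&
	       a->saddr == b->saddr &&
	       a->iif == b->iif;
}
```

One visible difference is evaluation order: the comparison form may stop at the first mismatch, while the XOR form always reads every field.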
^ permalink raw reply related
* Re: [PATCH] Phonet: Correct header retrieval after pskb_may_pull
From: David Miller @ 2010-09-30 2:42 UTC (permalink / raw)
To: remi.denis-courmont
Cc: kumar.sanghvi, netdev, eric.dumazet, gulshan.karmani,
linus.walleij
In-Reply-To: <201009290023.44717.remi.denis-courmont@nokia.com>
From: "Rémi Denis-Courmont" <remi.denis-courmont@nokia.com>
Date: Wed, 29 Sep 2010 00:23:44 +0300
> On Tuesday 28 September 2010 12:10:42 ext Kumar A Sanghvi, you wrote:
>> From: Kumar Sanghvi <kumar.sanghvi@stericsson.com>
>>
>> Retrieve the header after doing pskb_may_pull, since pskb_may_pull
>> could change the buffer structure.
>>
>> This is based on the comment given by Eric Dumazet on Phonet
>> Pipe controller patch for a similar problem.
>>
>> Signed-off-by: Kumar Sanghvi <kumar.sanghvi@stericsson.com>
>> Acked-by: Linus Walleij <linus.walleij@stericsson.com>
...
> Acked-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Applied, thanks everyone.
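For readers following the thread, the hazard the patch addresses can be sketched in userspace. This is only a loose analogy under stated assumptions: `struct buf`, `buf_may_pull()` and `read_type()` are hypothetical names invented for illustration and are not kernel APIs. The point it shows is that a call which may reallocate the underlying buffer invalidates any header pointer taken before it, so the pointer has to be fetched afterwards.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a packet buffer; not the kernel's sk_buff. */
struct buf {
	unsigned char *data;	/* start of the linear data */
	size_t cap;		/* bytes available at data */
};

/*
 * Hypothetical analogue of pskb_may_pull(): make sure at least `need`
 * bytes are available at b->data. It may reallocate, so any pointer
 * derived from the old b->data can be stale once this returns.
 */
int buf_may_pull(struct buf *b, size_t need)
{
	unsigned char *p;

	assert(b != NULL);
	if (need <= b->cap)
		return 1;

	p = realloc(b->data, need);
	if (!p)
		return 0;

	/* Zero the newly added tail so later reads are well defined. */
	memset(p + b->cap, 0, need - b->cap);
	b->data = p;
	b->cap = need;
	return 1;
}

/* Correct ordering: pull first, then take the header pointer. */
int read_type(struct buf *b, size_t hdr_len)
{
	const unsigned char *hdr;

	if (!buf_may_pull(b, hdr_len))
		return -1;

	hdr = b->data;	/* fetched only after the possible reallocation */
	return hdr[0];
}
```

Taking `hdr` before the call to `buf_may_pull()` would compile just as well, which is why this class of bug tends to be found only in review.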
* Re: [PATCH net-next] arp: remove unnecessary export of arp_broken_ops
From: David Miller @ 2010-09-30 2:46 UTC (permalink / raw)
To: shemminger; +Cc: netdev
In-Reply-To: <20100929120802.2f642d0d@s6510>
From: Stephen Hemminger <shemminger@vyatta.com>
Date: Wed, 29 Sep 2010 12:08:02 +0900
> arp_broken_ops is only used in arp.c
>
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Applied.
* Re: [PATCH net-next] tcp: tcp_enter_quickack_mode can be static
From: David Miller @ 2010-09-30 2:46 UTC (permalink / raw)
To: shemminger; +Cc: netdev
In-Reply-To: <20100929143014.7f32cec3@s6510>
From: Stephen Hemminger <shemminger@vyatta.com>
Date: Wed, 29 Sep 2010 14:30:14 +0900
> Function only used in tcp_input.c
>
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Applied.
* Re: [PATCH] de2104x: disable media debug messages by default
From: David Miller @ 2010-09-30 2:47 UTC (permalink / raw)
To: jgarzik; +Cc: linux, netdev, linux-kernel
In-Reply-To: <4CA23941.2050803@pobox.com>
From: Jeff Garzik <jgarzik@pobox.com>
Date: Tue, 28 Sep 2010 14:51:45 -0400
> On 09/28/2010 02:18 PM, Ondrej Zary wrote:
>> Print media debug messages only when HW debug is enabled.
>>
>> Signed-off-by: Ondrej Zary <linux@rainbow-software.org>
>>
>> --- linux-2.6.36-rc3-/drivers/net/tulip/de2104x.c 2010-09-28 19:50:51.000000000 +0200
>> +++ linux-2.6.36-rc3/drivers/net/tulip/de2104x.c 2010-09-28 20:05:34.000000000 +0200
>> @@ -948,8 +948,9 @@ static void de_set_media (struct de_priv
>> else
>> macmode &= ~FullDuplex;
>>
>> - if (netif_msg_link(de)) {
>> + if (netif_msg_link(de))
>> dev_info(&de->dev->dev, "set link %s\n", media_name[media]);
>> + if (netif_msg_hw(de)) {
>> dev_info(&de->dev->dev, "mode 0x%x, sia 0x%x,0x%x,0x%x,0x%x\n",
>> dr32(MacMode), dr32(SIAStatus),
>> dr32(CSR13), dr32(CSR14), dr32(CSR15));
>
> Acked-by: Jeff Garzik <jgarzik@redhat.com>
Applied to net-next-2.6
* Re: [PATCH] de2104x: remove experimental status
From: David Miller @ 2010-09-30 2:48 UTC (permalink / raw)
To: jgarzik; +Cc: linux, netdev, linux-kernel
In-Reply-To: <4CA23963.8040409@pobox.com>
From: Jeff Garzik <jgarzik@pobox.com>
Date: Tue, 28 Sep 2010 14:52:19 -0400
> On 09/28/2010 02:46 PM, Ondrej Zary wrote:
>> It should be ready after 8 years...remove the experimental dependency.
>>
>> Signed-off-by: Ondrej Zary <linux@rainbow-software.org>
>>
>> --- linux-2.6.36-rc3-/drivers/net/tulip/Kconfig 2010-08-29 17:36:04.000000000 +0200
>> +++ linux-2.6.36-rc3/drivers/net/tulip/Kconfig 2010-09-28 19:49:46.000000000 +0200
>> @@ -11,8 +11,8 @@ menuconfig NET_TULIP
>> if NET_TULIP
>>
>> config DE2104X
>> - tristate "Early DECchip Tulip (dc2104x) PCI support (EXPERIMENTAL)"
>> depends on PCI && EXPERIMENTAL
>> + tristate "Early DECchip Tulip (dc2104x) PCI support"
>> + depends on PCI
>> select CRC32
>> ---help---
>> This driver is developed for the SMC EtherPower series Ethernet
>
> Well... it's not the years, it's the quality... which I think has
> been sufficiently increased.
>
> Acked-by: Jeff Garzik <jgarzik@redhat.com>
Also applied to net-next-2.6, thanks.
* Re: [PATCH net-next] bnx2x: Moved enabling of MSI to the bnx2x_set_num_queues()
From: David Miller @ 2010-09-30 2:48 UTC (permalink / raw)
To: dmitry; +Cc: netdev, vladz, eilong
In-Reply-To: <1285758337.7908.7.camel@lb-tlvb-dmitry>
From: "Dmitry Kravkov" <dmitry@broadcom.com>
Date: Wed, 29 Sep 2010 13:05:37 +0200
> Moved enabling of MSI to bnx2x_set_num_queues(), the same function that
> handles the initialization of MSI-X.
>
> From: Vladislav Zolotarov <vladz@broadcom.com>
> Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
> Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
> Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
> ---
> Respin of the patch originally prepared by Vladislav Zolotarov.
> This patch is required for the integration of Ben Hutchings' bnx2x patch from
> the "netif_set_real_num_{rx,tx}_queues" patch series, since falling back from
> MSI-X to MSI mode due to lack of memory is currently broken.
Applied, thanks.
* Re: [PATCH net-next 2.6] myri10ge: DCA update (resubmit)
From: David Miller @ 2010-09-30 2:49 UTC (permalink / raw)
To: gallatin; +Cc: netdev, loic
In-Reply-To: <4CA23038.3030808@myri.com>
From: Andrew Gallatin <gallatin@myri.com>
Date: Tue, 28 Sep 2010 14:13:12 -0400
> This patch contains the following DCA improvements to myri10ge:
>
> 1) Finally move myri10ge to use dca3 API
>
> 2) Disable PCIe relaxed ordering when enabling DCA on
> myri10ge. This provides a performance boost on Nehalem
> based Xeons
>
> 3) Make sure to properly initialize NIC's DCA state when it is
> enabled, rather than giving the NIC a bogus tag (0) and waiting
> for the first received packet to trigger an update. Not using a
> real tag can cause hardware exceptions on some motherboards
> when a CPU socket is empty.
>
> 4) Always update the cached CPU when our interrupt affinity changes
> so as to avoid excessive calls to dca3_get_tag()
>
> Signed-off-by: Andrew Gallatin <gallatin@myri.com>
> Signed-off-by: Loic Prylli <loic@myri.com>
Applied.
* Re: [PATCH] net: code cleanups
From: Joe Perches @ 2010-09-30 2:49 UTC (permalink / raw)
To: Changli Gao
Cc: David S. Miller, Alexey Kuznetsov, Pekka Savola (ipv6),
James Morris, Hideaki YOSHIFUJI, Patrick McHardy, netdev
In-Reply-To: <1285813497-7384-1-git-send-email-xiaosuo@gmail.com>
On Thu, 2010-09-30 at 10:24 +0800, Changli Gao wrote:
> Compare operations are more readable, and compilers generate the same code
> for both.
As far as I know, not all supported versions of gcc
generate the same code.
Also, you could probably now remove the (__force u32) casts.
> Use the macros fl4_* to shrink the length of the lines.
>
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
> ---
> net/ipv4/af_inet.c | 7 +++----
> net/ipv4/route.c | 27 ++++++++++++---------------
> 2 files changed, 15 insertions(+), 19 deletions(-)
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index f581f77..ef26640 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1338,10 +1338,9 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,
>
> iph2 = ip_hdr(p);
>
> - if ((iph->protocol ^ iph2->protocol) |
> - (iph->tos ^ iph2->tos) |
> - ((__force u32)iph->saddr ^ (__force u32)iph2->saddr) |
> - ((__force u32)iph->daddr ^ (__force u32)iph2->daddr)) {
> + if (iph->protocol != iph2->protocol || iph->tos != iph2->tos ||
> + (__force u32)iph->saddr != (__force u32)iph2->saddr ||
> + (__force u32)iph->daddr != (__force u32)iph2->daddr) {
> NAPI_GRO_CB(p)->same_flow = 0;
> continue;
> }
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 98beda4..6b00fde 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -683,19 +683,18 @@ static inline bool rt_caching(const struct net *net)
> static inline bool compare_hash_inputs(const struct flowi *fl1,
> const struct flowi *fl2)
> {
> - return ((((__force u32)fl1->nl_u.ip4_u.daddr ^ (__force u32)fl2->nl_u.ip4_u.daddr) |
> - ((__force u32)fl1->nl_u.ip4_u.saddr ^ (__force u32)fl2->nl_u.ip4_u.saddr) |
> - (fl1->iif ^ fl2->iif)) == 0);
> + return (__force u32)fl1->fl4_dst == (__force u32)fl2->fl4_dst &&
> + (__force u32)fl1->fl4_src == (__force u32)fl2->fl4_src &&
> + fl1->iif == fl2->iif;
> }
>
> static inline int compare_keys(struct flowi *fl1, struct flowi *fl2)
> {
> - return (((__force u32)fl1->nl_u.ip4_u.daddr ^ (__force u32)fl2->nl_u.ip4_u.daddr) |
> - ((__force u32)fl1->nl_u.ip4_u.saddr ^ (__force u32)fl2->nl_u.ip4_u.saddr) |
> - (fl1->mark ^ fl2->mark) |
> - (*(u16 *)&fl1->nl_u.ip4_u.tos ^ *(u16 *)&fl2->nl_u.ip4_u.tos) |
> - (fl1->oif ^ fl2->oif) |
> - (fl1->iif ^ fl2->iif)) == 0;
> + return (__force u32)fl1->fl4_dst == (__force u32)fl2->fl4_dst &&
> + (__force u32)fl1->fl4_src == (__force u32)fl2->fl4_src &&
> + fl1->mark == fl2->mark &&
> + *(u16 *)&fl1->fl4_tos == *(u16 *)&fl2->fl4_tos &&
> + fl1->oif == fl2->oif && fl1->iif == fl2->iif;
> }
>
> static inline int compare_netns(struct rtable *rt1, struct rtable *rt2)
> @@ -2286,12 +2285,10 @@ int ip_route_input_common(struct sk_buff *skb, __be32 daddr, __be32 saddr,
>
> for (rth = rcu_dereference(rt_hash_table[hash].chain); rth;
> rth = rcu_dereference(rth->dst.rt_next)) {
> - if ((((__force u32)rth->fl.fl4_dst ^ (__force u32)daddr) |
> - ((__force u32)rth->fl.fl4_src ^ (__force u32)saddr) |
> - (rth->fl.iif ^ iif) |
> - rth->fl.oif |
> - (rth->fl.fl4_tos ^ tos)) == 0 &&
> - rth->fl.mark == skb->mark &&
> + if ((__force u32)rth->fl.fl4_dst == (__force u32)daddr &&
> + (__force u32)rth->fl.fl4_src == (__force u32)saddr &&
> + rth->fl.iif == iif && rth->fl.oif == 0 &&
> + rth->fl.fl4_tos == tos && rth->fl.mark == skb->mark &&
> net_eq(dev_net(rth->dst.dev), net) &&
> !rt_is_expired(rth)) {
> if (noref) {
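As a side note on the two styles under discussion, here is a minimal userspace sketch. It assumes a simplified, hypothetical `struct flow_key` rather than the kernel's `struct flowi`, and all function names are made up for illustration. It only demonstrates that the XOR/OR form and the `==`/`&&` form return the same result; whether a given gcc version emits the same machine code for both is exactly the open question above and is not something this sketch settles.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the fields compared in the patch. */
struct flow_key {
	uint32_t dst;
	uint32_t src;
	int iif;
};

/* Old style: fold every difference together with XOR/OR and test once. */
bool keys_equal_xor(const struct flow_key *a, const struct flow_key *b)
{
	return ((a->dst ^ b->dst) |
		(a->src ^ b->src) |
		(uint32_t)(a->iif ^ b->iif)) == 0;
}

/* New style: plain comparisons, short-circuited by &&. */
bool keys_equal_cmp(const struct flow_key *a, const struct flow_key *b)
{
	return a->dst == b->dst &&
	       a->src == b->src &&
	       a->iif == b->iif;
}

/* Both forms must agree for any pair of keys. */
void check_equivalence(const struct flow_key *a, const struct flow_key *b)
{
	assert(keys_equal_xor(a, b) == keys_equal_cmp(a, b));
}
```

The XOR/OR form evaluates every field and branches once, while the comparison form may stop at the first mismatch; comparing the output of `gcc -O2 -S` for both functions on the compilers of interest is one way to check the claim in the changelog.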