* Re: [PATCH net-next] tcp: implement RFC 5961 4.2
From: Eric Dumazet @ 2012-07-17 22:21 UTC (permalink / raw)
To: Vijay Subramanian; +Cc: David Miller, netdev, Kiran Kumar Kella
In-Reply-To: <CAGK4HS8MvP6L5Rvy4wJx2KhdTSSHfP7YPT44e9mrV_vsBZJ9jQ@mail.gmail.com>
On Tue, 2012-07-17 at 15:10 -0700, Vijay Subramanian wrote:
> Yes. This is what I had in mind (along with a change to make
> tcp_sequence() return bool). I am not sure if this patch is official
> (or will be picked up by patchwork) but
> for what its worth
>
> Acked-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Well, I am going to send an official patch with all credits ASAP,
Thanks !
^ permalink raw reply
* [PATCH net-next] bonding: refine IFF_XMIT_DST_RELEASE capability
From: Eric Dumazet @ 2012-07-17 22:19 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Jay Vosburgh, Andy Gospodarek, Tom Herbert
From: Eric Dumazet <edumazet@google.com>
Some workloads greatly benefit of IFF_XMIT_DST_RELEASE capability
on output net device, avoiding dirtying dst refcount.
bonding currently disables IFF_XMIT_DST_RELEASE unconditionally.
If all slaves have the IFF_XMIT_DST_RELEASE bit set, then
bonding master can also have it in its priv_flags
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Tom Herbert <therbert@google.com>
---
drivers/net/bonding/bond_main.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 1eb3979..3960b1b 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1382,6 +1382,7 @@ static void bond_compute_features(struct bonding *bond)
netdev_features_t vlan_features = BOND_VLAN_FEATURES;
unsigned short max_hard_header_len = ETH_HLEN;
int i;
+ unsigned int flags, dst_release_flag = IFF_XMIT_DST_RELEASE;
read_lock(&bond->lock);
@@ -1392,6 +1393,7 @@ static void bond_compute_features(struct bonding *bond)
vlan_features = netdev_increment_features(vlan_features,
slave->dev->vlan_features, BOND_VLAN_FEATURES);
+ dst_release_flag &= slave->dev->priv_flags;
if (slave->dev->hard_header_len > max_hard_header_len)
max_hard_header_len = slave->dev->hard_header_len;
}
@@ -1400,6 +1402,9 @@ done:
bond_dev->vlan_features = vlan_features;
bond_dev->hard_header_len = max_hard_header_len;
+ flags = bond_dev->priv_flags & ~IFF_XMIT_DST_RELEASE;
+ bond_dev->priv_flags = flags | dst_release_flag;
+
read_unlock(&bond->lock);
netdev_change_features(bond_dev);
^ permalink raw reply related
* Re: That's pretty much it for 3.5.0
From: David Miller @ 2012-07-17 22:18 UTC (permalink / raw)
To: john.r.fastabend; +Cc: mark.d.rustad, netdev, linux-wireless, netfilter-devel
In-Reply-To: <5005E390.7020706@intel.com>
From: John Fastabend <john.r.fastabend@intel.com>
Date: Tue, 17 Jul 2012 15:13:36 -0700
> Perhaps the easiest way is to check net->count this should be zero
> until setup_net is called.
>
> if (!atomic_read(&init_net.count))
> return ret;
>
Won't work, setup_net() runs via a pure_initcall().
^ permalink raw reply
* Re: That's pretty much it for 3.5.0
From: John Fastabend @ 2012-07-17 22:13 UTC (permalink / raw)
To: David Miller; +Cc: mark.d.rustad, netdev, linux-wireless, netfilter-devel
In-Reply-To: <20120717.140241.1599386555723262095.davem@davemloft.net>
On 7/17/2012 2:02 PM, David Miller wrote:
> From: John Fastabend <john.r.fastabend@intel.com>
> Date: Tue, 17 Jul 2012 13:50:16 -0700
>
>> On 7/17/2012 12:24 PM, David Miller wrote:
>>> From: John Fastabend <john.r.fastabend@intel.com>
>>> Date: Tue, 17 Jul 2012 12:09:53 -0700
>>>
>>>> although we don't have an early_init hook for netprio_cgroup so this
>>>> is probably not correct.
>>>
>>> The dependency is actually on net_dev_init (a subsys_initcall) rather
>>> than a pure_initcall.
>>>
>>> net_dev_init is what registers the netdev_net_ops, which in turn
>>> initializes the netdev list in namespaces such as &init_net
>>>
>>
>> Ah right thanks sorry for the thrash. I guess we need to check if the
>> netdev list in the init_net namespace is initialized.
>
> It's a hack, but we could export and then test dev_boot_phase == 0,
> and if that test is true then skip the init_net device walk in the
> cgroup code.
>
> But I don't like that very much.
>
> The things this code cares about can't even be an issue until
> net_dev_init() runs.
>
> There is a comment warning not to do this in linux/init.h, but we
> could change the module_init() in netprio_cgroup.c to some level which
> runs after subsys_inticall(). When built as a module, linux/init.h
> will translate this into module_init() which is basically the behavior
> we want.
>
Perhaps the easiest way is to check net->count this should be zero
until setup_net is called.
if (!atomic_read(&init_net.count))
return ret;
^ permalink raw reply
* Re: [PATCH net-next] tcp: implement RFC 5961 4.2
From: Vijay Subramanian @ 2012-07-17 22:10 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev, Kiran Kumar Kella
In-Reply-To: <1342560769.2626.1165.camel@edumazet-glaptop>
> But you probably are right, we could test th->syn here as well.
>
> Something like that ?
> - if (!th->rst)
> + if (!th->rst) {
> + if (th->syn)
> + goto syn_challenge;
> tcp_send_dupack(sk, skb);
> + }
> goto discard;
> }
>
> @@ -5327,6 +5330,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
> * RFC 5691 4.2 : Send a challenge ack
> */
> if (th->syn) {
> +syn_challenge:
> if (syn_inerr)
> TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
> NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNCHALLENGE);
>
Yes. This is what I had in mind (along with a change to make
tcp_sequence() return bool). I am not sure if this patch is official
(or will be picked up by patchwork) but
for what its worth
Acked-by: Vijay Subramanian <subramanian.vijay@gmail.com>
I will send a separate patch to make tcp_sequence() return bool.
Thanks!
Vijay
^ permalink raw reply
* Re: [PATCH 0/5] Long term PMTU/redirect storage in ipv4.
From: David Miller @ 2012-07-17 22:09 UTC (permalink / raw)
To: ja; +Cc: netdev
In-Reply-To: <alpine.LFD.2.00.1207180052420.2128@ja.ssi.bg>
From: Julian Anastasov <ja@ssi.bg>
Date: Wed, 18 Jul 2012 01:14:04 +0300 (EEST)
> Aha, I see. Something around fnhe_oldest() and its
> daddr arg does not look good. If the goal is to hijack
> some entry, probably for another daddr and comparing it with
> tcpm_new(), may be we should remove this daddr arg and fully
> reset all parameters such as fnhe_pmtu, fnhe_gw, fnhe_expires
> because the find_or_create_fnhe() callers modify only specific
> fields, we should not end up with wrong gateway inherited from
> another daddr, for example.
Better would be to use a seqlock when reading it's values.
Either way, patches welcome :-)
^ permalink raw reply
* Re: [PATCH 0/5] Long term PMTU/redirect storage in ipv4.
From: Julian Anastasov @ 2012-07-17 22:14 UTC (permalink / raw)
To: David Miller; +Cc: netdev
In-Reply-To: <20120717.134651.562831385960975623.davem@davemloft.net>
Hello,
On Tue, 17 Jul 2012, David Miller wrote:
> > IIRC, struct fib_info was shared by different
> > prefixes. It saves a lot of memory when thousands of
> > routes are created to same GW. Now if we end up with 1 or
> > 2 fib_info structures for default routes, the nh_exceptions list
> > can become very long. May be fib_info is not a good place
> > to hide such data.
>
> Your analysis of what fib_info is and how it's intended to
> work is accurate.
>
> But we don't use a linked list for the exceptions in the final
> version, we use a reclaiming RCU'd hash table like we use for TCP
> metrics.
>
> See the updated version of patch #5 and what I actually committed to
> net-next.
Aha, I see. Something around fnhe_oldest() and its
daddr arg does not look good. If the goal is to hijack
some entry, probably for another daddr and comparing it with
tcpm_new(), may be we should remove this daddr arg and fully
reset all parameters such as fnhe_pmtu, fnhe_gw, fnhe_expires
because the find_or_create_fnhe() callers modify only specific
fields, we should not end up with wrong gateway inherited from
another daddr, for example.
Regards
--
Julian Anastasov <ja@ssi.bg>
^ permalink raw reply
* Re: [PATCH net-next] tcp: implement RFC 5961 4.2
From: Eric Dumazet @ 2012-07-17 21:32 UTC (permalink / raw)
To: Vijay Subramanian; +Cc: David Miller, netdev, Kiran Kumar Kella
In-Reply-To: <CAGK4HS-AMWZ0Ef7G1pvWJ7XADsvxqsiX3tMCe2=oq+_46won6g@mail.gmail.com>
On Tue, 2012-07-17 at 14:02 -0700, Vijay Subramanian wrote:
> On 17 July 2012 04:41, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > Implement the RFC 5691 mitigation against Blind
> > Reset attack using SYN bit.
> >
> > Section 4.2 of RFC 5961 advises to send a Challenge ACK and drop
> > incoming packet, instead of resetting the session.
>
> Eric,
> Section 4.2 has this to say:
> "If the SYN bit is set, irrespective of the sequence number, TCP
> MUST send an ACK (also referred to as challenge ACK) to the remote
> peer:"
>
> I believe your patch only sends challenge acks for in-window SYN packets.
> After this patch, the code for out of window packets is like this:
>
> /* Step 1: check sequence number */
> if (!tcp_sequence(tp, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq)) {
> /* RFC793, page 37: "In all states except SYN-SENT, all reset
> * (RST) segments are validated by checking their SEQ-fields."
> * And page 69: "If an incoming segment is not acceptable,
> * an acknowledgment should be sent in reply (unless the RST
> * bit is set, if so drop the segment and return)".
> */
> if (!th->rst)
> tcp_send_dupack(sk, skb);
> goto discard;
> }
>
>
> For SYN packets that are not in window, we do end up calling
> tcp_send_dupack() but not tcp_send_challenge_ack(). Will it be more
> appropriate to call the latter so that
> we do proper rate limiting of challenge acks and update SNMP counters correctly?
Well, I only wanted to avoid RST ;)
But you probably are right, we could test th->syn here as well.
Something like that ?
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 8aaec55..fdd49f1 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5296,8 +5296,11 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
* an acknowledgment should be sent in reply (unless the RST
* bit is set, if so drop the segment and return)".
*/
- if (!th->rst)
+ if (!th->rst) {
+ if (th->syn)
+ goto syn_challenge;
tcp_send_dupack(sk, skb);
+ }
goto discard;
}
@@ -5327,6 +5330,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
* RFC 5691 4.2 : Send a challenge ack
*/
if (th->syn) {
+syn_challenge:
if (syn_inerr)
TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNCHALLENGE);
^ permalink raw reply related
* Re: [PATCH] cipso: don't follow a NULL pointer when setsockopt() is called
From: David Miller @ 2012-07-17 21:31 UTC (permalink / raw)
To: pmoore; +Cc: netdev
In-Reply-To: <16005876.GSA3ToWriD@sifl>
From: Paul Moore <pmoore@redhat.com>
Date: Tue, 17 Jul 2012 17:24:50 -0400
> David, if you don't queue this up for them, let me know and I'll resend it to
> stable once it hits Linus' tree.
I will be sure to queue it up to -stable when I apply it, as I always
do for appropriate patches.
^ permalink raw reply
* Re: [PATCH] cipso: don't follow a NULL pointer when setsockopt() is called
From: Paul Moore @ 2012-07-17 21:24 UTC (permalink / raw)
To: netdev
In-Reply-To: <20120717210738.22790.23522.stgit@sifl>
On Tuesday, July 17, 2012 05:07:47 PM Paul Moore wrote:
> As reported by Alan Cox, and verified by Lin Ming, when a user
> attempts to add a CIPSO option to a socket using the CIPSO_V4_TAG_LOCAL
> tag the kernel dies a terrible death when it attempts to follow a NULL
> pointer (the skb argument to cipso_v4_validate() is NULL when called via
> the setsockopt() syscall).
>
> This patch fixes this by first checking to ensure that the skb is
> non-NULL before using it to find the incoming network interface. In
> the unlikely case where the skb is NULL and the user attempts to add
> a CIPSO option with the _TAG_LOCAL tag we return an error as this is
> not something we want to allow.
...
> CC: Lin Ming <mlin@ss.pku.edu.cn>
> Reported-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
> Signed-off-by: Paul Moore <pmoore@redhat.com>
Argh, I just realized I forgot to CC the stable folks.
David, if you don't queue this up for them, let me know and I'll resend it to
stable once it hits Linus' tree.
> ---
> net/ipv4/cipso_ipv4.c | 6 ++++--
> 1 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
> index c48adc5..667c1d4 100644
> --- a/net/ipv4/cipso_ipv4.c
> +++ b/net/ipv4/cipso_ipv4.c
> @@ -1725,8 +1725,10 @@ int cipso_v4_validate(const struct sk_buff *skb,
> unsigned char **option) case CIPSO_V4_TAG_LOCAL:
> /* This is a non-standard tag that we only allow for
> * local connections, so if the incoming interface is
> - * not the loopback device drop the packet. */
> - if (!(skb->dev->flags & IFF_LOOPBACK)) {
> + * not the loopback device drop the packet. Further,
> + * there is no legitimate reason for setting this from
> + * userspace so reject it if skb is NULL. */
> + if (skb == NULL || !(skb->dev->flags & IFF_LOOPBACK)) {
> err_offset = opt_iter;
> goto validate_return_locked;
> }
--
paul moore
security and virtualization @ redhat
^ permalink raw reply
* [PATCH net-next] net: qmi_wwan: make dynamic device IDs work
From: Bjørn Mork @ 2012-07-17 21:14 UTC (permalink / raw)
To: netdev-u79uwXL29TY76Z2rM5mHXA
Cc: linux-usb-u79uwXL29TY76Z2rM5mHXA, Bjørn Mork
The usbnet API use the device ID table to store a pointer to
a minidriver. Setting a generic pointer for dynamic device
IDs will in most cases make them work as expected. usbnet
will otherwise treat the dynamic IDs as blacklisted. That is
rarely useful.
There is no standard class describing devices supported by
this driver, and most vendors don't even provide enough
information to allow vendor specific wildcard matching. The
result is that most of the supported devices must be
explicitly listed in the device table. Allowing dynamic IDs
to work both simplifies testing and verification of new
devices, and provides a way for end users to use a device
before the ID is added to the driver.
Signed-off-by: Bjørn Mork <bjorn-yOkvZcmFvRU@public.gmane.org>
---
I gave up on the generic solution to this problem. Different
drivers will have different needs, and even the different usbnet
minidrivers are very different. Some do mostly class based
wildcard matching and rarely need an ID table update, while
others, like qmi_wwan, will need frequent updates as new devices
hit the market.
My conclusion was that it was best to add support like this
in the few drivers actually needing it. Each driver will have
to make a policy decision wrt which data they will use for
dynamic devices. It just can't be done wholesale.
drivers/net/usb/qmi_wwan.c | 19 ++++++++++++++++++-
1 file changed, 18 insertions(+), 1 deletion(-)
diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index bc0469e..2ea126a 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -609,10 +609,27 @@ static const struct usb_device_id products[] = {
};
MODULE_DEVICE_TABLE(usb, products);
+static int qmi_wwan_probe(struct usb_interface *intf, const struct usb_device_id *prod)
+{
+ struct usb_device_id *id = (struct usb_device_id *)prod;
+
+ /* Workaround to enable dynamic IDs. This disables usbnet
+ * blacklisting functionality. Which, if required, can be
+ * reimplemented here by using a magic "blacklist" value
+ * instead of 0 in the static device id table
+ */
+ if (!id->driver_info) {
+ dev_dbg(&intf->dev, "setting defaults for dynamic device id\n");
+ id->driver_info = (unsigned long)&qmi_wwan_shared;
+ }
+
+ return usbnet_probe(intf, id);
+}
+
static struct usb_driver qmi_wwan_driver = {
.name = "qmi_wwan",
.id_table = products,
- .probe = usbnet_probe,
+ .probe = qmi_wwan_probe,
.disconnect = usbnet_disconnect,
.suspend = qmi_wwan_suspend,
.resume = qmi_wwan_resume,
--
1.7.10.4
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* Re: New commands to configure IOV features
From: David Miller @ 2012-07-17 21:11 UTC (permalink / raw)
To: chris.friesen
Cc: ddutile, yuvalmin, bhutchings, gregory.v.rose, netdev, linux-pci
In-Reply-To: <5005D45D.1040302@genband.com>
From: Chris Friesen <chris.friesen@genband.com>
Date: Tue, 17 Jul 2012 15:08:45 -0600
> From that perspective a sysfs-based interface is ideal since it is
> directly scriptable.
As is anything ethtool or netlink based, since we have 'ethtool'
and 'ip' for scripting.
^ permalink raw reply
* Re: New commands to configure IOV features
From: Chris Friesen @ 2012-07-17 21:08 UTC (permalink / raw)
To: Don Dutile
Cc: Yuval Mintz, davem@davemloft.net, Ben Hutchings, Greg Rose,
netdev@vger.kernel.org, linux-pci
In-Reply-To: <5005BD00.4090106@redhat.com>
On 07/17/2012 01:29 PM, Don Dutile wrote:
> WRT SRIOV-nic devices, the thinking goes that protocol-level
> parameters associated with VFs should use protocol-specific interfaces,
> e.g., ethtool, ip link set, etc. for Ethernet VFs.
> Thus, the various protocol control functions/tools should
> be used to control VF parameters, as one would for a physical device
> of that protocol/class.
It seems to me that the mere act of creating one or more VFs is
something generic, applicable to all devices that are capable of it.
The details of configuring those VFs can then be handled by
protocol-specific interfaces.
I'm not too worried about the exact mechanism of doing it, as long as
it's ultimately scriptable--that is, if it's a C API then it would be
appreciated if there was a standard tool to call that implements it.
From that perspective a sysfs-based interface is ideal since it is
directly scriptable.
Chris
^ permalink raw reply
* [PATCH] cipso: don't follow a NULL pointer when setsockopt() is called
From: Paul Moore @ 2012-07-17 21:07 UTC (permalink / raw)
To: netdev
As reported by Alan Cox, and verified by Lin Ming, when a user
attempts to add a CIPSO option to a socket using the CIPSO_V4_TAG_LOCAL
tag the kernel dies a terrible death when it attempts to follow a NULL
pointer (the skb argument to cipso_v4_validate() is NULL when called via
the setsockopt() syscall).
This patch fixes this by first checking to ensure that the skb is
non-NULL before using it to find the incoming network interface. In
the unlikely case where the skb is NULL and the user attempts to add
a CIPSO option with the _TAG_LOCAL tag we return an error as this is
not something we want to allow.
A simple reproducer, kindly supplied by Lin Ming, although you must
have the CIPSO DOI #3 configure on the system first or you will be
caught early in cipso_v4_validate():
#include <sys/types.h>
#include <sys/socket.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <string.h>
struct local_tag {
char type;
char length;
char info[4];
};
struct cipso {
char type;
char length;
char doi[4];
struct local_tag local;
};
int main(int argc, char **argv)
{
int sockfd;
struct cipso cipso = {
.type = IPOPT_CIPSO,
.length = sizeof(struct cipso),
.local = {
.type = 128,
.length = sizeof(struct local_tag),
},
};
memset(cipso.doi, 0, 4);
cipso.doi[3] = 3;
sockfd = socket(AF_INET, SOCK_DGRAM, 0);
#define SOL_IP 0
setsockopt(sockfd, SOL_IP, IP_OPTIONS,
&cipso, sizeof(struct cipso));
return 0;
}
CC: Lin Ming <mlin@ss.pku.edu.cn>
Reported-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Paul Moore <pmoore@redhat.com>
---
net/ipv4/cipso_ipv4.c | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
index c48adc5..667c1d4 100644
--- a/net/ipv4/cipso_ipv4.c
+++ b/net/ipv4/cipso_ipv4.c
@@ -1725,8 +1725,10 @@ int cipso_v4_validate(const struct sk_buff *skb, unsigned char **option)
case CIPSO_V4_TAG_LOCAL:
/* This is a non-standard tag that we only allow for
* local connections, so if the incoming interface is
- * not the loopback device drop the packet. */
- if (!(skb->dev->flags & IFF_LOOPBACK)) {
+ * not the loopback device drop the packet. Further,
+ * there is no legitimate reason for setting this from
+ * userspace so reject it if skb is NULL. */
+ if (skb == NULL || !(skb->dev->flags & IFF_LOOPBACK)) {
err_offset = opt_iter;
goto validate_return_locked;
}
^ permalink raw reply related
* Re: [PATCH net-next] tcp: implement RFC 5961 4.2
From: Vijay Subramanian @ 2012-07-17 21:02 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev, Kiran Kumar Kella
In-Reply-To: <1342525290.2626.459.camel@edumazet-glaptop>
On 17 July 2012 04:41, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Implement the RFC 5691 mitigation against Blind
> Reset attack using SYN bit.
>
> Section 4.2 of RFC 5961 advises to send a Challenge ACK and drop
> incoming packet, instead of resetting the session.
Eric,
Section 4.2 has this to say:
"If the SYN bit is set, irrespective of the sequence number, TCP
MUST send an ACK (also referred to as challenge ACK) to the remote
peer:"
I believe your patch only sends challenge acks for in-window SYN packets.
After this patch, the code for out of window packets is like this:
/* Step 1: check sequence number */
if (!tcp_sequence(tp, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq)) {
/* RFC793, page 37: "In all states except SYN-SENT, all reset
* (RST) segments are validated by checking their SEQ-fields."
* And page 69: "If an incoming segment is not acceptable,
* an acknowledgment should be sent in reply (unless the RST
* bit is set, if so drop the segment and return)".
*/
if (!th->rst)
tcp_send_dupack(sk, skb);
goto discard;
}
For SYN packets that are not in window, we do end up calling
tcp_send_dupack() but not tcp_send_challenge_ack(). Will it be more
appropriate to call the latter so that
we do proper rate limiting of challenge acks and update SNMP counters correctly?
Thanks,
Vijay
^ permalink raw reply
* Re: That's pretty much it for 3.5.0
From: David Miller @ 2012-07-17 21:02 UTC (permalink / raw)
To: john.r.fastabend; +Cc: mark.d.rustad, netdev, linux-wireless, netfilter-devel
In-Reply-To: <5005D008.6060103@intel.com>
From: John Fastabend <john.r.fastabend@intel.com>
Date: Tue, 17 Jul 2012 13:50:16 -0700
> On 7/17/2012 12:24 PM, David Miller wrote:
>> From: John Fastabend <john.r.fastabend@intel.com>
>> Date: Tue, 17 Jul 2012 12:09:53 -0700
>>
>>> although we don't have an early_init hook for netprio_cgroup so this
>>> is probably not correct.
>>
>> The dependency is actually on net_dev_init (a subsys_initcall) rather
>> than a pure_initcall.
>>
>> net_dev_init is what registers the netdev_net_ops, which in turn
>> initializes the netdev list in namespaces such as &init_net
>>
>
> Ah right thanks sorry for the thrash. I guess we need to check if the
> netdev list in the init_net namespace is initialized.
It's a hack, but we could export and then test dev_boot_phase == 0,
and if that test is true then skip the init_net device walk in the
cgroup code.
But I don't like that very much.
The things this code cares about can't even be an issue until
net_dev_init() runs.
There is a comment warning not to do this in linux/init.h, but we
could change the module_init() in netprio_cgroup.c to some level which
runs after subsys_inticall(). When built as a module, linux/init.h
will translate this into module_init() which is basically the behavior
we want.
^ permalink raw reply
* Re: That's pretty much it for 3.5.0
From: John Fastabend @ 2012-07-17 20:50 UTC (permalink / raw)
To: David Miller; +Cc: mark.d.rustad, netdev, linux-wireless, netfilter-devel
In-Reply-To: <20120717.122459.2240133900020140698.davem@davemloft.net>
On 7/17/2012 12:24 PM, David Miller wrote:
> From: John Fastabend <john.r.fastabend@intel.com>
> Date: Tue, 17 Jul 2012 12:09:53 -0700
>
>> although we don't have an early_init hook for netprio_cgroup so this
>> is probably not correct.
>
> The dependency is actually on net_dev_init (a subsys_initcall) rather
> than a pure_initcall.
>
> net_dev_init is what registers the netdev_net_ops, which in turn
> initializes the netdev list in namespaces such as &init_net
>
Ah right thanks sorry for the thrash. I guess we need to check if the
netdev list in the init_net namespace is initialized.
^ permalink raw reply
* Re: [PATCH net-next] ipv4: fix rcu splat
From: David Miller @ 2012-07-17 20:49 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
In-Reply-To: <1342557733.2626.1103.camel@edumazet-glaptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 17 Jul 2012 22:42:13 +0200
> From: Eric Dumazet <edumazet@google.com>
>
> free_nh_exceptions() should use rcu_dereference_protected(..., 1)
> since its called after one RCU grace period.
>
> Also add some const-ification in recent code.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Applied, thanks Eric.
^ permalink raw reply
* Re: [PATCH v4] net: cgroup: fix access the unallocated memory in netprio cgroup
From: John Fastabend @ 2012-07-17 20:47 UTC (permalink / raw)
To: Gao feng
Cc: nhorman, eric.dumazet, linux-kernel, netdev, davem, Eric Dumazet,
Rustad, Mark D
In-Reply-To: <1342079415-9631-1-git-send-email-gaofeng@cn.fujitsu.com>
On 7/12/2012 12:50 AM, Gao feng wrote:
> there are some out of bound accesses in netprio cgroup.
>
> now before accessing the dev->priomap.priomap array,we only check
> if the dev->priomap exist.and because we don't want to see
> additional bound checkings in fast path, so we should make sure
> that dev->priomap is null or array size of dev->priomap.priomap
> is equal to max_prioidx + 1;
>
> so in write_priomap logic,we should call extend_netdev_table when
> dev->priomap is null and dev->priomap.priomap_len < max_len.
> and in cgrp_create->update_netdev_tables logic,we should call
> extend_netdev_table only when dev->priomap exist and
> dev->priomap.priomap_len < max_len.
>
> and it's not needed to call update_netdev_tables in write_priomap,
> we can only allocate the net device's priomap which we change through
> net_prio.ifpriomap.
>
> this patch also add a return value for update_netdev_tables &
> extend_netdev_table, so when new_priomap is allocated failed,
> write_priomap will stop to access the priomap,and return -ENOMEM
> back to the userspace to tell the user what happend.
>
> Change From v3:
> 1. add rtnl protect when reading max_prioidx in write_priomap.
>
> 2. only call extend_netdev_table when map->priomap_len < max_len,
> this will make sure array size of dev->map->priomap always
> bigger than any prioidx.
>
> 3. add a function write_update_netdev_table to make codes clear.
>
> Change From v2:
> 1. protect extend_netdev_table by RTNL.
> 2. when extend_netdev_table failed,call dev_put to reduce device's refcount.
>
> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
> Cc: Neil Horman <nhorman@tuxdriver.com>
> Cc: Eric Dumazet <edumazet@google.com>
> ---
> net/core/netprio_cgroup.c | 71 ++++++++++++++++++++++++++++++++++-----------
> 1 files changed, 54 insertions(+), 17 deletions(-)
>
[...]
> +
> +static int update_netdev_tables(void)
> +{
> + int ret = 0;
> struct net_device *dev;
> - u32 max_len = atomic_read(&max_prioidx) + 1;
> + u32 max_len;
> struct netprio_map *map;
need to check if net subsystem is initialized before we try
to use it here...
if (some_check) -> need to lookup what this check is
return ret;
>
> rtnl_lock();
> + max_len = atomic_read(&max_prioidx) + 1;
> for_each_netdev(&init_net, dev) {
> map = rtnl_dereference(dev->priomap);
> - if ((!map) ||
> - (map->priomap_len < max_len))
> - extend_netdev_table(dev, max_len);
> + /*
> + * don't allocate priomap if we didn't
> + * change net_prio.ifpriomap (map == NULL),
> + * this will speed up skb_update_prio.
> + */
> + if (map && map->priomap_len < max_len) {
> + ret = extend_netdev_table(dev, max_len);
> + if (ret < 0)
> + break;
> + }
> }
> rtnl_unlock();
> + return ret;
> }
>
> static struct cgroup_subsys_state *cgrp_create(struct cgroup *cgrp)
> {
> struct cgroup_netprio_state *cs;
> - int ret;
> + int ret = -EINVAL;
>
> cs = kzalloc(sizeof(*cs), GFP_KERNEL);
> if (!cs)
> return ERR_PTR(-ENOMEM);
>
> - if (cgrp->parent && cgrp_netprio_state(cgrp->parent)->prioidx) {
> - kfree(cs);
> - return ERR_PTR(-EINVAL);
> - }
> + if (cgrp->parent && cgrp_netprio_state(cgrp->parent)->prioidx)
> + goto out;
>
> ret = get_prioidx(&cs->prioidx);
> - if (ret != 0) {
> + if (ret < 0) {
> pr_warn("No space in priority index array\n");
> - kfree(cs);
> - return ERR_PTR(ret);
> + goto out;
> + }
> +
> + ret = update_netdev_tables();
> + if (ret < 0) {
> + put_prioidx(cs->prioidx);
> + goto out;
> }
Gao,
This introduces a null ptr dereference when netprio_cgroup is built
into the kernel because update_netdev_tables() depends on init_net.
However cgrp_create is being called by cgroup_init before
do_initcalls() is called and before net_dev_init().
.John
^ permalink raw reply
* Re: [PATCH 0/5] Long term PMTU/redirect storage in ipv4.
From: David Miller @ 2012-07-17 20:46 UTC (permalink / raw)
To: ja; +Cc: netdev
In-Reply-To: <alpine.LFD.2.00.1207172249270.1831@ja.ssi.bg>
From: Julian Anastasov <ja@ssi.bg>
Date: Tue, 17 Jul 2012 23:41:16 +0300 (EEST)
> IIRC, struct fib_info was shared by different
> prefixes. It saves a lot of memory when thousands of
> routes are created to same GW. Now if we end up with 1 or
> 2 fib_info structures for default routes, the nh_exceptions list
> can become very long. May be fib_info is not a good place
> to hide such data.
Your analysis of what fib_info is and how it's intended to
work is accurate.
But we don't use a linked list for the exceptions in the final
version, we use a reclaiming RCU'd hash table like we use for TCP
metrics.
See the updated version of patch #5 and what I actually committed to
net-next.
^ permalink raw reply
* [PATCH net-next] ipv4: fix rcu splat
From: Eric Dumazet @ 2012-07-17 20:42 UTC (permalink / raw)
To: David Miller; +Cc: netdev
From: Eric Dumazet <edumazet@google.com>
free_nh_exceptions() should use rcu_dereference_protected(..., 1)
since its called after one RCU grace period.
Also add some const-ification in recent code.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/ipv4/fib_semantics.c | 4 ++--
net/ipv4/inet_connection_sock.c | 4 ++--
net/ipv4/route.c | 13 +++++++------
3 files changed, 11 insertions(+), 10 deletions(-)
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 1e09852..2b57d76 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -148,11 +148,11 @@ static void free_nh_exceptions(struct fib_nh *nh)
for (i = 0; i < FNHE_HASH_SIZE; i++) {
struct fib_nh_exception *fnhe;
- fnhe = rcu_dereference(hash[i].chain);
+ fnhe = rcu_dereference_protected(hash[i].chain, 1);
while (fnhe) {
struct fib_nh_exception *next;
- next = rcu_dereference(fnhe->fnhe_next);
+ next = rcu_dereference_protected(fnhe->fnhe_next, 1);
kfree(fnhe);
fnhe = next;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 89e74a3..68bb5a6 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -806,8 +806,8 @@ EXPORT_SYMBOL_GPL(inet_csk_compat_setsockopt);
static struct dst_entry *inet_csk_rebuild_route(struct sock *sk, struct flowi *fl)
{
- struct inet_sock *inet = inet_sk(sk);
- struct ip_options_rcu *inet_opt;
+ const struct inet_sock *inet = inet_sk(sk);
+ const struct ip_options_rcu *inet_opt;
__be32 daddr = inet->inet_daddr;
struct flowi4 *fl4;
struct rtable *rt;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 812e444..f67e702 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1275,7 +1275,7 @@ static void rt_del(unsigned int hash, struct rtable *rt)
spin_unlock_bh(rt_hash_lock_addr(hash));
}
-static void __build_flow_key(struct flowi4 *fl4, struct sock *sk,
+static void __build_flow_key(struct flowi4 *fl4, const struct sock *sk,
const struct iphdr *iph,
int oif, u8 tos,
u8 prot, u32 mark, int flow_flags)
@@ -1294,7 +1294,8 @@ static void __build_flow_key(struct flowi4 *fl4, struct sock *sk,
iph->daddr, iph->saddr, 0, 0);
}
-static void build_skb_flow_key(struct flowi4 *fl4, struct sk_buff *skb, struct sock *sk)
+static void build_skb_flow_key(struct flowi4 *fl4, const struct sk_buff *skb,
+ const struct sock *sk)
{
const struct iphdr *iph = ip_hdr(skb);
int oif = skb->dev->ifindex;
@@ -1305,10 +1306,10 @@ static void build_skb_flow_key(struct flowi4 *fl4, struct sk_buff *skb, struct s
__build_flow_key(fl4, sk, iph, oif, tos, prot, mark, 0);
}
-static void build_sk_flow_key(struct flowi4 *fl4, struct sock *sk)
+static void build_sk_flow_key(struct flowi4 *fl4, const struct sock *sk)
{
const struct inet_sock *inet = inet_sk(sk);
- struct ip_options_rcu *inet_opt;
+ const struct ip_options_rcu *inet_opt;
__be32 daddr = inet->inet_daddr;
rcu_read_lock();
@@ -1323,8 +1324,8 @@ static void build_sk_flow_key(struct flowi4 *fl4, struct sock *sk)
rcu_read_unlock();
}
-static void ip_rt_build_flow_key(struct flowi4 *fl4, struct sock *sk,
- struct sk_buff *skb)
+static void ip_rt_build_flow_key(struct flowi4 *fl4, const struct sock *sk,
+ const struct sk_buff *skb)
{
if (skb)
build_skb_flow_key(fl4, skb, sk);
^ permalink raw reply related
* Re: [net-next PATCH 01/02] net/ipv4: VTI support rx-path hook in xfrm4_mode_tunnel.
From: Joe Perches @ 2012-07-17 20:36 UTC (permalink / raw)
To: Saurabh; +Cc: netdev
In-Reply-To: <20120717194449.GA3350@debian-saurabh-64.vyatta.com>
On Tue, 2012-07-17 at 12:44 -0700, Saurabh wrote:
> Incorporated David and Steffen's comments.
> Add hook for rx-path xfmr4_mode_tunnel for VTI tunnel module.
[]
> diff --git a/net/ipv4/xfrm4_mode_tunnel.c b/net/ipv4/xfrm4_mode_tunnel.c
[]
> +int xfrm4_mode_tunnel_input_register(struct xfrm_tunnel *handler)
> +{
> + struct xfrm_tunnel __rcu **pprev;
> + struct xfrm_tunnel *t;
> + int ret = -EEXIST;
> + int priority = handler->priority;
> +
> + mutex_lock(&xfrm4_mode_tunnel_input_mutex);
> +
> + for (pprev = &rcv_notify_handlers;
> + (t = rcu_dereference_protected(*pprev,
> + lockdep_is_held(&xfrm4_mode_tunnel_input_mutex))) != NULL;
> + pprev = &t->next) {
> + if (t->priority > priority)
> + break;
> + if (t->priority == priority)
> + goto err;
> +
> + }
> +
> + handler->next = *pprev;
> + rcu_assign_pointer(*pprev, handler);
> +
> + ret = 0;
> +
> +err:
> + mutex_unlock(&xfrm4_mode_tunnel_input_mutex);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(xfrm4_mode_tunnel_input_register);
Isn't the multiple indirection of **pprev unnecessary?
Perhaps something like this is simpler and easier to read?
int xfrm4_mode_tunnel_input_register(struct xfrm_tunnel *handler)
{
struct xfrm_tunnel __rcu *prev;
struct xfrm_tunnel *t;
int ret = -EEXIST;
int priority = handler->priority;
mutex_lock(&xfrm4_mode_tunnel_input_mutex);
prev = rcv_notify_handlers;
while ((t = rcu_dereference_protected(prev,
lockdep_is_held(&xfrm4_mode_tunnel_input_mutex))) {
if (t->priority > priority)
break;
if (t->priority == priority)
goto err;
prev = t->next;
}
handler->next = prev;
rcu_assign_pointer(prev, handler);
ret = 0;
err:
mutex_unlock(&xfrm4_mode_tunnel_input_mutex);
return ret;
}
^ permalink raw reply
* Re: [PATCH 0/5] Long term PMTU/redirect storage in ipv4.
From: Julian Anastasov @ 2012-07-17 20:41 UTC (permalink / raw)
To: David Miller; +Cc: netdev
In-Reply-To: <20120717.061418.1893307699868826531.davem@davemloft.net>
Hello,
On Tue, 17 Jul 2012, David Miller wrote:
> These patches implement the final mechanism necessary to really allow
> us to go without the route cache in ipv4.
>
> We need a place to have long-term storage of PMTU/redirect information
> which is independent of the routes themselves, yet does not get us
> back into a situation where we have to write to metrics or anything
> like that.
>
> For this we use an "next-hop exception" table in the FIB nexthops.
>
> Currently it is a simple linked list and uses a single global lock
> for synchronization, but that can be easily adjusted as-needed.
>
> The one thing I desperately want to avoid is having to create clone
> routes in the FIB trie for this purpose, because that is very
> expensive. However, I'm willing to entertain such an idea later
> if this current scheme proves to have downsides that the FIB trie
> variant would not have.
IIRC, struct fib_info was shared by different
prefixes. It saves a lot of memory when thousands of
routes are created to same GW. Now if we end up with 1 or
2 fib_info structures for default routes, the nh_exceptions list
can become very long. May be fib_info is not a good place
to hide such data.
Regards
--
Julian Anastasov <ja@ssi.bg>
^ permalink raw reply
* Re: [PATCH] ipv4: Fix nexthop exception hash computation.
From: Eric Dumazet @ 2012-07-17 20:33 UTC (permalink / raw)
To: David Miller; +Cc: netdev
In-Reply-To: <20120717.132350.1202093690532763592.davem@davemloft.net>
On Tue, 2012-07-17 at 13:23 -0700, David Miller wrote:
> Need to mask it with (FNHE_HASH_SIZE - 1).
>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---
OK, I have a small patch too, sending in a minute.
^ permalink raw reply
* [PATCH] ipv4: Fix nexthop exception hash computation.
From: David Miller @ 2012-07-17 20:23 UTC (permalink / raw)
To: netdev
Need to mask it with (FNHE_HASH_SIZE - 1).
Signed-off-by: David S. Miller <davem@davemloft.net>
---
net/ipv4/route.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a5bd0b4..812e444 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1347,6 +1347,16 @@ static struct fib_nh_exception *fnhe_oldest(struct fnhe_hash_bucket *hash, __be3
return oldest;
}
+static inline u32 fnhe_hashfun(__be32 daddr)
+{
+ u32 hval;
+
+ hval = (__force u32) daddr;
+ hval ^= (hval >> 11) ^ (hval >> 22);
+
+ return hval & (FNHE_HASH_SIZE - 1);
+}
+
static struct fib_nh_exception *find_or_create_fnhe(struct fib_nh *nh, __be32 daddr)
{
struct fnhe_hash_bucket *hash = nh->nh_exceptions;
@@ -1361,8 +1371,7 @@ static struct fib_nh_exception *find_or_create_fnhe(struct fib_nh *nh, __be32 da
return NULL;
}
- hval = (__force u32) daddr;
- hval ^= (hval >> 11) ^ (hval >> 22);
+ hval = fnhe_hashfun(daddr);
hash += hval;
depth = 0;
@@ -1890,8 +1899,7 @@ static void rt_bind_exception(struct rtable *rt, struct fib_nh *nh, __be32 daddr
struct fib_nh_exception *fnhe;
u32 hval;
- hval = (__force u32) daddr;
- hval ^= (hval >> 11) ^ (hval >> 22);
+ hval = fnhe_hashfun(daddr);
for (fnhe = rcu_dereference(hash[hval].chain); fnhe;
fnhe = rcu_dereference(fnhe->fnhe_next)) {
--
1.7.10.4
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox