* Re: [PATCH 0/5] Candidate fix for increased number of GFP_ATOMIC failures V2
From: Mel Gorman @ 2009-10-27 13:27 UTC (permalink / raw)
To: Sven Geggus
Cc: Frans Pop, Jiri Kosina, Karol Lewandowski, Tobias Oetiker,
Rafael J. Wysocki, David Miller, Reinette Chatre, Kalle Valo,
David Rientjes, KOSAKI Motohiro, Mohamed Abbas, Jens Axboe,
John W. Linville, Pekka Enberg, Bartlomiej Zolnierkiewicz,
Greg Kroah-Hartman, Stephan von Krawczynski, Kernel Testers List,
netdev, linux-kernel, linux-mm@kvack.org
In-Reply-To: <20091024140207.GA5102@geggus.net>
On Sat, Oct 24, 2009 at 04:02:09PM +0200, Sven Geggus wrote:
> Mel Gorman schrieb am Donnerstag, den 22. Oktober um 16:22 Uhr:
>
> > Test 1: Verify your problem occurs on 2.6.32-rc5 if you can
>
> Problem persists. RAID resync in progress :(
>
What about the rest of the patches, any luck?
Thanks
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply
* Re: phy_connect_direct problem
From: Kristoffer Glembo @ 2009-10-27 13:10 UTC (permalink / raw)
To: netdev
In-Reply-To: <4AE6CB6A.5050600@gaisler.com>
Hi again,
Please forget about this. I'm obviously blind as I can't see the
difference between adjust_state and adjust_link ...
Best regards,
Kristoffer Glembo
Kristoffer Glembo wrote:
> Hi,
>
> I'm in the process of converting an ethernet driver to using the PHY
> layer, but ran into a problem where my adjust_state handler was never
> called and the link did not go up ...
>
> I use phy_connect to connect the PHY device with my ethernet driver. In
> phy_connect_direct I found the following code:
>
> phy_prepare_link(phydev, handler);
> phy_start_machine(phydev, NULL);
>
> So first the adjust_state is set to handler by phy_prepare_link and then
> it is overwritten with the NULL parameter passed to phy_start_machine.
>
> Maybe I'm abusing the PHY layer somehow since there are plenty of other
> drivers using this interface. However the patch below solved the issue
> for me.
>
> Best regards,
> Kristoffer Glembo
>
>
> ---
> drivers/net/phy/phy_device.c | 3 +--
> 1 files changed, 1 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> index b10fedd..d27ca80 100644
> --- a/drivers/net/phy/phy_device.c
> +++ b/drivers/net/phy/phy_device.c
> @@ -311,8 +311,7 @@ int phy_connect_direct(struct net_device *dev,
> struct phy_device *phydev,
> if (rc)
> return rc;
>
> - phy_prepare_link(phydev, handler);
> - phy_start_machine(phydev, NULL);
> + phy_start_machine(phydev, handler);
> if (phydev->irq > 0)
> phy_start_interrupts(phydev);
>
^ permalink raw reply
* [net-next-2.6 PATCH v4 3/3] TCPCT part 1c: initial SYN exchange with SYNACK data
From: William Allen Simpson @ 2009-10-27 12:29 UTC (permalink / raw)
To: Linux Kernel Network Developers
In-Reply-To: <4AE6E35C.2050101@gmail.com>
[-- Attachment #1: Type: text/plain, Size: 2272 bytes --]
This is a significantly revised implementation of an earlier (year-old)
patch that no longer applies cleanly, with permission of the original
author (Adam Langley). That patch was previously reviewed:
http://thread.gmane.org/gmane.linux.network/102586
The principle difference is using a TCP option to carry the cookie nonce,
instead of a user configured offset in the data. This is more flexible and
less subject to user configuration error. Such a cookie option has been
suggested for many years, and is also useful without SYN data, allowing
several related concepts to use the same extension option.
"Re: SYN floods (was: does history repeat itself?)", September 9, 1996.
http://www.merit.net/mail.archives/nanog/1996-09/msg00235.html
"Re: what a new TCP header might look like", May 12, 1998.
ftp://ftp.isi.edu/end2end/end2end-interest-1998.mail
Data structures are carefully composed to require minimal additions.
For example, the struct tcp_options_received cookie_plus variable fits
between existing 16-bit and 8-bit variables, requiring no additional
space (taking alignment into consideration). There are no additions to
tcp_request_sock, and only 1 pointer and 1 flag byte in tcp_sock.
Allocations have been rearranged to avoid requiring GFP_ATOMIC, with
only one unavoidable exception in tcp_create_openreq_child(), where the
tcp_sock itself is created GFP_ATOMIC.
These functions will also be used in subsequent patches that implement
additional features.
Requires:
TCPCT part 1a: add request_values parameter for sending SYNACK
TCPCT part 1b: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS, functions
Signed-off-by: William.Allen.Simpson@gmail.com
---
include/linux/tcp.h | 34 ++++++-
include/net/tcp.h | 67 +++++++++++++-
net/ipv4/syncookies.c | 5 +-
net/ipv4/tcp.c | 128 +++++++++++++++++++++++++-
net/ipv4/tcp_input.c | 84 +++++++++++++++--
net/ipv4/tcp_ipv4.c | 62 +++++++++++--
net/ipv4/tcp_minisocks.c | 43 +++++++--
net/ipv4/tcp_output.c | 227 +++++++++++++++++++++++++++++++++++++++++++---
net/ipv6/syncookies.c | 5 +-
net/ipv6/tcp_ipv6.c | 47 +++++++++-
10 files changed, 639 insertions(+), 63 deletions(-)
[-- Attachment #2: TCPCT+1c4.patch --]
[-- Type: text/plain, Size: 39173 bytes --]
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 6fd59d1..5da3ef4 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -257,31 +257,36 @@ struct tcp_options_received {
sack_ok : 4, /* SACK seen on SYN packet */
snd_wscale : 4, /* Window scaling received from sender */
rcv_wscale : 4; /* Window scaling to send to receiver */
-/* SACKs data */
+ u8 cookie_plus:6; /* bytes in authenticator/cookie option */
u8 num_sacks; /* Number of SACK blocks */
- u16 user_mss; /* mss requested by user in ioctl */
+ u16 user_mss; /* mss requested by user in ioctl */
u16 mss_clamp; /* Maximal mss, negotiated at connection setup */
};
static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
{
- rx_opt->tstamp_ok = rx_opt->sack_ok = rx_opt->wscale_ok = rx_opt->snd_wscale = 0;
+ rx_opt->tstamp_ok = rx_opt->sack_ok = 0;
+ rx_opt->wscale_ok = rx_opt->snd_wscale = 0;
+ rx_opt->cookie_plus = 0;
}
/* This is the max number of SACKS that we'll generate and process. It's safe
- * to increse this, although since:
+ * to increase this, although since:
* size = TCPOLEN_SACK_BASE_ALIGNED (4) + n * TCPOLEN_SACK_PERBLOCK (8)
* only four options will fit in a standard TCP header */
#define TCP_NUM_SACKS 4
+struct tcp_cookie_values;
+struct tcp_request_sock_ops;
+
struct tcp_request_sock {
struct inet_request_sock req;
#ifdef CONFIG_TCP_MD5SIG
/* Only used by TCP MD5 Signature so far. */
const struct tcp_request_sock_ops *af_specific;
#endif
- u32 rcv_isn;
- u32 snt_isn;
+ u32 rcv_isn;
+ u32 snt_isn;
};
static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
@@ -451,6 +456,19 @@ struct tcp_sock {
/* TCP MD5 Signature Option information */
struct tcp_md5sig_info *md5sig_info;
#endif
+
+ /* When the cookie options are generated and exchanged, then this
+ * object holds a reference to them (cookie_values->kref). Also
+ * contains related tcp_cookie_transactions fields.
+ */
+ struct tcp_cookie_values *cookie_values;
+
+ u8 cookie_in_always:1,
+ cookie_out_never:1,
+ extend_timestamp:1,
+ s_data_constant:1,
+ s_data_in:1,
+ s_data_out:1;
};
static inline struct tcp_sock *tcp_sk(const struct sock *sk)
@@ -469,6 +487,10 @@ struct tcp_timewait_sock {
u16 tw_md5_keylen;
u8 tw_md5_key[TCP_MD5SIG_MAXKEYLEN];
#endif
+ /* Few sockets in timewait have cookies; in that case, then this
+ * object holds a reference to it (tw_cookie_values->kref)
+ */
+ struct tcp_cookie_values *tw_cookie_values;
};
static inline struct tcp_timewait_sock *tcp_twsk(const struct sock *sk)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 142f32e..51b7426 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -30,6 +30,7 @@
#include <linux/dmaengine.h>
#include <linux/crypto.h>
#include <linux/cryptohash.h>
+#include <linux/kref.h>
#include <net/inet_connection_sock.h>
#include <net/inet_timewait_sock.h>
@@ -167,6 +168,7 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
#define TCPOPT_SACK 5 /* SACK Block */
#define TCPOPT_TIMESTAMP 8 /* Better RTT estimations/PAWS */
#define TCPOPT_MD5SIG 19 /* MD5 Signature (RFC2385) */
+#define TCPOPT_COOKIE 253 /* Cookie extension (experimental) */
/*
* TCP option lengths
@@ -177,6 +179,10 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
#define TCPOLEN_SACK_PERM 2
#define TCPOLEN_TIMESTAMP 10
#define TCPOLEN_MD5SIG 18
+#define TCPOLEN_COOKIE_BASE 2 /* Cookie-less header extension */
+#define TCPOLEN_COOKIE_PAIR 3 /* Cookie pair header extension */
+#define TCPOLEN_COOKIE_MAX (TCPOLEN_COOKIE_BASE+TCP_COOKIE_MAX)
+#define TCPOLEN_COOKIE_MIN (TCPOLEN_COOKIE_BASE+TCP_COOKIE_MIN)
/* But this is what stacks really send out. */
#define TCPOLEN_TSTAMP_ALIGNED 12
@@ -405,7 +411,7 @@ extern int tcp_recvmsg(struct kiocb *iocb, struct sock *sk,
extern void tcp_parse_options(struct sk_buff *skb,
struct tcp_options_received *opt_rx,
- int estab);
+ u8 **cryptic, int estab);
extern u8 *tcp_parse_md5sig_option(struct tcphdr *th);
@@ -1477,6 +1483,65 @@ struct tcp_request_sock_ops {
#endif
};
+/**
+ * A tcp_sock contains a pointer to the current value, and this is cloned to
+ * the tcp_timewait_sock.
+ *
+ * @cookie_pair: variable data from the option exchange.
+ *
+ * @cookie_desired: user specified tcpct_cookie_desired. Zero
+ * indicates default (sysctl_tcp_cookie_size).
+ * After cookie sent, remembers size of cookie.
+ *
+ * @s_data_desired: user specified tcpct_s_data_desired. When the
+ * constant payload is specified (s_data_constant),
+ * holds its length instead.
+ *
+ * @s_data_payload: constant data that is to be included in the
+ * payload of SYN or SYNACK segments when the
+ * cookie option is present.
+ */
+struct tcp_cookie_values {
+ struct kref kref;
+ u8 cookie_pair[TCP_COOKIE_PAIR_SIZE];
+ u8 cookie_pair_size;
+ u8 cookie_desired;
+ u16 s_data_desired;
+ u8 s_data_payload[0];
+};
+
+static inline void tcp_cookie_values_release(struct kref *kref)
+{
+ kfree(container_of(kref, struct tcp_cookie_values, kref));
+}
+
+/* The length of constant payload data. Note that s_data_desired is
+ * overloaded, depending on s_data_constant: either the length of constant
+ * data (returned here) or the limit on variable data.
+ */
+static inline int tcp_s_data_size(const struct tcp_sock *tp)
+{
+ return (NULL != tp->cookie_values && tp->s_data_constant)
+ ? tp->cookie_values->s_data_desired
+ : 0;
+}
+
+/* As tcp_request_sock has already been extended in other places, the
+ * only remaining method is to pass stack values along as function
+ * parameters. These parameters are not needed after sending SYNACK.
+ */
+struct tcp_extend_values {
+ u8 cookie_bakery[TCP_COOKIE_MAX];
+ u8 cookie_plus;
+ u8 cookie_in_always:1,
+ cookie_out_never:1;
+};
+
+static inline struct tcp_extend_values *tcp_xv(const struct request_values *rvp)
+{
+ return (struct tcp_extend_values *)rvp;
+}
+
extern void tcp_v4_init(void);
extern void tcp_init(void);
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 5ec678a..cdab491 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -253,6 +253,8 @@ EXPORT_SYMBOL(cookie_check_timestamp);
struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
struct ip_options *opt)
{
+ struct tcp_options_received tcp_opt;
+ u8 *cryptic_value;
struct inet_request_sock *ireq;
struct tcp_request_sock *treq;
struct tcp_sock *tp = tcp_sk(sk);
@@ -263,7 +265,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
int mss;
struct rtable *rt;
__u8 rcv_wscale;
- struct tcp_options_received tcp_opt;
if (!sysctl_tcp_syncookies || !th->ack)
goto out;
@@ -278,7 +279,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
/* check for timestamp cookie support */
memset(&tcp_opt, 0, sizeof(tcp_opt));
- tcp_parse_options(skb, &tcp_opt, 0);
+ tcp_parse_options(skb, &tcp_opt, &cryptic_value, 0);
if (tcp_opt.saw_tstamp)
cookie_check_timestamp(&tcp_opt);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 206a291..12409df 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2039,8 +2039,8 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
int val;
int err = 0;
- /* This is a string value all the others are int's */
- if (optname == TCP_CONGESTION) {
+ /* These are data/string values, all the others are ints */
+ if (TCP_CONGESTION == optname) {
char name[TCP_CA_NAME_MAX];
if (optlen < 1)
@@ -2056,6 +2056,92 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
err = tcp_set_congestion_control(sk, name);
release_sock(sk);
return err;
+ } else if (TCP_COOKIE_TRANSACTIONS == optname) {
+ struct tcp_cookie_transactions ctd;
+ struct tcp_cookie_values *cvp = NULL;
+
+ if (sizeof(ctd) > optlen)
+ return -EINVAL;
+ if (copy_from_user(&ctd, optval, sizeof(ctd)))
+ return -EFAULT;
+ if (sizeof(ctd.tcpct_value) < ctd.tcpct_used)
+ return -EINVAL;
+
+ if (0 == ctd.tcpct_cookie_desired) {
+ /* default to global value */
+ } else if ((0x1 & ctd.tcpct_cookie_desired)
+ || TCP_COOKIE_MAX < ctd.tcpct_cookie_desired
+ || TCP_COOKIE_MIN > ctd.tcpct_cookie_desired) {
+ return -EINVAL;
+ }
+
+ if (TCP_COOKIE_OUT_NEVER & ctd.tcpct_flags) {
+ /* Supercedes all other values */
+ lock_sock(sk);
+ if (NULL != tp->cookie_values) {
+ kref_put(&tp->cookie_values->kref,
+ tcp_cookie_values_release);
+ tp->cookie_values = NULL;
+ }
+ tp->cookie_in_always = 0; /* false */
+ tp->cookie_out_never = 1; /* true */
+ tp->extend_timestamp = 0; /* false */
+ tp->s_data_constant = 0; /* false */
+ tp->s_data_in = 0; /* false */
+ tp->s_data_out = 0; /* false */
+ release_sock(sk);
+ return err;
+ }
+
+ /* Allocate ancillary memory before locking.
+ */
+ if (0 < ctd.tcpct_used
+ || (NULL == tp->cookie_values
+ && (0 < sysctl_tcp_cookie_size
+ || 0 < ctd.tcpct_cookie_desired
+ || 0 < ctd.tcpct_s_data_desired))) {
+ cvp = kmalloc(sizeof(*cvp) + ctd.tcpct_used,
+ GFP_KERNEL);
+ if (NULL == cvp)
+ return -ENOMEM;
+ }
+
+ lock_sock(sk);
+ tp->cookie_in_always = (TCP_COOKIE_IN_ALWAYS & ctd.tcpct_flags);
+ tp->cookie_out_never = 0; /* false */
+ tp->extend_timestamp = (TCP_EXTEND_TIMESTAMP & ctd.tcpct_flags);
+ tp->s_data_in = 0; /* false */
+ tp->s_data_out = 0; /* false */
+
+ if (NULL == cvp) {
+ /* No cookies by default. */
+ tp->s_data_constant = 0; /* false */
+ } else if (0 == ctd.tcpct_used) {
+ /* No constant payload data. */
+ cvp->cookie_desired = ctd.tcpct_cookie_desired;
+ cvp->s_data_desired = ctd.tcpct_s_data_desired;
+ tp->cookie_values = cvp;
+ tp->s_data_constant = 0; /* false */
+ } else {
+ /* Changes in values are recorded by a change in
+ * pointer, ensuring that the cookie will differ,
+ * without separately hashing each value later.
+ */
+ if (unlikely(NULL != tp->cookie_values)) {
+ kref_put(&tp->cookie_values->kref,
+ tcp_cookie_values_release);
+ }
+ kref_init(&cvp->kref);
+ memcpy(cvp->s_data_payload, ctd.tcpct_value,
+ ctd.tcpct_used);
+ cvp->cookie_desired = ctd.tcpct_cookie_desired;
+ cvp->s_data_desired = ctd.tcpct_used;
+ tp->cookie_values = cvp;
+ tp->s_data_constant = 1; /* true */
+ }
+
+ release_sock(sk);
+ return err;
}
if (optlen < sizeof(int))
@@ -2387,6 +2473,44 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
if (copy_to_user(optval, icsk->icsk_ca_ops->name, len))
return -EFAULT;
return 0;
+
+ case TCP_COOKIE_TRANSACTIONS: {
+ struct tcp_cookie_transactions ctd;
+ struct tcp_cookie_values *cvp = tp->cookie_values;
+
+ if (get_user(len, optlen))
+ return -EFAULT;
+ if (len < sizeof(ctd))
+ return -EINVAL;
+
+ memset(&ctd, 0, sizeof(ctd));
+ ctd.tcpct_flags =
+ (tp->cookie_in_always ? TCP_COOKIE_IN_ALWAYS : 0)
+ | (tp->cookie_out_never ? TCP_COOKIE_OUT_NEVER : 0)
+ | (tp->extend_timestamp ? TCP_EXTEND_TIMESTAMP : 0)
+ | (tp->s_data_in ? TCP_S_DATA_IN : 0)
+ | (tp->s_data_out ? TCP_S_DATA_OUT : 0);
+
+ if (NULL != cvp) {
+ /* Cookie(s) saved, return as nonce */
+ if (sizeof(ctd.tcpct_value) < cvp->cookie_pair_size) {
+ /* impossible? */
+ return -EINVAL;
+ }
+ memcpy(&ctd.tcpct_value[0], &cvp->cookie_pair[0],
+ cvp->cookie_pair_size);
+ ctd.tcpct_used = cvp->cookie_pair_size;
+
+ ctd.tcpct_cookie_desired = cvp->cookie_desired;
+ ctd.tcpct_s_data_desired = cvp->s_data_desired;
+ }
+
+ if (put_user(sizeof(ctd), optlen))
+ return -EFAULT;
+ if (copy_to_user(optval, &ctd, sizeof(ctd)))
+ return -EFAULT;
+ return 0;
+ }
default:
return -ENOPROTOOPT;
}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d86784b..b2a2da1 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3698,11 +3698,11 @@ old_ack:
* the fast version below fails.
*/
void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
- int estab)
+ u8 **cryptic, int estab)
{
unsigned char *ptr;
struct tcphdr *th = tcp_hdr(skb);
- int length = (th->doff * 4) - sizeof(struct tcphdr);
+ int length = tcp_option_len_th(th);
ptr = (unsigned char *)(th + 1);
opt_rx->saw_tstamp = 0;
@@ -3782,6 +3782,19 @@ void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
*/
break;
#endif
+ case TCPOPT_COOKIE:
+ /* This option carries 3 different lengths.
+ */
+ if (TCPOLEN_COOKIE_MAX >= opsize
+ && TCPOLEN_COOKIE_MIN <= opsize) {
+ opt_rx->cookie_plus = opsize;
+ *cryptic = ptr;
+ } else if (TCPOLEN_COOKIE_PAIR == opsize) {
+ /* not yet implemented */
+ } else if (TCPOLEN_COOKIE_BASE == opsize) {
+ /* not yet implemented */
+ }
+ break;
}
ptr += opsize-2;
@@ -3810,17 +3823,21 @@ static int tcp_parse_aligned_timestamp(struct tcp_sock *tp, struct tcphdr *th)
* If it is wrong it falls back on tcp_parse_options().
*/
static int tcp_fast_parse_options(struct sk_buff *skb, struct tcphdr *th,
- struct tcp_sock *tp)
+ struct tcp_sock *tp, u8 **cryptic)
{
- if (th->doff == sizeof(struct tcphdr) >> 2) {
+ /* In the spirit of fast parsing, compare doff directly to shifted
+ * constant values. Because equality is used, short doff can be
+ * ignored here, and checked later.
+ */
+ if ((sizeof(*th) >> 2) == th->doff) {
tp->rx_opt.saw_tstamp = 0;
return 0;
} else if (tp->rx_opt.tstamp_ok &&
- th->doff == (sizeof(struct tcphdr)>>2)+(TCPOLEN_TSTAMP_ALIGNED>>2)) {
+ ((sizeof(*th)+TCPOLEN_TSTAMP_ALIGNED)>>2) == th->doff) {
if (tcp_parse_aligned_timestamp(tp, th))
return 1;
}
- tcp_parse_options(skb, &tp->rx_opt, 1);
+ tcp_parse_options(skb, &tp->rx_opt, cryptic, 1);
return 1;
}
@@ -3830,7 +3847,7 @@ static int tcp_fast_parse_options(struct sk_buff *skb, struct tcphdr *th,
*/
u8 *tcp_parse_md5sig_option(struct tcphdr *th)
{
- int length = (th->doff << 2) - sizeof (*th);
+ int length = tcp_option_len_th(th);
u8 *ptr = (u8*)(th + 1);
/* If the TCP option is too short, we can short cut */
@@ -5070,10 +5087,11 @@ out:
static int tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
struct tcphdr *th, int syn_inerr)
{
+ u8 *cv;
struct tcp_sock *tp = tcp_sk(sk);
/* RFC1323: H1. Apply PAWS check first. */
- if (tcp_fast_parse_options(skb, th, tp) && tp->rx_opt.saw_tstamp &&
+ if (tcp_fast_parse_options(skb, th, tp, &cv) && tp->rx_opt.saw_tstamp &&
tcp_paws_discard(sk, skb)) {
if (!th->rst) {
NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSESTABREJECTED);
@@ -5361,11 +5379,14 @@ discard:
static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
struct tcphdr *th, unsigned len)
{
- struct tcp_sock *tp = tcp_sk(sk);
+ u8 *cryptic_value;
struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+ struct tcp_cookie_values *cvp = tp->cookie_values;
int saved_clamp = tp->rx_opt.mss_clamp;
+ int queued = 0;
- tcp_parse_options(skb, &tp->rx_opt, 0);
+ tcp_parse_options(skb, &tp->rx_opt, &cryptic_value, 0);
if (th->ack) {
/* rfc793:
@@ -5462,6 +5483,44 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
* Change state from SYN-SENT only after copied_seq
* is initialized. */
tp->copied_seq = tp->rcv_nxt;
+
+ if (NULL != cvp
+ && 0 < cvp->cookie_pair_size
+ && 0 < tp->rx_opt.cookie_plus) {
+ int cookie_size = tp->rx_opt.cookie_plus
+ - TCPOLEN_COOKIE_BASE;
+ int cookie_pair_size = cookie_size
+ + cvp->cookie_desired;
+
+ /* A cookie extension option was sent and returned.
+ * Note that each incoming SYNACK replaces the
+ * Responder cookie. The initial exchange is most
+ * fragile, as protection against spoofing relies
+ * entirely upon the sequence and timestamp (above).
+ * This replacement strategy allows the correct pair to
+ * pass through, while any others will be filtered via
+ * Responder verification later.
+ */
+ if (sizeof(cvp->cookie_pair) >= cookie_pair_size) {
+ memcpy(&cvp->cookie_pair[cvp->cookie_desired],
+ cryptic_value, cookie_size);
+ cvp->cookie_pair_size = cookie_pair_size;
+ }
+
+ if (tcp_header_len_th(th) < skb->len) {
+ /* Queue incoming transaction data. */
+ __skb_pull(skb, tcp_header_len_th(th));
+ __skb_queue_tail(&sk->sk_receive_queue, skb);
+ skb_set_owner_r(skb, sk);
+ sk->sk_data_ready(sk, 0);
+ tp->s_data_in = 1; /* true */
+ queued = 1; /* should be amount? */
+ tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
+ tp->rcv_wup = TCP_SKB_CB(skb)->end_seq;
+ tp->copied_seq = TCP_SKB_CB(skb)->seq + 1;
+ }
+ }
+
smp_mb();
tcp_set_state(sk, TCP_ESTABLISHED);
@@ -5513,11 +5572,14 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
TCP_DELACK_MAX, TCP_RTO_MAX);
discard:
- __kfree_skb(skb);
+ if (0 == queued)
+ __kfree_skb(skb);
return 0;
} else {
tcp_send_ack(sk);
}
+ if (0 < queued)
+ return 0; /* amount queued? */
return -1;
}
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 6edc5e2..1569be9 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -217,7 +217,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
if (inet->opt)
inet_csk(sk)->icsk_ext_hdr_len = inet->opt->optlen;
- tp->rx_opt.mss_clamp = 536;
+ tp->rx_opt.mss_clamp = TCP_MIN_RCVMSS;
/* Socket identity is still unknown (sport may be zero).
* However we set state to SYN-SENT and not releasing socket
@@ -1213,9 +1213,12 @@ static struct timewait_sock_ops tcp_timewait_sock_ops = {
int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
{
- struct inet_request_sock *ireq;
+ struct tcp_extend_values tmp_ext;
struct tcp_options_received tmp_opt;
+ u8 *cryptic_value;
+ struct inet_request_sock *ireq;
struct request_sock *req;
+ struct tcp_sock *tp = tcp_sk(sk);
__be32 saddr = ip_hdr(skb)->saddr;
__be32 daddr = ip_hdr(skb)->daddr;
__u32 isn = TCP_SKB_CB(skb)->when;
@@ -1260,16 +1263,37 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
#endif
tcp_clear_options(&tmp_opt);
- tmp_opt.mss_clamp = 536;
- tmp_opt.user_mss = tcp_sk(sk)->rx_opt.user_mss;
+ tmp_opt.mss_clamp = TCP_MIN_RCVMSS;
+ tmp_opt.user_mss = tp->rx_opt.user_mss;
- tcp_parse_options(skb, &tmp_opt, 0);
+ tcp_parse_options(skb, &tmp_opt, &cryptic_value, 0);
+
+ if (0 < tmp_opt.cookie_plus
+ && tmp_opt.saw_tstamp
+ && !tp->cookie_out_never
+ && (0 < sysctl_tcp_cookie_size
+ || (NULL != tp->cookie_values
+ && 0 < tp->cookie_values->cookie_desired))) {
+#ifdef CONFIG_SYN_COOKIES
+ want_cookie = 0; /* not our kind of cookie */
+#endif
+ tmp_ext.cookie_out_never = 0; /* false */
+ tmp_ext.cookie_plus = tmp_opt.cookie_plus;
+
+ /* secret recipe not yet implemented */
+ } else if (!tp->cookie_in_always) {
+ /* redundant indications, but ensure initialization. */
+ tmp_ext.cookie_out_never = 1; /* true */
+ tmp_ext.cookie_plus = 0;
+ } else {
+ goto drop_and_free;
+ }
+ tmp_ext.cookie_in_always = tp->cookie_in_always;
if (want_cookie && !tmp_opt.saw_tstamp)
tcp_clear_options(&tmp_opt);
tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
-
tcp_openreq_init(req, &tmp_opt, skb);
ireq = inet_rsk(req);
@@ -1336,7 +1360,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
}
tcp_rsk(req)->snt_isn = isn;
- if (__tcp_v4_send_synack(sk, req, NULL, dst) ||
+ if (__tcp_v4_send_synack(sk, req, (struct request_values *)&tmp_ext, dst) ||
want_cookie)
goto drop_and_free;
@@ -1815,7 +1839,7 @@ static int tcp_v4_init_sock(struct sock *sk)
*/
tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
tp->snd_cwnd_clamp = ~0;
- tp->mss_cache = 536;
+ tp->mss_cache = TCP_MIN_RCVMSS;
tp->reordering = sysctl_tcp_reordering;
icsk->icsk_ca_ops = &tcp_init_congestion_ops;
@@ -1831,6 +1855,19 @@ static int tcp_v4_init_sock(struct sock *sk)
tp->af_specific = &tcp_sock_ipv4_specific;
#endif
+ /* TCP Cookie Transactions */
+ if (0 < sysctl_tcp_cookie_size) {
+ /* Default, cookies without s_data. */
+ tp->cookie_values =
+ kzalloc(sizeof(*tp->cookie_values),
+ sk->sk_allocation);
+ if (NULL != tp->cookie_values)
+ kref_init(&tp->cookie_values->kref);
+ }
+ /* Presumed zeroed, in order of appearance:
+ * cookie_in_always, cookie_out_never, extend_timestamp,
+ * s_data_constant, s_data_in, s_data_out
+ */
sk->sk_sndbuf = sysctl_tcp_wmem[1];
sk->sk_rcvbuf = sysctl_tcp_rmem[1];
@@ -1884,6 +1921,15 @@ void tcp_v4_destroy_sock(struct sock *sk)
sk->sk_sndmsg_page = NULL;
}
+ /*
+ * If cookie or s_data exists, remove it.
+ */
+ if (NULL != tp->cookie_values) {
+ kref_put(&tp->cookie_values->kref,
+ tcp_cookie_values_release);
+ tp->cookie_values = NULL;
+ }
+
percpu_counter_dec(&tcp_sockets_allocated);
}
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 8819882..785e5f4 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -96,13 +96,14 @@ enum tcp_tw_status
tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
const struct tcphdr *th)
{
- struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw);
struct tcp_options_received tmp_opt;
+ u8 *cryptic_value;
+ struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw);
int paws_reject = 0;
tmp_opt.saw_tstamp = 0;
if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
- tcp_parse_options(skb, &tmp_opt, 0);
+ tcp_parse_options(skb, &tmp_opt, &cryptic_value, 0);
if (tmp_opt.saw_tstamp) {
tmp_opt.ts_recent = tcptw->tw_ts_recent;
@@ -394,9 +395,12 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
/* Now setup tcp_sock */
newtp = tcp_sk(newsk);
newtp->pred_flags = 0;
- newtp->rcv_wup = newtp->copied_seq = newtp->rcv_nxt = treq->rcv_isn + 1;
- newtp->snd_sml = newtp->snd_una = newtp->snd_nxt = treq->snt_isn + 1;
- newtp->snd_up = treq->snt_isn + 1;
+
+ newtp->rcv_wup = newtp->copied_seq =
+ newtp->rcv_nxt = treq->rcv_isn + 1;
+
+ newtp->snd_sml = newtp->snd_una = newtp->snd_nxt =
+ newtp->snd_up = treq->snt_isn + 1 + tcp_s_data_size(tcp_sk(sk));
tcp_prequeue_init(newtp);
@@ -429,9 +433,24 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
tcp_set_ca_state(newsk, TCP_CA_Open);
tcp_init_xmit_timers(newsk);
skb_queue_head_init(&newtp->out_of_order_queue);
- newtp->write_seq = treq->snt_isn + 1;
- newtp->pushed_seq = newtp->write_seq;
+ newtp->write_seq = newtp->pushed_seq =
+ treq->snt_isn + 1 + tcp_s_data_size(tcp_sk(sk));
+ /* TCP Cookie Transactions */
+ if (NULL != tcp_sk(sk)->cookie_values) {
+ /* Instead of reusing the original, replace with
+ * default, cookies without s_data.
+ */
+ newtp->cookie_values =
+ kzalloc(sizeof(*newtp->cookie_values),
+ GFP_ATOMIC);
+ if (NULL != newtp->cookie_values)
+ kref_init(&newtp->cookie_values->kref);
+ }
+ /* Presumed copied, in order of appearance:
+ * cookie_in_always, cookie_out_never, extend_timestamp,
+ * s_data_constant, s_data_in, s_data_out
+ */
newtp->rx_opt.saw_tstamp = 0;
newtp->rx_opt.dsack = 0;
@@ -495,15 +514,16 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
struct request_sock *req,
struct request_sock **prev)
{
+ struct tcp_options_received tmp_opt;
+ u8 *cryptic_value;
const struct tcphdr *th = tcp_hdr(skb);
__be32 flg = tcp_flag_word(th) & (TCP_FLAG_RST|TCP_FLAG_SYN|TCP_FLAG_ACK);
int paws_reject = 0;
- struct tcp_options_received tmp_opt;
struct sock *child;
tmp_opt.saw_tstamp = 0;
- if (th->doff > (sizeof(struct tcphdr)>>2)) {
- tcp_parse_options(skb, &tmp_opt, 0);
+ if (th->doff > (sizeof(*th) >> 2)) {
+ tcp_parse_options(skb, &tmp_opt, &cryptic_value, 0);
if (tmp_opt.saw_tstamp) {
tmp_opt.ts_recent = req->ts_recent;
@@ -596,7 +616,8 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
* Invalid ACK: reset will be sent by listening socket
*/
if ((flg & TCP_FLAG_ACK) &&
- (TCP_SKB_CB(skb)->ack_seq != tcp_rsk(req)->snt_isn + 1))
+ (TCP_SKB_CB(skb)->ack_seq != tcp_rsk(req)->snt_isn + 1 +
+ tcp_s_data_size(tcp_sk(sk))))
return sk;
/* Also, it would be not so bad idea to check rcv_tsecr, which
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index bcfbe41..9a901eb 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -370,6 +370,7 @@ static inline int tcp_urg_mode(const struct tcp_sock *tp)
#define OPTION_TS (1 << 1)
#define OPTION_MD5 (1 << 2)
#define OPTION_WSCALE (1 << 3)
+#define OPTION_COOKIE_EXTENSION (1 << 4)
struct tcp_out_options {
u8 options; /* bit field of OPTION_* */
@@ -377,8 +378,37 @@ struct tcp_out_options {
u8 num_sack_blocks; /* number of SACK blocks to include */
u16 mss; /* 0 to disable */
__u32 tsval, tsecr; /* need to include OPTION_TS */
+ u8 *cookie_copy; /* temporary pointer */
+ u8 cookie_size; /* bytes in copy */
};
+/* The sysctl int routines are generic, so check consistency here.
+ */
+static u8 tcp_cookie_size_check(u8 desired)
+{
+ if (0 < desired) {
+ /* previously specified */
+ return desired;
+ }
+ if (0 >= sysctl_tcp_cookie_size) {
+ /* no default specified */
+ return 0;
+ }
+ if (TCP_COOKIE_MIN > sysctl_tcp_cookie_size) {
+ /* value too small, increase to minimum */
+ return TCP_COOKIE_MIN;
+ }
+ if (TCP_COOKIE_MAX < sysctl_tcp_cookie_size) {
+ /* value too large, decrease to maximum */
+ return TCP_COOKIE_MAX;
+ }
+ if (0x1 & sysctl_tcp_cookie_size) {
+ /* 8-bit multiple, illegal, fix it */
+ return (u8)(sysctl_tcp_cookie_size + 0x1);
+ }
+ return (u8)sysctl_tcp_cookie_size;
+}
+
/* Write previously computed TCP options to the packet.
*
* Beware: Something in the Internet is very sensitive to the ordering of
@@ -395,11 +425,22 @@ struct tcp_out_options {
static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
const struct tcp_out_options *opts,
__u8 **md5_hash) {
- if (unlikely(OPTION_MD5 & opts->options)) {
- *ptr++ = htonl((TCPOPT_NOP << 24) |
- (TCPOPT_NOP << 16) |
- (TCPOPT_MD5SIG << 8) |
- TCPOLEN_MD5SIG);
+ u8 options = opts->options; /* mungable copy */
+
+ if (unlikely(OPTION_MD5 & options)) {
+ if (unlikely(OPTION_COOKIE_EXTENSION & options)) {
+ *ptr++ = htonl((TCPOPT_COOKIE << 24) |
+ (TCPOLEN_COOKIE_BASE << 16) |
+ (TCPOPT_MD5SIG << 8) |
+ TCPOLEN_MD5SIG);
+ } else {
+ *ptr++ = htonl((TCPOPT_NOP << 24) |
+ (TCPOPT_NOP << 16) |
+ (TCPOPT_MD5SIG << 8) |
+ TCPOLEN_MD5SIG);
+ }
+ /* larger cookies are incompatible */
+ options &= ~OPTION_COOKIE_EXTENSION;
*md5_hash = (__u8 *)ptr;
ptr += 4;
} else {
@@ -412,12 +453,13 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
opts->mss);
}
- if (likely(OPTION_TS & opts->options)) {
- if (unlikely(OPTION_SACK_ADVERTISE & opts->options)) {
+ if (likely(OPTION_TS & options)) {
+ if (unlikely(OPTION_SACK_ADVERTISE & options)) {
*ptr++ = htonl((TCPOPT_SACK_PERM << 24) |
(TCPOLEN_SACK_PERM << 16) |
(TCPOPT_TIMESTAMP << 8) |
TCPOLEN_TIMESTAMP);
+ options &= ~OPTION_SACK_ADVERTISE;
} else {
*ptr++ = htonl((TCPOPT_NOP << 24) |
(TCPOPT_NOP << 16) |
@@ -428,15 +470,48 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
*ptr++ = htonl(opts->tsecr);
}
- if (unlikely(OPTION_SACK_ADVERTISE & opts->options &&
- !(OPTION_TS & opts->options))) {
+ /* Specification requires after timestamp, so do it now.
+ */
+ if (unlikely(OPTION_COOKIE_EXTENSION & options)) {
+ u8 *cookie_copy = opts->cookie_copy;
+ u8 cookie_size = opts->cookie_size;
+
+ if (unlikely(0x1 & cookie_size)) {
+ /* 8-bit multiple, illegal, ignore */
+ cookie_size = 0;
+ } else if (likely(0x2 & cookie_size)) {
+ __u8 *p = (__u8 *)ptr;
+
+ /* 16-bit multiple */
+ *p++ = TCPOPT_COOKIE;
+ *p++ = TCPOLEN_COOKIE_BASE + cookie_size;
+ *p++ = *cookie_copy++;
+ *p++ = *cookie_copy++;
+ ptr++;
+ cookie_size -= 2;
+ } else {
+ /* 32-bit multiple */
+ *ptr++ = htonl(((TCPOPT_NOP << 24) |
+ (TCPOPT_NOP << 16) |
+ (TCPOPT_COOKIE << 8) |
+ TCPOLEN_COOKIE_BASE) +
+ cookie_size);
+ }
+
+ if (0 < cookie_size) {
+ memcpy(ptr, cookie_copy, cookie_size);
+ ptr += (cookie_size >> 2);
+ }
+ }
+
+ if (unlikely(OPTION_SACK_ADVERTISE & options)) {
*ptr++ = htonl((TCPOPT_NOP << 24) |
(TCPOPT_NOP << 16) |
(TCPOPT_SACK_PERM << 8) |
TCPOLEN_SACK_PERM);
}
- if (unlikely(OPTION_WSCALE & opts->options)) {
+ if (unlikely(OPTION_WSCALE & options)) {
*ptr++ = htonl((TCPOPT_NOP << 24) |
(TCPOPT_WINDOW << 16) |
(TCPOLEN_WINDOW << 8) |
@@ -471,11 +546,19 @@ static unsigned tcp_syn_options(struct sock *sk, struct sk_buff *skb,
struct tcp_out_options *opts,
struct tcp_md5sig_key **md5) {
struct tcp_sock *tp = tcp_sk(sk);
+ struct tcp_cookie_values *cvp = tp->cookie_values;
unsigned size = 0;
+ u8 cookie_size = (!tp->cookie_out_never && NULL != cvp)
+ ? tcp_cookie_size_check(cvp->cookie_desired)
+ : 0;
#ifdef CONFIG_TCP_MD5SIG
*md5 = tp->af_specific->md5_lookup(sk, sk);
if (*md5) {
+ if (0 < cookie_size) {
+ /* cookie-less extension */
+ opts->options |= OPTION_COOKIE_EXTENSION;
+ }
opts->options |= OPTION_MD5;
size += TCPOLEN_MD5SIG_ALIGNED;
}
@@ -512,6 +595,63 @@ static unsigned tcp_syn_options(struct sock *sk, struct sk_buff *skb,
size += TCPOLEN_SACKPERM_ALIGNED;
}
+ /* Having both authentication and cookies for security is redundant,
+ * and there's certainly not enough room. Instead, the cookie-less
+ * variant is proposed above.
+ *
+ * Consider the pessimal case with authentication. The options
+ * could look like:
+ * COOKIE|MD5(20) + MSS(4) + WSCALE(4) + SACK|TS(12) == 40
+ *
+ * (Currently, the timestamps && *MD5 test above prevents this.)
+ *
+ * Note that timestamps are required by the specification.
+ *
+ * Odd numbers of bytes are prohibited by the specification, ensuring
+ * that the cookie is 16-bit aligned, and the resulting cookie pair is
+ * 32-bit aligned.
+ */
+ if (NULL == *md5
+ && (OPTION_TS & opts->options)
+ && 0 < cookie_size) {
+ int need = TCPOLEN_COOKIE_BASE + cookie_size;
+ int remaining = MAX_TCP_OPTION_SPACE - size;
+
+ if (0x2 & need) {
+ /* 32-bit multiple */
+ need += 2; /* NOPs */
+
+ if (need > remaining) {
+ /* try shrinking cookie to fit */
+ cookie_size -= 2;
+ need -= 4;
+ }
+ }
+ while (need > remaining && TCP_COOKIE_MIN <= cookie_size) {
+ cookie_size -= 4;
+ need -= 4;
+ }
+ if (TCP_COOKIE_MIN <= cookie_size) {
+ opts->options |= OPTION_COOKIE_EXTENSION;
+ opts->cookie_copy = &cvp->cookie_pair[0];
+ opts->cookie_size = cookie_size;
+
+ /* Remember for future incarnations. */
+ cvp->cookie_desired = cookie_size;
+
+ if (cvp->cookie_desired != cvp->cookie_pair_size) {
+ /* Currently use random bytes as a nonce,
+ * assuming these are completely unpredictable
+ * by hostile users of the same system.
+ */
+ get_random_bytes(opts->cookie_copy,
+ cookie_size);
+ cvp->cookie_pair_size = cookie_size;
+ }
+
+ size += need;
+ }
+ }
return size;
}
@@ -520,14 +660,23 @@ static unsigned tcp_synack_options(struct sock *sk,
struct request_sock *req,
unsigned mss, struct sk_buff *skb,
struct tcp_out_options *opts,
- struct tcp_md5sig_key **md5) {
- unsigned size = 0;
+ struct tcp_md5sig_key **md5,
+ struct tcp_extend_values *xvp)
+{
struct inet_request_sock *ireq = inet_rsk(req);
+ unsigned size = 0;
+ u8 cookie_plus = (NULL != xvp && !xvp->cookie_out_never)
+ ? xvp->cookie_plus
+ : 0;
char doing_ts;
#ifdef CONFIG_TCP_MD5SIG
*md5 = tcp_rsk(req)->af_specific->md5_lookup(sk, req);
if (*md5) {
+ if (0 < cookie_plus) {
+ /* cookie-less extension */
+ opts->options |= OPTION_COOKIE_EXTENSION;
+ }
opts->options |= OPTION_MD5;
size += TCPOLEN_MD5SIG_ALIGNED;
}
@@ -561,6 +710,34 @@ static unsigned tcp_synack_options(struct sock *sk,
size += TCPOLEN_SACKPERM_ALIGNED;
}
+ /* Similar rationale to tcp_syn_options() applies here, too.
+ * If the <SYN> options fit, the same options should fit now!
+ */
+ if (NULL == *md5
+ && doing_ts
+ && 0 < cookie_plus) {
+ int need = cookie_plus; /* has TCPOLEN_COOKIE_BASE */
+ int remaining = MAX_TCP_OPTION_SPACE - size;
+
+ if (0x2 & need) {
+ /* 32-bit multiple */
+ need += 2; /* NOPs */
+ }
+ if (need <= remaining) {
+ opts->options |= OPTION_COOKIE_EXTENSION;
+ opts->cookie_copy = &xvp->cookie_bakery[0];
+ opts->cookie_size = cookie_plus - TCPOLEN_COOKIE_BASE;
+
+ /* secret recipe not yet implemented */
+ get_random_bytes(opts->cookie_copy,
+ opts->cookie_size);
+
+ size += need;
+ } else {
+ /* There's no error return, so flag it. */
+ xvp->cookie_out_never = 1; /* true */
+ }
+ }
return size;
}
@@ -2230,14 +2407,15 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
struct request_values *rvp,
struct request_sock *req)
{
+ struct tcp_out_options opts;
+ struct tcp_extend_values *xvp = tcp_xv(rvp);
struct inet_request_sock *ireq = inet_rsk(req);
struct tcp_sock *tp = tcp_sk(sk);
struct tcphdr *th;
- int tcp_header_size;
- struct tcp_out_options opts;
struct sk_buff *skb;
struct tcp_md5sig_key *md5;
__u8 *md5_hash_location;
+ int tcp_header_size;
int mss;
skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15, 1, GFP_ATOMIC);
@@ -2275,7 +2453,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
#endif
TCP_SKB_CB(skb)->when = tcp_time_stamp;
tcp_header_size = tcp_synack_options(sk, req, mss,
- skb, &opts, &md5) +
+ skb, &opts, &md5, xvp) +
sizeof(struct tcphdr);
skb_push(skb, tcp_header_size);
@@ -2293,6 +2471,25 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
*/
tcp_init_nondata_skb(skb, tcp_rsk(req)->snt_isn,
TCPCB_FLAG_SYN | TCPCB_FLAG_ACK);
+
+ /* If cookies are active, and constant data is available, copy it
+ * directly from the listening socket.
+ */
+ if (NULL != xvp
+ && !xvp->cookie_out_never
+ && 0 < xvp->cookie_plus
+ && tp->s_data_constant) {
+ const struct tcp_cookie_values *cvp = tp->cookie_values;
+
+ if (NULL != cvp
+ && 0 < cvp->s_data_desired) {
+ u8 *buf = skb_put(skb, cvp->s_data_desired);
+
+ memcpy(buf, cvp->s_data_payload, cvp->s_data_desired);
+ TCP_SKB_CB(skb)->end_seq += cvp->s_data_desired;
+ }
+ }
+
th->seq = htonl(TCP_SKB_CB(skb)->seq);
th->ack_seq = htonl(tcp_rsk(req)->rcv_isn + 1);
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index c46da53..a0ad07b 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -159,6 +159,8 @@ static inline int cookie_check(struct sk_buff *skb, __u32 cookie)
struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
{
+ struct tcp_options_received tcp_opt;
+ u8 *cryptic_value;
struct inet_request_sock *ireq;
struct inet6_request_sock *ireq6;
struct tcp_request_sock *treq;
@@ -171,7 +173,6 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
int mss;
struct dst_entry *dst;
__u8 rcv_wscale;
- struct tcp_options_received tcp_opt;
if (!sysctl_tcp_syncookies || !th->ack)
goto out;
@@ -186,7 +187,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
/* check for timestamp cookie support */
memset(&tcp_opt, 0, sizeof(tcp_opt));
- tcp_parse_options(skb, &tcp_opt, 0);
+ tcp_parse_options(skb, &tcp_opt, &cryptic_value, 0);
if (tcp_opt.saw_tstamp)
cookie_check_timestamp(&tcp_opt);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 3960d72..f3e44be 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1162,11 +1162,13 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk,struct sk_buff *skb)
*/
static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
{
+ struct tcp_extend_values tmp_ext;
+ struct tcp_options_received tmp_opt;
+ u8 *cryptic_value;
struct inet6_request_sock *treq;
struct ipv6_pinfo *np = inet6_sk(sk);
- struct tcp_options_received tmp_opt;
- struct tcp_sock *tp = tcp_sk(sk);
struct request_sock *req = NULL;
+ struct tcp_sock *tp = tcp_sk(sk);
__u32 isn = TCP_SKB_CB(skb)->when;
#ifdef CONFIG_SYN_COOKIES
int want_cookie = 0;
@@ -1206,7 +1208,29 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
tmp_opt.user_mss = tp->rx_opt.user_mss;
- tcp_parse_options(skb, &tmp_opt, 0);
+ tcp_parse_options(skb, &tmp_opt, &cryptic_value, 0);
+
+ if (0 < tmp_opt.cookie_plus
+ && tmp_opt.saw_tstamp
+ && !tp->cookie_out_never
+ && (0 < sysctl_tcp_cookie_size
+ || (NULL != tp->cookie_values
+ && 0 < tp->cookie_values->cookie_desired))) {
+#ifdef CONFIG_SYN_COOKIES
+ want_cookie = 0; /* not our kind of cookie */
+#endif
+ tmp_ext.cookie_out_never = 0; /* false */
+ tmp_ext.cookie_plus = tmp_opt.cookie_plus;
+
+ /* secret recipe not yet implemented */
+ } else if (!tp->cookie_in_always) {
+ /* redundant indications, but ensure initialization. */
+ tmp_ext.cookie_out_never = 1; /* true */
+ tmp_ext.cookie_plus = 0;
+ } else {
+ goto drop;
+ }
+ tmp_ext.cookie_in_always = tp->cookie_in_always;
if (want_cookie && !tmp_opt.saw_tstamp)
tcp_clear_options(&tmp_opt);
@@ -1244,7 +1268,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
security_inet_conn_request(sk, skb, req);
- if (tcp_v6_send_synack(sk, req, NULL) ||
+ if (tcp_v6_send_synack(sk, req, (struct request_values *)&tmp_ext) ||
want_cookie)
goto drop;
@@ -1850,7 +1874,7 @@ static int tcp_v6_init_sock(struct sock *sk)
*/
tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
tp->snd_cwnd_clamp = ~0;
- tp->mss_cache = 536;
+ tp->mss_cache = TCP_MIN_RCVMSS;
tp->reordering = sysctl_tcp_reordering;
@@ -1866,6 +1890,19 @@ static int tcp_v6_init_sock(struct sock *sk)
tp->af_specific = &tcp_sock_ipv6_specific;
#endif
+ /* TCP Cookie Transactions */
+ if (0 < sysctl_tcp_cookie_size) {
+ /* Default, cookies without s_data. */
+ tp->cookie_values =
+ kzalloc(sizeof(*tp->cookie_values),
+ sk->sk_allocation);
+ if (NULL != tp->cookie_values)
+ kref_init(&tp->cookie_values->kref);
+ }
+ /* Presumed zeroed, in order of appearance:
+ * cookie_in_always, cookie_out_never, extend_timestamp,
+ * s_data_constant, s_data_in, s_data_out
+ */
sk->sk_sndbuf = sysctl_tcp_wmem[1];
sk->sk_rcvbuf = sysctl_tcp_rmem[1];
--
1.6.0.4
^ permalink raw reply related
* Re: [PATCH 1/5] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
From: Mel Gorman @ 2009-10-27 12:27 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: David Rientjes, Frans Pop, Jiri Kosina, Sven Geggus,
Karol Lewandowski, Tobias Oetiker, Rafael J. Wysocki,
David Miller, Reinette Chatre, Kalle Valo, Mohamed Abbas,
Jens Axboe, John W. Linville, Pekka Enberg,
Bartlomiej Zolnierkiewicz, Greg Kroah-Hartman,
Stephan von Krawczynski, Kernel Testers List, netdev,
linux-kernel, linux-mm@kvack.org
In-Reply-To: <20091026222159.2F72.A69D9226@jp.fujitsu.com>
On Tue, Oct 27, 2009 at 11:42:55AM +0900, KOSAKI Motohiro wrote:
> > On Mon, 26 Oct 2009, KOSAKI Motohiro wrote:
> >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index bf72055..5a27896 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1899,6 +1899,12 @@ rebalance:
> > > if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
> > > /* Wait for some write requests to complete then retry */
> > > congestion_wait(BLK_RW_ASYNC, HZ/50);
> > > +
> > > + /*
> > > + * While we wait congestion wait, Amount of free memory can
> > > + * be changed dramatically. Thus, we kick kswapd again.
> > > + */
> > > + wake_all_kswapd(order, zonelist, high_zoneidx);
> > > goto rebalance;
> > > }
> > >
> >
> > We're blocking to finish writeback of the directly reclaimed memory, why
> > do we need to wake kswapd afterwards?
>
> the same reason of "goto restart" case. that's my intention.
> if following scenario occur, it is equivalent that we didn't call wake_all_kswapd().
>
> 1. call congestion_wait()
> 2. kswapd reclaimed lots memory and sleep
> 3. another task consume lots memory
> 4. wakeup from congestion_wait()
>
> IOW, if we falled into __alloc_pages_slowpath(), we naturally expect
> next page_alloc() don't fall into slowpath. however if kswapd end to
> its work too early, this assumption isn't true.
>
> Is this too pessimistic assumption?
>
hmm.
The reason it's not woken in both cases a second time was to match the
behaviour of 2.6.30. If the direct reclaimer goes asleep and another task
consumes the memory the direct reclaimer freed then the greedy process should
kick kswapd back awake again as free memory goes below the low watermark.
However, if the greedy process was allocating order-0, it's possible that
the watermarks for order-0 are being met leaving kswapd alone where as the
high-order ones are not leaving kswapd to go back asleep or to reclaim at
the wrong order.
It's a functional change but I can add it to the list of things to
consider. Thanks
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* [net-next-2.6 PATCH v4 2/3] TCPCT part 1b: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS, functions
From: William Allen Simpson @ 2009-10-27 12:18 UTC (permalink / raw)
To: Linux Kernel Network Developers
In-Reply-To: <4AE6E35C.2050101@gmail.com>
[-- Attachment #1: Type: text/plain, Size: 1294 bytes --]
Define sysctl (tcp_cookie_size) to turn on and off the cookie option
default globally, instead of a compiled configuration option.
Define per socket option (TCP_COOKIE_TRANSACTIONS) for setting constant
data values, retrieving variable cookie values, and other facilities.
Redefine two TCP header functions to accept TCP header pointer.
When subtracting, return signed int to allow error checking.
Move inline tcp_clear_options() unchanged from net/tcp.h to linux/tcp.h,
near its corresponding struct tcp_options_received (prior to changes).
This is a straightforward re-implementation of an earlier (year-old)
patch that no longer applies cleanly, with permission of the original
author (Adam Langley). The patch was previously reviewed:
http://thread.gmane.org/gmane.linux.network/102586
These functions will also be used in subsequent patches that implement
additional features.
Signed-off-by: William.Allen.Simpson@gmail.com
---
Documentation/networking/ip-sysctl.txt | 8 +++++
include/linux/tcp.h | 47 +++++++++++++++++++++++++++++++-
include/net/tcp.h | 6 +---
net/ipv4/sysctl_net_ipv4.c | 8 +++++
net/ipv4/tcp_output.c | 8 +++++
5 files changed, 71 insertions(+), 6 deletions(-)
[-- Attachment #2: TCPCT+1b4.patch --]
[-- Type: text/plain, Size: 5734 bytes --]
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index a0e134d..d47c000 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -164,6 +164,14 @@ tcp_congestion_control - STRING
additional choices may be available based on kernel configuration.
Default is set as part of kernel configuration.
+tcp_cookie_size - INTEGER
+ Default size of TCP Cookie Transactions (TCPCT) option, that may be
+ overridden on a per socket basis by the TCPCT socket option.
+ Values greater than the maximum (16) are interpreted as the maximum.
+ Values greater than zero and less than the minimum (8) are interpreted
+ as the minimum.
+ Default: 0 (off).
+
tcp_dsack - BOOLEAN
Allows TCP to send "duplicate" SACKs.
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 61723a7..6fd59d1 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -96,6 +96,7 @@ enum {
#define TCP_QUICKACK 12 /* Block/reenable quick acks */
#define TCP_CONGESTION 13 /* Congestion control algorithm */
#define TCP_MD5SIG 14 /* TCP MD5 Signature (RFC2385) */
+#define TCP_COOKIE_TRANSACTIONS 15 /* TCP Cookie Transactions */
#define TCPI_OPT_TIMESTAMPS 1
#define TCPI_OPT_SACK 2
@@ -170,6 +171,34 @@ struct tcp_md5sig {
__u8 tcpm_key[TCP_MD5SIG_MAXKEYLEN]; /* key (binary) */
};
+/* for TCP_COOKIE_TRANSACTIONS (TCPCT) socket option */
+#define TCP_COOKIE_MAX 16 /* 128-bits */
+#define TCP_COOKIE_MIN 8 /* 64-bits */
+#define TCP_COOKIE_PAIR_SIZE (2*TCP_COOKIE_MAX)
+
+#define TCP_S_DATA_MAX 64U /* after TCP+IP options */
+#define TCP_S_DATA_MSS_DEFAULT 536U /* default MSS (RFC1122) */
+
+/* Flags for both getsockopt and setsockopt */
+#define TCP_COOKIE_IN_ALWAYS (1 << 0) /* Discard SYN without cookie */
+#define TCP_COOKIE_OUT_NEVER (1 << 1) /* Prohibit outgoing cookies,
+ * supercedes everything else. */
+#define TCP_EXTEND_TIMESTAMP (1 << 4) /* Initiate 64-bit timestamps */
+
+/* Flags for getsockopt */
+#define TCP_S_DATA_IN (1 << 2) /* Was data received? */
+#define TCP_S_DATA_OUT (1 << 3) /* Was data sent? */
+
+/* TCP_COOKIE_TRANSACTIONS data */
+struct tcp_cookie_transactions {
+ __u16 tcpct_flags; /* see above */
+ __u8 __tcpct_pad1; /* zero */
+ __u8 tcpct_cookie_desired; /* bytes */
+ __u16 tcpct_s_data_desired; /* bytes of variable data */
+ __u16 tcpct_used; /* bytes in value */
+ __u8 tcpct_value[TCP_S_DATA_MSS_DEFAULT];
+};
+
#ifdef __KERNEL__
#include <linux/skbuff.h>
@@ -193,6 +222,17 @@ static inline unsigned int tcp_optlen(const struct sk_buff *skb)
return (tcp_hdr(skb)->doff - 5) * 4;
}
+static inline unsigned int tcp_header_len_th(const struct tcphdr *th)
+{
+ return th->doff * 4;
+}
+
+/* When doff is bad, this could be negative. */
+static inline int tcp_option_len_th(const struct tcphdr *th)
+{
+ return (int)(th->doff * 4) - sizeof(*th);
+}
+
/* This defines a selective acknowledgement block. */
struct tcp_sack_block_wire {
__be32 start_seq;
@@ -223,6 +263,11 @@ struct tcp_options_received {
u16 mss_clamp; /* Maximal mss, negotiated at connection setup */
};
+static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
+{
+ rx_opt->tstamp_ok = rx_opt->sack_ok = rx_opt->wscale_ok = rx_opt->snd_wscale = 0;
+}
+
/* This is the max number of SACKS that we'll generate and process. It's safe
* to increse this, although since:
* size = TCPOLEN_SACK_BASE_ALIGNED (4) + n * TCPOLEN_SACK_PERBLOCK (8)
@@ -431,6 +476,6 @@ static inline struct tcp_timewait_sock *tcp_twsk(const struct sock *sk)
return (struct tcp_timewait_sock *)sk;
}
-#endif
+#endif /* __KERNEL__ */
#endif /* _LINUX_TCP_H */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 4abdb4d..142f32e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -237,6 +237,7 @@ extern int sysctl_tcp_base_mss;
extern int sysctl_tcp_workaround_signed_windows;
extern int sysctl_tcp_slow_start_after_idle;
extern int sysctl_tcp_max_ssthresh;
+extern int sysctl_tcp_cookie_size;
extern atomic_t tcp_memory_allocated;
extern struct percpu_counter tcp_sockets_allocated;
@@ -343,11 +344,6 @@ static inline void tcp_dec_quickack_mode(struct sock *sk,
extern void tcp_enter_quickack_mode(struct sock *sk);
-static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
-{
- rx_opt->tstamp_ok = rx_opt->sack_ok = rx_opt->wscale_ok = rx_opt->snd_wscale = 0;
-}
-
#define TCP_ECN_OK 1
#define TCP_ECN_QUEUE_CWR 2
#define TCP_ECN_DEMAND_CWR 4
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 2dcf04d..3422c54 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -714,6 +714,14 @@ static struct ctl_table ipv4_table[] = {
},
{
.ctl_name = CTL_UNNUMBERED,
+ .procname = "tcp_cookie_size",
+ .data = &sysctl_tcp_cookie_size,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
.procname = "udp_mem",
.data = &sysctl_udp_mem,
.maxlen = sizeof(sysctl_udp_mem),
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 79042f4..bcfbe41 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -59,6 +59,14 @@ int sysctl_tcp_base_mss __read_mostly = 512;
/* By default, RFC2861 behavior. */
int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
+#ifdef CONFIG_SYSCTL
+/* By default, let the user enable it. */
+int sysctl_tcp_cookie_size __read_mostly = 0;
+#else
+int sysctl_tcp_cookie_size __read_mostly = TCP_COOKIE_MAX;
+#endif
+
+
/* Account for new data that has been sent to the network. */
static void tcp_event_new_data_sent(struct sock *sk, struct sk_buff *skb)
{
--
1.6.0.4
^ permalink raw reply related
* [net-next-2.6 PATCH v4 1/3] TCPCT part 1a: add request_values parameter for sending SYNACK
From: William Allen Simpson @ 2009-10-27 12:15 UTC (permalink / raw)
To: Linux Kernel Network Developers
In-Reply-To: <4AE6E35C.2050101@gmail.com>
[-- Attachment #1: Type: text/plain, Size: 933 bytes --]
Add optional function parameters associated with sending SYNACK.
These parameters are not needed after sending SYNACK, and are not
used for retransmission. Avoids extending struct tcp_request_sock,
and avoids allocating kernel memory.
Also affects DCCP as it uses common struct request_sock_ops,
but this parameter is currently reserved for future use.
Signed-off-by: William.Allen.Simpson@gmail.com
---
include/net/request_sock.h | 8 +++++++-
include/net/tcp.h | 1 +
net/dccp/ipv4.c | 5 +++--
net/dccp/ipv6.c | 5 +++--
net/dccp/minisocks.c | 2 +-
net/ipv4/inet_connection_sock.c | 2 +-
net/ipv4/tcp_ipv4.c | 11 +++++++----
net/ipv4/tcp_minisocks.c | 2 +-
net/ipv4/tcp_output.c | 1 +
net/ipv6/tcp_ipv6.c | 14 +++++++-------
10 files changed, 32 insertions(+), 19 deletions(-)
[-- Attachment #2: TCPCT+1a4.patch --]
[-- Type: text/plain, Size: 7357 bytes --]
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index c719084..c9b50eb 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -27,13 +27,19 @@ struct sk_buff;
struct dst_entry;
struct proto;
+/* empty to "strongly type" an otherwise void parameter.
+ */
+struct request_values {
+};
+
struct request_sock_ops {
int family;
int obj_size;
struct kmem_cache *slab;
char *slab_name;
int (*rtx_syn_ack)(struct sock *sk,
- struct request_sock *req);
+ struct request_sock *req,
+ struct request_values *rvp);
void (*send_ack)(struct sock *sk, struct sk_buff *skb,
struct request_sock *req);
void (*send_reset)(struct sock *sk,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 03a49c7..4abdb4d 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -443,6 +443,7 @@ extern int tcp_connect(struct sock *sk);
extern struct sk_buff * tcp_make_synack(struct sock *sk,
struct dst_entry *dst,
+ struct request_values *rvp,
struct request_sock *req);
extern int tcp_disconnect(struct sock *sk, int flags);
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 00028d4..8d4a88a 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -477,7 +477,8 @@ static struct dst_entry* dccp_v4_route_skb(struct net *net, struct sock *sk,
return &rt->u.dst;
}
-static int dccp_v4_send_response(struct sock *sk, struct request_sock *req)
+static int dccp_v4_send_response(struct sock *sk, struct request_sock *req,
+ struct request_values *rv_unused)
{
int err = -1;
struct sk_buff *skb;
@@ -626,7 +627,7 @@ int dccp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
dreq->dreq_iss = dccp_v4_init_sequence(skb);
dreq->dreq_service = service;
- if (dccp_v4_send_response(sk, req))
+ if (dccp_v4_send_response(sk, req, NULL))
goto drop_and_free;
inet_csk_reqsk_queue_hash_add(sk, req, DCCP_TIMEOUT_INIT);
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 6d89f9f..f64981e 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -241,7 +241,8 @@ out:
}
-static int dccp_v6_send_response(struct sock *sk, struct request_sock *req)
+static int dccp_v6_send_response(struct sock *sk, struct request_sock *req,
+ struct request_values *rv_unused)
{
struct inet6_request_sock *ireq6 = inet6_rsk(req);
struct ipv6_pinfo *np = inet6_sk(sk);
@@ -468,7 +469,7 @@ static int dccp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
dreq->dreq_iss = dccp_v6_init_sequence(skb);
dreq->dreq_service = service;
- if (dccp_v6_send_response(sk, req))
+ if (dccp_v6_send_response(sk, req, NULL))
goto drop_and_free;
inet6_csk_reqsk_queue_hash_add(sk, req, DCCP_TIMEOUT_INIT);
diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c
index 5ca49ce..af226a0 100644
--- a/net/dccp/minisocks.c
+++ b/net/dccp/minisocks.c
@@ -184,7 +184,7 @@ struct sock *dccp_check_req(struct sock *sk, struct sk_buff *skb,
* counter (backoff, monitored by dccp_response_timer).
*/
req->retrans++;
- req->rsk_ops->rtx_syn_ack(sk, req);
+ req->rsk_ops->rtx_syn_ack(sk, req, NULL);
}
/* Network Duplicate, discard packet */
return NULL;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index f6a0af7..b81d3c9 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -504,7 +504,7 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
if (time_after_eq(now, req->expires)) {
if ((req->retrans < thresh ||
(inet_rsk(req)->acked && req->retrans < max_retries))
- && !req->rsk_ops->rtx_syn_ack(parent, req)) {
+ && !req->rsk_ops->rtx_syn_ack(parent, req, NULL)) {
unsigned long timeo;
if (req->retrans++ == 0)
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a4a3390..6edc5e2 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -743,6 +743,7 @@ static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
* socket.
*/
static int __tcp_v4_send_synack(struct sock *sk, struct request_sock *req,
+ struct request_values *rvp,
struct dst_entry *dst)
{
const struct inet_request_sock *ireq = inet_rsk(req);
@@ -753,7 +754,7 @@ static int __tcp_v4_send_synack(struct sock *sk, struct request_sock *req,
if (!dst && (dst = inet_csk_route_req(sk, req)) == NULL)
return -1;
- skb = tcp_make_synack(sk, dst, req);
+ skb = tcp_make_synack(sk, dst, rvp, req);
if (skb) {
struct tcphdr *th = tcp_hdr(skb);
@@ -774,9 +775,10 @@ static int __tcp_v4_send_synack(struct sock *sk, struct request_sock *req,
return err;
}
-static int tcp_v4_send_synack(struct sock *sk, struct request_sock *req)
+static int tcp_v4_send_synack(struct sock *sk, struct request_sock *req,
+ struct request_values *rvp)
{
- return __tcp_v4_send_synack(sk, req, NULL);
+ return __tcp_v4_send_synack(sk, req, rvp, NULL);
}
/*
@@ -1334,7 +1336,8 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
}
tcp_rsk(req)->snt_isn = isn;
- if (__tcp_v4_send_synack(sk, req, dst) || want_cookie)
+ if (__tcp_v4_send_synack(sk, req, NULL, dst) ||
+ want_cookie)
goto drop_and_free;
inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index e320afe..8819882 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -537,7 +537,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
* Enforce "SYN-ACK" according to figure 8, figure 6
* of RFC793, fixed by RFC1122.
*/
- req->rsk_ops->rtx_syn_ack(sk, req);
+ req->rsk_ops->rtx_syn_ack(sk, req, NULL);
return NULL;
}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 2e2eb74..79042f4 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2219,6 +2219,7 @@ int tcp_send_synack(struct sock *sk)
/* Prepare a SYN-ACK. */
struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
+ struct request_values *rvp,
struct request_sock *req)
{
struct inet_request_sock *ireq = inet_rsk(req);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index c54ec36..3960d72 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -461,7 +461,8 @@ out:
}
-static int tcp_v6_send_synack(struct sock *sk, struct request_sock *req)
+static int tcp_v6_send_synack(struct sock *sk, struct request_sock *req,
+ struct request_values *rvp)
{
struct inet6_request_sock *treq = inet6_rsk(req);
struct ipv6_pinfo *np = inet6_sk(sk);
@@ -499,7 +500,7 @@ static int tcp_v6_send_synack(struct sock *sk, struct request_sock *req)
if ((err = xfrm_lookup(sock_net(sk), &dst, &fl, sk, 0)) < 0)
goto done;
- skb = tcp_make_synack(sk, dst, req);
+ skb = tcp_make_synack(sk, dst, rvp, req);
if (skb) {
struct tcphdr *th = tcp_hdr(skb);
@@ -1243,13 +1244,12 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
security_inet_conn_request(sk, skb, req);
- if (tcp_v6_send_synack(sk, req))
+ if (tcp_v6_send_synack(sk, req, NULL) ||
+ want_cookie)
goto drop;
- if (!want_cookie) {
- inet6_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
- return 0;
- }
+ inet6_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+ return 0;
drop:
if (req)
--
1.6.0.4
^ permalink raw reply related
* [net-next-2.6 v4 0/4] TCPCT part1: cookie option exchange
From: William Allen Simpson @ 2009-10-27 12:11 UTC (permalink / raw)
To: Linux Kernel Network Developers
Updated to patch cleanly with recent commits.
Changes from previous version:
* Strongly type otherwise void parameter with empty struct.
* Combined two shorter patches (b + c), dropping update of 2 drivers,
as nobody has gotten back to me about testing results.
* Added a little Documentation for new sysctl.
^ permalink raw reply
* netlink: symantics for NLM_F_MULTI and NLMSG_DONE ?
From: Roger Willcocks @ 2009-10-27 11:43 UTC (permalink / raw)
To: netdev
The netlink rfc (3549) says that in multi-part messages (messages split
over multiple netlink packets), the first and all following headers
have the NLM_F_MULTI flag set except for the last header which has the
header type NLMSG_DONE.
afaict NLMSG_DONE is redundant, since one could simply not set the
NLM_F_MULTI flag for the final header; and if a multi-part message
can't cross a datagram boundary - which is assumed by e.g. the reading
example in
http://www.kernel.org/doc/man-pages/online/pages/man7/netlink.7.html
- NLMSG_DONE is redundant too.
And the rfc doesn't say what NLMSG_DONE /and/ NLM_F_MULTI should mean,
although net/netlink/af_netlink.c always sets both.
So, what are the symantics for NLM_F_MULTI and NLMSG_DONE?
^ permalink raw reply
* phy_connect_direct problem
From: Kristoffer Glembo @ 2009-10-27 10:28 UTC (permalink / raw)
To: netdev
Hi,
I'm in the process of converting an ethernet driver to using the PHY
layer, but ran into a problem where my adjust_state handler was never
called and the link did not go up ...
I use phy_connect to connect the PHY device with my ethernet driver. In
phy_connect_direct I found the following code:
phy_prepare_link(phydev, handler);
phy_start_machine(phydev, NULL);
So first the adjust_state is set to handler by phy_prepare_link and then
it is overwritten with the NULL parameter passed to phy_start_machine.
Maybe I'm abusing the PHY layer somehow since there are plenty of other
drivers using this interface. However the patch below solved the issue
for me.
Best regards,
Kristoffer Glembo
---
drivers/net/phy/phy_device.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index b10fedd..d27ca80 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -311,8 +311,7 @@ int phy_connect_direct(struct net_device *dev,
struct phy_device *phydev,
if (rc)
return rc;
- phy_prepare_link(phydev, handler);
- phy_start_machine(phydev, NULL);
+ phy_start_machine(phydev, handler);
if (phydev->irq > 0)
phy_start_interrupts(phydev);
--
1.5.2.2
^ permalink raw reply related
* Re: [PATCH 0/5] Candidate fix for increased number of GFP_ATOMIC failures V2
From: Mel Gorman @ 2009-10-27 10:40 UTC (permalink / raw)
To: reinette chatre
Cc: Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski,
Tobias Oetiker, Rafael J. Wysocki, David Miller, Kalle Valo,
David Rientjes, KOSAKI Motohiro, Abbas, Mohamed, Jens Axboe,
John W. Linville, Pekka Enberg, Bartlomiej Zolnierkiewicz,
Greg Kroah-Hartman, Stephan von Krawczynski, Kernel Testers List,
netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org
In-Reply-To: <1256226219.21134.1493.camel@rc-desk>
On Thu, Oct 22, 2009 at 08:43:38AM -0700, reinette chatre wrote:
> On Thu, 2009-10-22 at 07:22 -0700, Mel Gorman wrote:
> > [Bug #14141] order 2 page allocation failures in iwlagn
> > Commit 4752c93c30441f98f7ed723001b1a5e3e5619829 introduced GFP_ATOMIC
> > allocations within the wireless driver. This has caused large numbers
> > of failure reports to occur as reported by Frans Pop. Fixing this
> > requires changes to the driver if it wants to use GFP_ATOMIC which
> > is in the hands of Mohamed Abbas and Reinette Chatre. However,
> > it is very likely that it has being compounded by core mm changes
> > that this series is aimed at.
>
> Driver has been changed to allocate paged skb for its receive buffers.
> This reduces amount of memory needed from order-2 to order-1. This work
> is significant and will thus be in 2.6.33.
>
What do you want to do for -stable in 2.6.31?
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply
* Re: [PATCH 5/5] ONLY-APPLY-IF-STILL-FAILING Revert 373c0a7e, 8aa7e847: Fix congestion_wait() sync/async vs read/write confusion
From: Frans Pop @ 2009-10-27 10:29 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Mel Gorman, Jiri Kosina, Sven Geggus, Karol Lewandowski,
Tobias Oetiker, Rafael J. Wysocki, David Miller, Reinette Chatre,
Kalle Valo, David Rientjes, Mohamed Abbas, Jens Axboe,
John W. Linville, Pekka Enberg, Bartlomiej Zolnierkiewicz,
Greg Kroah-Hartman, Stephan von Krawczynski, Kernel Testers List,
netdev, linux-kernel, linux-mm@kvack.org
In-Reply-To: <20091026235628.2F7B.A69D9226@jp.fujitsu.com>
On Tuesday 27 October 2009, KOSAKI Motohiro wrote:
> Oops. no, please no.
> 8aa7e847 is regression fixing commit. this revert indicate the
> regression occur again.
> if we really need to revert it, we need to revert 1faa16d2287 too.
> however, I doubt this commit really cause regression to iwlan. IOW,
> I agree Jens.
This is not intended as a patch for mainline, but just as a test to see if
it improves things. It may be a regression fix, but it also creates a
significant change in behavior during swapping in my test case.
If a fix is needed, it will probably by different from this revert.
Please read: http://lkml.org/lkml/2009/10/26/510.
This mail has some data: http://lkml.org/lkml/2009/10/26/455.
> I hope to try reproduce this problem on my test environment. Can anyone
> please explain reproduce way?
Please see my mails in this thread for bug #14141:
http://thread.gmane.org/gmane.linux.kernel/896714
You will probably need to read some of them to understand the context of
the two mails linked above.
The most relevant ones are (all from the same thread; not sure why gmane
gives such weird links):
http://article.gmane.org/gmane.linux.kernel.mm/39909
http://article.gmane.org/gmane.linux.kernel.kernel-testers/7228
http://article.gmane.org/gmane.linux.kernel.kernel-testers/7165
> Is special hardware necessary?
Not special hardware, but you may need an encrypted partition and NFS; the
test may need to be modified according to the amount of memory you have.
I think it should be possible to reproduce the freezes I see while ignoring
the SKB allocation errors as IMO those are just a symptom, not the cause.
So you should not need wireless.
The severity of the freezes during my test often increases if the test is
repeated (without rebooting).
Cheers,
FJP
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* Re: [PATCH 6/9] ser_gigaset: checkpatch cleanup
From: Karsten Keil @ 2009-10-27 10:20 UTC (permalink / raw)
To: isdn4linux
Cc: Tilman Schmidt, Joe Perches, netdev, linux-kernel, i4ldeveloper,
Hansjoerg Lipp, David Miller
In-Reply-To: <4AE637D8.60809@imap.cc>
On Dienstag, 27. Oktober 2009 00:59:20 Tilman Schmidt wrote:
> Am 26.10.2009 01:54 schrieb Joe Perches:
> > On Sun, 2009-10-25 at 20:30 +0100, Tilman Schmidt wrote:
> >> Duly uglified as demanded by checkpatch.pl.
> >> diff --git a/drivers/isdn/gigaset/ser-gigaset.c
> >> b/drivers/isdn/gigaset/ser-gigaset.c index 3071a52..ac3409e 100644
> >> --- a/drivers/isdn/gigaset/ser-gigaset.c
> >> +++ b/drivers/isdn/gigaset/ser-gigaset.c
> >> @@ -164,9 +164,15 @@ static void gigaset_modem_fill(unsigned long data)
> >> {
> >> struct cardstate *cs = (struct cardstate *) data;
> >> struct bc_state *bcs;
> >> + struct sk_buff *nextskb;
> >> int sent = 0;
> >>
> >> - if (!cs || !(bcs = cs->bcs)) {
> >> + if (!cs) {
> >> + gig_dbg(DEBUG_OUTPUT, "%s: no cardstate", __func__);
> >> + return;
> >> + }
> >> + bcs = cs->bcs;
> >> + if (!bcs) {
> >> gig_dbg(DEBUG_OUTPUT, "%s: no cardstate", __func__);
> >> return;
> >> }
> >
> > perhaps:
> > if (!cs || !cs->bcs) {
> > gig_dbg(DEBUG_OUTPUT, "%s: no cardstate", __func__);
> > return;
> > }
> > bcs = cs->bcs;
>
> That would evaluate cs->bcs twice, and is also, in my experience,
gcc should handle this subsequent double evaluation well enough.
> significantly more prone to easily overlooked typos which result in
> checking a different pointer in the if statement than the one that's
> actually used in the subsequent assignment.
>
Yes this may happen, but more often a = in if statements should be a ==.
The kernel code style says only one statement per line, which implies no
assignments in if statements, so we should follow this.
The checkpatch.pl script complain about these issues.
Yes sometimes in the past (while preparing some old code for kernel submit)
I was not very happy about all these rules, it take lot time to reach zero
reports from checkpatch.
But it make lot of sense in the long term, currently I have to debug code
written without any code style, it is really, really painful and I need 5-10
times more time for simple understanding the basic function of the code.
Karsten
^ permalink raw reply
* Re: [PATCH] vlan: allow VLAN ID 0 to be used
From: Eric Dumazet @ 2009-10-27 10:02 UTC (permalink / raw)
To: Benny Amorsen
Cc: Gertjan Hofman, Matt Carlson, netdev@vger.kernel.org,
Patrick McHardy, David S. Miller
In-Reply-To: <m3aazdb1ue.fsf@ursa.amorsen.dk>
Benny Amorsen a écrit :
> Eric Dumazet <eric.dumazet@gmail.com> writes:
>
>> Here is the patch I cooked that permitted VLAN 0 to be used with tg3
>> (and other HW accelerated vlan nics I suppose)
>>
>> [PATCH] vlan: allow VLAN ID 0 to be used
>>
>> We currently use a 16 bit field (vlan_tci) to store VLAN ID on a skb.
>>
>> 0 value is used a special value, meaning VLAN ID not set.
>> This forbids use of VLAN ID 0
>
> Are you sure you actually want to do this?
>
> VLAN 0 IS special. Frames received on VLAN 0 should be treated just as
> if they had no VLAN tag at all, except that they have an 802.1p value.
>
> Sending frames with VLAN 0 should have something to do with whether
> the sender wants to use 802.1p, which doesn't really have much to do
> with VLAN's at all...
>
> It would be nice if the unsuspecting user was at least warned that their
> use of VLAN 0 is non-standard and may cause surprising results like
> leakage into the "native" VLAN. That could be done in /sbin/ip or
> /sbin/vconfig, of course.
>
Quotting http://en.wikipedia.org/wiki/IEEE_802.1Q
VLAN Identifier (VID): a 12-bit field specifying the VLAN to which the frame belongs.
A value of 0 means that the frame doesn't belong to any VLAN; in this case the 802.1Q
tag specifies only a priority and is referred to as a priority tag.
A value of hex FFF is reserved for implementation use.
All other values may be used as VLAN identifiers, allowing up to 4094 VLANs
So we expect to generate a 802.1Q frame, even with a VID=0 field.
Before patch, device sends a non 802.1Q frame, which is not what was wanted by user.
(Maybe he wants to check its device/network is able to transport 1522 bytes frames, who knows...)
To use non tagged frames, user selects eth0 device, and to send tagged frames, he selects eth0.0
Now, maybe eth0 and eth0.0 should share same IP addresses, because incoming frame
with ID=0 tag should be received by eth0 device, but I am not sure standard requires this.
^ permalink raw reply
* Re: [PATCH] vlan: allow VLAN ID 0 to be used
From: Benny Amorsen @ 2009-10-27 9:52 UTC (permalink / raw)
To: Eric Dumazet
In-Reply-To: <4AE5CAC6.4000604@gmail.com>
Eric Dumazet <eric.dumazet@gmail.com> writes:
> Here is the patch I cooked that permitted VLAN 0 to be used with tg3
> (and other HW accelerated vlan nics I suppose)
>
> [PATCH] vlan: allow VLAN ID 0 to be used
>
> We currently use a 16 bit field (vlan_tci) to store VLAN ID on a skb.
>
> 0 value is used a special value, meaning VLAN ID not set.
> This forbids use of VLAN ID 0
Are you sure you actually want to do this?
VLAN 0 IS special. Frames received on VLAN 0 should be treated just as
if they had no VLAN tag at all, except that they have an 802.1p value.
Sending frames with VLAN 0 should have something to do with whether
the sender wants to use 802.1p, which doesn't really have much to do
with VLAN's at all...
It would be nice if the unsuspecting user was at least warned that their
use of VLAN 0 is non-standard and may cause surprising results like
leakage into the "native" VLAN. That could be done in /sbin/ip or
/sbin/vconfig, of course.
/Benny
^ permalink raw reply
* Re: performance regression in virtio-net in 2.6.32-rc4
From: Avi Kivity @ 2009-10-27 9:41 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Rusty Russell, virtualization, kvm, netdev
In-Reply-To: <20091026184835.GB26473@redhat.com>
On 10/26/2009 08:48 PM, Michael S. Tsirkin wrote:
> Hi!
> I noticed a performance regression in virtio net: going from
> 2.6.31 to 2.6.32-rc4 I see this, for guest to host communication:
>
> Any tips on debugging this?
>
Lacking better advice, a bisect can help as a last resort. 'git bisect
start -- drivers/net drivers/virtio' will probably find it fastest.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply
* next tree for October 27: znet build failure
From: Sachin Sant @ 2009-10-27 9:35 UTC (permalink / raw)
To: linux-next; +Cc: Stephen Rothwell, netdev, David Miller
In-Reply-To: <20091027190850.9694fb39.sfr@canb.auug.org.au>
Fails to build on x86_32 with lots of errors related to znet
drivers/net/znet.c:107:29: error: wireless/i82593.h: No such file or directory
drivers/net/znet.c:133: error: field 'i593_init' has incomplete type
drivers/net/znet.c: In function 'znet_set_multicast_list':
drivers/net/znet.c:235: error: invalid application of 'sizeof' to incomplete type 'struct i82593_conf_block'
drivers/net/znet.c:247: error: dereferencing pointer to incomplete type
.....
Commit 879e9304... moved wireless/i82593.h to drivers/staging/wavelan/i82593.h
Thanks
-Sachin
--
---------------------------------
Sachin Sant
IBM Linux Technology Center
India Systems and Technology Labs
Bangalore, India
---------------------------------
^ permalink raw reply
* Re: Increasing TCP initial cwnd
From: Ilpo Järvinen @ 2009-10-27 7:38 UTC (permalink / raw)
To: Yair Gottdenker; +Cc: Netdev
In-Reply-To: <fa3cddca0910260807q714a24fdp1ffae601bb6f67ba@mail.gmail.com>
On Mon, 26 Oct 2009, Yair Gottdenker wrote:
> I already had a look in it. Their changes that are related to changing
> the initial cwnd (snd_cwnd) are in several places in the patch:
Right. I know what I've modified in that patch... ;-)
> 1. tcp_ipv4.c in function tcp_init_cwnd. since currently I am working
> with no metric this is irrelevant.
You managed to mix up this filename twice in a row in a different way :-).
...That function resides in tcp_input.c instead. And changes in
tcp_init_metrics are not completely irrelevant, even with no metrics it
is executed as "no metrics" works a bit differenctly than you seem to
think.
> 2. tcp_ipv4.c: in function tcp_v4_init_sock, which I applied
> 3. tcp_minisocks.c - which I applied
> . They patched kernel 2.6.18.8 which is very old and there were a lot
> of changes since that in the kernel tcp layer. I didn't test their
> patch but applying the same logic to 2.6.31.3 doesn't seems to change
> initial window size.
>
> It is clear that in order to correctly make the change to the initial
> cwnd, one should apply changes to many more scenarios like setting the
> initial cwnd after idle time and more. I want to proceed step by step,
> first the naive changes to make it work.
Hmm, it now rings a bell... You cannot directly set initial window to a
large value (or you can but that doesn't do anything on the wire) because
you hit first the limit of receiver advertised window that applies
auto-tuning (with a starting value was four IIRC).
--
i.
^ permalink raw reply
* Re: [PATCH] dcache: better name hash function
From: Eric Dumazet @ 2009-10-27 7:29 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Andrew Morton, Linus Torvalds, Octavian Purdila, netdev,
linux-kernel, Al Viro
In-Reply-To: <4AE69829.9070207@gmail.com>
Eric Dumazet a écrit :
>
>
> 511 value on 64bit, and 1023 on 32bit arches are nice because
> hashsz * sizeof(pointer) <= 4096, wasting space for one pointer only.
>
> Conclusion : jhash and 511/1023 hashsize for netdevices,
> no divides, only one multiply for the fold.
Just forget about 511 & 1023, as power of two works too.
-> 512 & 1024 + jhash
Guess what, David already said this :)
^ permalink raw reply
* Re: [PATCH] net: Adjust softirq raising in __napi_schedule
From: Johannes Berg @ 2009-10-27 7:01 UTC (permalink / raw)
To: Tilman Schmidt
Cc: Jarek Poplawski, David Miller,
hidave.darkstar-Re5JQEeQqe8AvxtiuMwx3w,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, tglx-hfZtesqFncYOwBW4kG4KsQ,
linux-wireless-u79uwXL29TY76Z2rM5mHXA,
linux-ppp-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
paulus-eUNUBHrolfbYtjvyW6yDsg, Michael Buesch, Oliver Hartkopp
In-Reply-To: <4AE64441.7060008-ZTO5kqT2PaM@public.gmane.org>
[-- Attachment #1: Type: text/plain, Size: 1706 bytes --]
On Tue, 2009-10-27 at 01:52 +0100, Tilman Schmidt wrote:
> > Any code (say ISDN code) that calls netif_rx() is clearly assuming to
> > always be running in (soft)irq context, otherwise it couldn't call
> > netif_rx() unconditionally. Agree so far?
>
> Well, in fact I'm not sure. :-) All I know is that in the ISDN case, no
> such assumption is explicitly stated anywhere. (The code in question is
> called from the rcvcallb_skb() callback method which the hardware driver
> calls when data has been received, and the description of that method in
> Documentation/isdn/INTERFACE does not say anything about the context in
> which it may be called.) The relevant code in drivers/isdn/i4l/isdn_ppp.c
> is rather old, perhaps even older than softirqs and the netif_rx() /
> netif_rx_ni() split. (Bear in mind that we are talking about the old
> ISDN4Linux subsystem which initially didn't even make it into the 2.6
> series because it was considered obsolete.) It seems quite possible to me
> that just no one ever thought about that question.
Heh :)
> > So now if you change the ISDN code to call netif_rx_ni(), you've changed
> > the assumption that the ISDN code makes -- that it is running in
> > (soft)irq context. Therefore, you need to verify that this is actually a
> > correct change, which is what I tried to say.
>
> Understood. However, the fact that the local_softirq_pending message is
> appearing would seem to indicate that this assumption was wrong to
> begin with, wouldn't it?
I thought it only recently started appearing with a new driver or
something, but I may have misunderstood you. Anyway, I think that sums
up the issue from my POV.
johannes
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 801 bytes --]
^ permalink raw reply
* Re: [PATCH] dcache: better name hash function
From: Eric Dumazet @ 2009-10-27 6:50 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Andrew Morton, Linus Torvalds, Octavian Purdila, netdev,
linux-kernel, Al Viro
In-Reply-To: <4AE68E23.20205@gmail.com>
Eric Dumazet a écrit :
> unsigned int fold2(unsigned hash)
> {
> return ((unsigned long long)hash * 211) >> 32;
> }
>
I tried this reciprocal thing with 511 and 1023 values and got on a PIII 550 MHz, gcc-3.3.2 :
# ./hashtest 100000 511
jhash_string 0.033123 1.01 234 1.06
fnv32 0.033911 1.02 254 1.38
# ./hashtest 1000000 511
jhash_string 0.331155 1.00 2109 1.10
fnv32 0.359346 1.00 2151 1.65
# ./hashtest 10000000 511
jhash_string 3.383340 1.00 19985 1.03
fnv32 3.849359 1.00 20198 1.53
# ./hashtest 100000 1023
jhash_string 0.033123 1.03 134 1.01
fnv32 0.034260 1.03 142 1.32
# ./hashtest 1000000 1023
jhash_string 0.332329 1.00 1075 1.06
fnv32 0.422035 1.00 1121 1.59
# ./hashtest 10000000 1023
jhash_string 3.417559 1.00 10107 1.01
fnv32 3.747563 1.00 10223 1.35
511 value on 64bit, and 1023 on 32bit arches are nice because
hashsz * sizeof(pointer) <= 4096, wasting space for one pointer only.
Conclusion : jhash and 511/1023 hashsize for netdevices,
no divides, only one multiply for the fold.
^ permalink raw reply
* RE: [patch next 3/4] netxen: fix bonding support
From: Dhananjay Phadke @ 2009-10-27 6:06 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: netdev@vger.kernel.org
In-Reply-To: <m1aazdl713.fsf@fess.ebiederm.org>
Eric,
Thanks for reporting and verifying.
-Dhananjay
________________________________________
From: Eric W. Biederman [ebiederm@xmission.com]
Sent: Monday, October 26, 2009 10:50 PM
To: Dhananjay Phadke
Cc: netdev@vger.kernel.org
Subject: Re: [patch next 3/4] netxen: fix bonding support
Dhananjay Phadke <dhananjay.phadke@qlogic.com> writes:
>> Yes. That should prevent the null pointer deference. Will it also
>> allow setting the mac address when the NIC is down?
>>
>> Eric
>
> Yes, we do save new address in netdev->dev_addr.
> This is later on programmed in hardware when interface is brought up.
Yep. Thanks. I just tested it and confirmed the crash is no longer there.
Eric
^ permalink raw reply
* Re: [PATCH] dcache: better name hash function
From: Eric Dumazet @ 2009-10-27 6:07 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Andrew Morton, Linus Torvalds, Octavian Purdila, netdev,
linux-kernel, Al Viro
In-Reply-To: <19864844.24581256620784317.JavaMail.root@tahiti.vyatta.com>
Stephen Hemminger a écrit :
> One of the root causes of slowness in network usage
> was my original choice of power of 2 for hash size, to avoid
> a mod operation. It turns out if size is not a power of 2
> the original algorithm works fairly well.
Interesting, but I suspect all users have power of 2 tables :(
>
> On slow cpu; with 10million entries and 211 hash size
>
>
>
> How important is saving the one division, versus getting better
> distribution.
unsigned int fold1(unsigned hash)
{
return hash % 211;
}
Compiler uses a reciprocal divide because of 211 being a constant.
And you also could try following that contains one multiply only,
and check if hash distribution properties are still OK
unsigned int fold2(unsigned hash)
{
return ((unsigned long long)hash * 211) >> 32;
}
fold1:
movl 4(%esp), %ecx
movl $-1689489505, %edx
movl %ecx, %eax
mull %edx
shrl $7, %edx
imull $211, %edx, %edx
subl %edx, %ecx
movl %ecx, %eax
ret
fold2:
movl $211, %eax
mull 4(%esp)
movl %edx, %eax
ret
^ permalink raw reply
* Re: [patch next 3/4] netxen: fix bonding support
From: Eric W. Biederman @ 2009-10-27 5:50 UTC (permalink / raw)
To: Dhananjay Phadke; +Cc: netdev@vger.kernel.org
In-Reply-To: <7608421F3572AB4292BB2532AE89D5658B0B6E4F74@AVEXMB1.qlogic.org>
Dhananjay Phadke <dhananjay.phadke@qlogic.com> writes:
>> Yes. That should prevent the null pointer deference. Will it also
>> allow setting the mac address when the NIC is down?
>>
>> Eric
>
> Yes, we do save new address in netdev->dev_addr.
> This is later on programmed in hardware when interface is brought up.
Yep. Thanks. I just tested it and confirmed the crash is no longer there.
Eric
^ permalink raw reply
* Re: [PATCH] dcache: better name hash function
From: David Miller @ 2009-10-27 5:24 UTC (permalink / raw)
To: stephen.hemminger
Cc: eric.dumazet, akpm, torvalds, opurdila, netdev, linux-kernel,
viro
In-Reply-To: <19864844.24581256620784317.JavaMail.root@tahiti.vyatta.com>
From: Stephen Hemminger <stephen.hemminger@vyatta.com>
Date: Mon, 26 Oct 2009 22:19:44 -0700 (PDT)
> How important is saving the one division, versus getting better
> distribution.
80 cpu cycles or more on some processors. Cheaper to use
jenkins with a power-of-2 sized hash.
^ permalink raw reply
* Re: [PATCH] dcache: better name hash function
From: Stephen Hemminger @ 2009-10-27 5:19 UTC (permalink / raw)
To: Eric Dumazet
Cc: Andrew Morton, Linus Torvalds, Octavian Purdila, netdev,
linux-kernel, Al Viro
In-Reply-To: <9986527.24561256620662709.JavaMail.root@tahiti.vyatta.com>
One of the root causes of slowness in network usage
was my original choice of power of 2 for hash size, to avoid
a mod operation. It turns out if size is not a power of 2
the original algorithm works fairly well.
On slow cpu; with 10million entries and 211 hash size
Algorithm Time Ratio Max StdDev
string10 1.271871 1.00 47397 0.01
djb2 1.406322 1.00 47452 0.12
SuperFastHash 1.422348 1.00 48400 1.99
string_hash31 1.424079 1.00 47437 0.08
jhash_string 1.459232 1.00 47954 1.01
sdbm 1.499209 1.00 47499 0.22
fnv32 1.539341 1.00 47728 0.75
full_name_hash 1.556792 1.00 47412 0.04
string_hash17 1.719039 1.00 47413 0.05
pjw 1.827365 1.00 47441 0.09
elf 2.033545 1.00 47441 0.09
fnv64 2.199533 1.00 47666 0.53
crc 5.705784 1.00 47913 0.95
md5_string 10.308376 1.00 47946 1.00
fletcher 1.418866 1.01 53189 18.65
adler32 2.842117 1.01 53255 18.79
kr_hash 1.175678 6.43 468517 507.44
xor 1.114692 11.02 583189 688.96
lastchar 0.795316 21.10 1000000 976.02
How important is saving the one division, versus getting better
distribution.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox