* RE: [patch -next] bna: off by one in bfa_msgq_rspq_pi_update()
From: Rasesh Mody @ 2011-08-24 17:33 UTC
To: Dan Carpenter
Cc: Debashis Dutt, open list:BROCADE BNA 10 GI...,
kernel-janitors@vger.kernel.org, Jing Huang
In-Reply-To: <20110824113028.GD5975@shale.localdomain>
>From: Dan Carpenter [mailto:error27@gmail.com]
>Sent: Wednesday, August 24, 2011 4:30 AM
>
>The rspq->rsphdlr[] array has BFI_MC_MAX elements, so this test was
>off by one.
>
>Signed-off-by: Dan Carpenter <error27@gmail.com>
>
>diff --git a/drivers/net/ethernet/brocade/bna/bfa_msgq.c
>b/drivers/net/ethernet/brocade/bna/bfa_msgq.c
>index ed52187..dd36427 100644
>--- a/drivers/net/ethernet/brocade/bna/bfa_msgq.c
>+++ b/drivers/net/ethernet/brocade/bna/bfa_msgq.c
>@@ -483,7 +483,7 @@ bfa_msgq_rspq_pi_update(struct bfa_msgq_rspq *rspq,
>struct bfi_mbmsg *mb)
> mc = msghdr->msg_class;
> num_entries = ntohs(msghdr->num_entries);
>
>- if ((mc > BFI_MC_MAX) || (rspq->rsphdlr[mc].cbfn == NULL))
>+ if ((mc >= BFI_MC_MAX) || (rspq->rsphdlr[mc].cbfn == NULL))
> break;
>
> (rspq->rsphdlr[mc].cbfn)(rspq->rsphdlr[mc].cbarg, msghdr);
Acked-by: Rasesh Mody <rmody@brocade.com>
Thanks,
Rasesh
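
A minimal sketch of the bound check the patch corrects, assuming a table of MC_MAX handlers. MC_MAX, struct handler, noop_handler() and handler_valid() are hypothetical stand-ins, not names from the bna driver. An array with MC_MAX elements has valid indices 0 .. MC_MAX - 1, so the index equal to MC_MAX has to be rejected before the array is read:

```c
#include <assert.h>
#include <stddef.h>

#define MC_MAX 4	/* hypothetical stand-in for BFI_MC_MAX */

typedef void (*handler_fn)(void *arg);

struct handler {
	handler_fn cbfn;	/* NULL means "no handler registered" */
	void *cbarg;
};

/* Do-nothing handler, used only to populate a table entry. */
static void noop_handler(void *arg)
{
	(void)arg;
}

/* Returns 1 if mc selects a registered handler, 0 otherwise. */
static int handler_valid(const struct handler *tbl, unsigned int mc)
{
	assert(tbl != NULL);

	/* "mc > MC_MAX" would let mc == MC_MAX through and read one
	 * element past the end of the table. */
	if (mc >= MC_MAX)
		return 0;

	return tbl[mc].cbfn != NULL;
}
```

With only entry 0 populated, handler_valid() accepts index 0 and rejects both the empty slots and the index MC_MAX.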
* Re: pull request: batman-adv 2011-08-24
From: David Miller @ 2011-08-24 17:35 UTC
To: lindner_marek; +Cc: netdev, b.a.t.m.a.n
In-Reply-To: <1314190838-2273-1-git-send-email-lindner_marek@yahoo.de>
From: Marek Lindner <lindner_marek@yahoo.de>
Date: Wed, 24 Aug 2011 15:00:30 +0200
> the following 8 patches constitute the first batch I'd like to get pulled
> into net-next-2.6/3.2. They bring a new feature (AP isolation on the mesh
> layer), some minor cleanups, spelling fixes and some additional debugfs
> output.
...
> git://git.open-mesh.org/linux-merge.git batman-adv/next
Pulled, thanks.
* Re: [PATCH 01/75] net: add APIs for manipulating skb page fragments.
From: Konrad Rzeszutek Wilk @ 2011-08-24 18:21 UTC
To: Ian Campbell
Cc: netdev, linux-kernel, David S. Miller, Eric Dumazet,
Michał Mirosław
In-Reply-To: <1313760467-8598-1-git-send-email-ian.campbell@citrix.com>
On Fri, Aug 19, 2011 at 02:26:33PM +0100, Ian Campbell wrote:
> The primary aim is to add skb_frag_(ref|unref) in order to remove the use of
> bare get/put_page on SKB pages fragments and to isolate users from subsequent
> changes to the skb_frag_t data structure.
>
> Also included are helper APIs for passing a paged fragment to kmap and
> dma_map_page since I was seeing the same pattern a lot. A helper for
> pci_map_page is omitted due to Michał Mirosław's recommendation that users
> should transition to pci_map_page instead.
You mean "transition to dma_map_page instead." ?
* Re: [PATCH 68/75] hv: netvsc: convert to SKB paged frag API.
From: Konrad Rzeszutek Wilk @ 2011-08-24 18:30 UTC
To: Ian Campbell
Cc: netdev, linux-kernel, Hank Janssen, Haiyang Zhang,
Greg Kroah-Hartman, K. Y. Srinivasan, Abhishek Kane, devel
In-Reply-To: <1313760467-8598-68-git-send-email-ian.campbell@citrix.com>
What is with the 'XXX' ?
> diff --git a/drivers/net/cxgb4/sge.c b/drivers/net/cxgb4/sge.c
> index f1813b5..3e7c4b3 100644
> --- a/drivers/net/cxgb4/sge.c
> +++ b/drivers/net/cxgb4/sge.c
> @@ -1416,7 +1416,7 @@ static inline void copy_frags(struct sk_buff *skb,
> unsigned int n;
>
> /* usually there's just one frag */
> - skb_frag_set_page(skb, 0, gl->frags[0].page);
> + skb_frag_set_page(skb, 0, gl->frags[0].page.p); /* XXX */
> ssi->frags[0].page_offset = gl->frags[0].page_offset + offset;
> ssi->frags[0].size = gl->frags[0].size - offset;
> ssi->nr_frags = gl->nfrags;
> @@ -1425,7 +1425,7 @@ static inline void copy_frags(struct sk_buff *skb,
> memcpy(&ssi->frags[1], &gl->frags[1], n * sizeof(skb_frag_t));
>
> /* get a reference to the last page, we don't own it */
> - get_page(gl->frags[n].page);
> + get_page(gl->frags[n].page.p); /* XXX */
> }
>
> /**
> @@ -1482,7 +1482,7 @@ static void t4_pktgl_free(const struct pkt_gl *gl)
> const skb_frag_t *p;
>
> for (p = gl->frags, n = gl->nfrags - 1; n--; p++)
> - put_page(p->page);
> + put_page(p->page.p); /* XXX */
> }
>
> /*
> @@ -1635,7 +1635,7 @@ static void restore_rx_bufs(const struct pkt_gl *si, struct sge_fl *q,
> else
> q->cidx--;
> d = &q->sdesc[q->cidx];
> - d->page = si->frags[frags].page;
> + d->page = si->frags[frags].page.p; /* XXX */
> d->dma_addr |= RX_UNMAPPED_BUF;
> q->avail++;
> }
> @@ -1717,7 +1717,7 @@ static int process_responses(struct sge_rspq *q, int budget)
> for (frags = 0, fp = si.frags; ; frags++, fp++) {
> rsd = &rxq->fl.sdesc[rxq->fl.cidx];
> bufsz = get_buf_size(rsd);
> - fp->page = rsd->page;
> + fp->page.p = rsd->page; /* XXX */
> fp->page_offset = q->offset;
> fp->size = min(bufsz, len);
> len -= fp->size;
> @@ -1734,8 +1734,8 @@ static int process_responses(struct sge_rspq *q, int budget)
> get_buf_addr(rsd),
> fp->size, DMA_FROM_DEVICE);
>
> - si.va = page_address(si.frags[0].page) +
> - si.frags[0].page_offset;
> + si.va = page_address(si.frags[0].page.p) +
> + si.frags[0].page_offset; /* XXX */
>
> prefetch(si.va);
>
> diff --git a/drivers/net/cxgb4vf/sge.c b/drivers/net/cxgb4vf/sge.c
> index 6d6060e..3688423 100644
> --- a/drivers/net/cxgb4vf/sge.c
> +++ b/drivers/net/cxgb4vf/sge.c
> @@ -1397,7 +1397,7 @@ struct sk_buff *t4vf_pktgl_to_skb(const struct pkt_gl *gl,
> skb_copy_to_linear_data(skb, gl->va, pull_len);
>
> ssi = skb_shinfo(skb);
> - skb_frag_set_page(skb, 0, gl->frags[0].page);
> + skb_frag_set_page(skb, 0, gl->frags[0].page.p); /* XXX */
> ssi->frags[0].page_offset = gl->frags[0].page_offset + pull_len;
> ssi->frags[0].size = gl->frags[0].size - pull_len;
> if (gl->nfrags > 1)
> @@ -1410,7 +1410,7 @@ struct sk_buff *t4vf_pktgl_to_skb(const struct pkt_gl *gl,
> skb->truesize += skb->data_len;
>
> /* Get a reference for the last page, we don't own it */
> - get_page(gl->frags[gl->nfrags - 1].page);
> + get_page(gl->frags[gl->nfrags - 1].page.p); /* XXX */
> }
>
> out:
> @@ -1430,7 +1430,7 @@ void t4vf_pktgl_free(const struct pkt_gl *gl)
>
> frag = gl->nfrags - 1;
> while (frag--)
> - put_page(gl->frags[frag].page);
> + put_page(gl->frags[frag].page.p); /* XXX */
> }
>
> /**
> @@ -1450,7 +1450,7 @@ static inline void copy_frags(struct sk_buff *skb,
> unsigned int n;
>
> /* usually there's just one frag */
> - skb_frag_set_page(skb, 0, gl->frags[0].page);
> + skb_frag_set_page(skb, 0, gl->frags[0].page.p); /* XXX */
> si->frags[0].page_offset = gl->frags[0].page_offset + offset;
> si->frags[0].size = gl->frags[0].size - offset;
> si->nr_frags = gl->nfrags;
> @@ -1460,7 +1460,7 @@ static inline void copy_frags(struct sk_buff *skb,
> memcpy(&si->frags[1], &gl->frags[1], n * sizeof(skb_frag_t));
>
> /* get a reference to the last page, we don't own it */
> - get_page(gl->frags[n].page);
> + get_page(gl->frags[n].page.p); /* XXX */
> }
>
> /**
> @@ -1613,7 +1613,7 @@ static void restore_rx_bufs(const struct pkt_gl *gl, struct sge_fl *fl,
> else
> fl->cidx--;
> sdesc = &fl->sdesc[fl->cidx];
> - sdesc->page = gl->frags[frags].page;
> + sdesc->page = gl->frags[frags].page.p; /* XXX */
> sdesc->dma_addr |= RX_UNMAPPED_BUF;
> fl->avail++;
> }
> @@ -1701,7 +1701,7 @@ int process_responses(struct sge_rspq *rspq, int budget)
> BUG_ON(rxq->fl.avail == 0);
> sdesc = &rxq->fl.sdesc[rxq->fl.cidx];
> bufsz = get_buf_size(sdesc);
> - fp->page = sdesc->page;
> + fp->page.p = sdesc->page; /* XXX */
> fp->page_offset = rspq->offset;
> fp->size = min(bufsz, len);
> len -= fp->size;
> @@ -1719,8 +1719,8 @@ int process_responses(struct sge_rspq *rspq, int budget)
> dma_sync_single_for_cpu(rspq->adapter->pdev_dev,
> get_buf_addr(sdesc),
> fp->size, DMA_FROM_DEVICE);
> - gl.va = (page_address(gl.frags[0].page) +
> - gl.frags[0].page_offset);
> + gl.va = (page_address(gl.frags[0].page.p) +
> + gl.frags[0].page_offset); /* XXX */
> prefetch(gl.va);
>
* [PATCH 0/5] SUNRPC: make rpcbind clients allocated and destroyed on dynamically
From: Stanislav Kinsbursky @ 2011-08-24 18:33 UTC
To: Trond.Myklebust
Cc: linux-nfs, xemul, neilb, netdev, linux-kernel, bfields, davem
This patch is required for further RPC layer virtualization, because rpcbind
clients have to be per network namespace.
To achieve this, we have to untie the network namespace from the rpcbind client sockets.
The idea of this patch set is to make rpcbind clients non-static. I.e. rpcbind
clients will be created during first RPC service creation, and destroyed when
last RPC service is stopped.
With this patch set, rpcbind clients can be virtualized easily.
The following series consists of:
---
Stanislav Kinsbursky (5):
SUNRPC: introduce helpers for reference counted rpcbind clients
SUNRPC: use reference count helpers
SUNRPC: make RPC service dependable on rpcbind clients creation
SUNRPC: remove rpcbind clients creation during service registring
SUNRPC: remove rpcbind clients destruction on module cleanup
include/linux/sunrpc/clnt.h | 2 +
net/sunrpc/rpcb_clnt.c | 86 ++++++++++++++++++++++++++++---------------
net/sunrpc/sunrpc_syms.c | 3 --
net/sunrpc/svc.c | 5 +++
4 files changed, 63 insertions(+), 33 deletions(-)
--
Signature
* [PATCH 1/5] SUNRPC: introduce helpers for reference counted rpcbind clients
From: Stanislav Kinsbursky @ 2011-08-24 18:33 UTC
To: Trond.Myklebust
Cc: linux-nfs, xemul, neilb, netdev, linux-kernel, bfields, davem
In-Reply-To: <20110824183304.4924.94670.stgit@localhost6.localdomain6>
These helpers will be used for dynamic creation and destruction of rpcbind
clients.
Variable rpcb_users is actually a counter of launched RPC services. If an rpcbind
client has been created already, then we just increase rpcb_users.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
---
net/sunrpc/rpcb_clnt.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 51 insertions(+), 0 deletions(-)
diff --git a/net/sunrpc/rpcb_clnt.c b/net/sunrpc/rpcb_clnt.c
index e45d2fb..c84e6a3 100644
--- a/net/sunrpc/rpcb_clnt.c
+++ b/net/sunrpc/rpcb_clnt.c
@@ -114,6 +114,9 @@ static struct rpc_program rpcb_program;
static struct rpc_clnt * rpcb_local_clnt;
static struct rpc_clnt * rpcb_local_clnt4;
+DEFINE_SPINLOCK(rpcb_clnt_lock);
+unsigned int rpcb_users;
+
struct rpcbind_args {
struct rpc_xprt * r_xprt;
@@ -161,6 +164,54 @@ static void rpcb_map_release(void *data)
kfree(map);
}
+static int rpcb_get_local(void)
+{
+ spin_lock(&rpcb_clnt_lock);
+ if (rpcb_users)
+ rpcb_users++;
+ spin_unlock(&rpcb_clnt_lock);
+
+ return rpcb_users;
+}
+
+void rpcb_put_local(void)
+{
+ struct rpc_clnt *clnt = rpcb_local_clnt;
+ struct rpc_clnt *clnt4 = rpcb_local_clnt4;
+ int shutdown;
+
+ spin_lock(&rpcb_clnt_lock);
+ if (--rpcb_users == 0) {
+ rpcb_local_clnt = NULL;
+ rpcb_local_clnt4 = NULL;
+ }
+ shutdown = !rpcb_users;
+ spin_unlock(&rpcb_clnt_lock);
+
+ if (shutdown) {
+ /*
+ * cleanup_rpcb_clnt - remove xprtsock's sysctls, unregister
+ */
+ if (clnt4)
+ rpc_shutdown_client(clnt4);
+ if (clnt)
+ rpc_shutdown_client(clnt);
+ }
+ return;
+}
+
+static void rpcb_set_local(struct rpc_clnt *clnt, struct rpc_clnt *clnt4)
+{
+ /* Protected by rpcb_create_local_mutex */
+ rpcb_local_clnt = clnt;
+ rpcb_local_clnt4 = clnt4;
+ rpcb_users++;
+ dprintk("RPC: created new rpcb local clients (rpcb_local_clnt: "
+ "0x%p, rpcb_local_clnt4: 0x%p)\n", rpcb_local_clnt,
+ rpcb_local_clnt4);
+
+}
+
/*
* Returns zero on success, otherwise a negative errno value
* is returned.
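
A minimal userspace sketch of the get/put counting scheme described in this patch, assuming a pthread mutex in place of the kernel spinlock. All names here (users_lock, users, get_if_present(), set_first(), put_ref()) are hypothetical and are not the SUNRPC helpers. One detail a reviewer might check: the sketch copies the counter into a local variable while the lock is still held, so the value returned to the caller cannot be changed by another thread between the unlock and the return.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t users_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned int users;	/* number of services using the resource */

/*
 * Take a reference only if the resource already exists.
 * Returns the new count, or 0 if the caller has to create the resource.
 */
static unsigned int get_if_present(void)
{
	unsigned int cnt;

	pthread_mutex_lock(&users_lock);
	if (users)
		users++;
	cnt = users;		/* snapshot taken under the lock */
	pthread_mutex_unlock(&users_lock);

	return cnt;
}

/* Record the first user once the resource has been created. */
static void set_first(void)
{
	pthread_mutex_lock(&users_lock);
	users = 1;
	pthread_mutex_unlock(&users_lock);
}

/* Drop a reference. Returns 1 if the caller must tear the resource down. */
static int put_ref(void)
{
	int last;

	pthread_mutex_lock(&users_lock);
	assert(users > 0);	/* a put without a matching get is a bug */
	last = (--users == 0);
	pthread_mutex_unlock(&users_lock);

	return last;
}
```

A caller would try get_if_present() first, create the resource and call set_first() only when it returns 0, and destroy the resource when put_ref() returns 1.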
* [PATCH 2/5] SUNRPC: use reference count helpers
From: Stanislav Kinsbursky @ 2011-08-24 18:33 UTC
To: Trond.Myklebust
Cc: linux-nfs, xemul, neilb, netdev, linux-kernel, bfields, davem
In-Reply-To: <20110824183304.4924.94670.stgit@localhost6.localdomain6>
All is simple: we just increase the users counter if rpcbind clients are present
already. Otherwise we create new rpcbind clients and set the users counter to 1.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
---
net/sunrpc/rpcb_clnt.c | 12 ++++--------
1 files changed, 4 insertions(+), 8 deletions(-)
diff --git a/net/sunrpc/rpcb_clnt.c b/net/sunrpc/rpcb_clnt.c
index c84e6a3..b4cc0f1 100644
--- a/net/sunrpc/rpcb_clnt.c
+++ b/net/sunrpc/rpcb_clnt.c
@@ -256,9 +256,7 @@ static int rpcb_create_local_unix(void)
clnt4 = NULL;
}
- /* Protected by rpcb_create_local_mutex */
- rpcb_local_clnt = clnt;
- rpcb_local_clnt4 = clnt4;
+ rpcb_set_local(clnt, clnt4);
out:
return result;
@@ -310,9 +308,7 @@ static int rpcb_create_local_net(void)
clnt4 = NULL;
}
- /* Protected by rpcb_create_local_mutex */
- rpcb_local_clnt = clnt;
- rpcb_local_clnt4 = clnt4;
+ rpcb_set_local(clnt, clnt4);
out:
return result;
@@ -327,11 +323,11 @@ static int rpcb_create_local(void)
static DEFINE_MUTEX(rpcb_create_local_mutex);
int result = 0;
- if (rpcb_local_clnt)
+ if (rpcb_get_local())
return result;
mutex_lock(&rpcb_create_local_mutex);
- if (rpcb_local_clnt)
+ if (rpcb_get_local())
goto out;
if (rpcb_create_local_unix() != 0)
* [PATCH 3/5] SUNRPC: make RPC service dependable on rpcbind clients creation
From: Stanislav Kinsbursky @ 2011-08-24 18:33 UTC
To: Trond.Myklebust
Cc: linux-nfs, xemul, neilb, netdev, linux-kernel, bfields, davem
In-Reply-To: <20110824183304.4924.94670.stgit@localhost6.localdomain6>
We create or increase the users counter of rpcbind clients during RPC service
creation and decrease this counter (and possibly destroy those clients) on RPC
service destruction.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
---
include/linux/sunrpc/clnt.h | 2 ++
net/sunrpc/rpcb_clnt.c | 2 +-
net/sunrpc/svc.c | 5 +++++
3 files changed, 8 insertions(+), 1 deletions(-)
diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
index db7bcaf..65a8115 100644
--- a/include/linux/sunrpc/clnt.h
+++ b/include/linux/sunrpc/clnt.h
@@ -135,10 +135,12 @@ void rpc_shutdown_client(struct rpc_clnt *);
void rpc_release_client(struct rpc_clnt *);
void rpc_task_release_client(struct rpc_task *);
+int rpcb_create_local(void);
int rpcb_register(u32, u32, int, unsigned short);
int rpcb_v4_register(const u32 program, const u32 version,
const struct sockaddr *address,
const char *netid);
+void rpcb_put_local(void);
void rpcb_getport_async(struct rpc_task *);
void rpc_call_start(struct rpc_task *);
diff --git a/net/sunrpc/rpcb_clnt.c b/net/sunrpc/rpcb_clnt.c
index b4cc0f1..437ec60 100644
--- a/net/sunrpc/rpcb_clnt.c
+++ b/net/sunrpc/rpcb_clnt.c
@@ -318,7 +318,7 @@ out:
* Returns zero on success, otherwise a negative errno value
* is returned.
*/
-static int rpcb_create_local(void)
+int rpcb_create_local(void)
{
static DEFINE_MUTEX(rpcb_create_local_mutex);
int result = 0;
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 6a69a11..0df8532 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -367,6 +367,9 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
unsigned int xdrsize;
unsigned int i;
+ if (rpcb_create_local() < 0)
+ return NULL;
+
if (!(serv = kzalloc(sizeof(*serv), GFP_KERNEL)))
return NULL;
serv->sv_name = prog->pg_name;
@@ -491,6 +494,8 @@ svc_destroy(struct svc_serv *serv)
svc_unregister(serv);
kfree(serv->sv_pools);
kfree(serv);
+
+ rpcb_put_local();
}
EXPORT_SYMBOL_GPL(svc_destroy);
* [PATCH 4/5] SUNRPC: remove rpcbind clients creation during service registring
From: Stanislav Kinsbursky @ 2011-08-24 18:34 UTC
To: Trond.Myklebust
Cc: linux-nfs, xemul, neilb, netdev, linux-kernel, bfields, davem
In-Reply-To: <20110824183304.4924.94670.stgit@localhost6.localdomain6>
We don't need this code since rpcbind clients are created during RPC service
creation.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
---
net/sunrpc/rpcb_clnt.c | 9 ---------
1 files changed, 0 insertions(+), 9 deletions(-)
diff --git a/net/sunrpc/rpcb_clnt.c b/net/sunrpc/rpcb_clnt.c
index 437ec60..f363efe 100644
--- a/net/sunrpc/rpcb_clnt.c
+++ b/net/sunrpc/rpcb_clnt.c
@@ -429,11 +429,6 @@ int rpcb_register(u32 prog, u32 vers, int prot, unsigned short port)
struct rpc_message msg = {
.rpc_argp = &map,
};
- int error;
-
- error = rpcb_create_local();
- if (error)
- return error;
dprintk("RPC: %sregistering (%u, %u, %d, %u) with local "
"rpcbind\n", (port ? "" : "un"),
@@ -569,11 +564,7 @@ int rpcb_v4_register(const u32 program, const u32 version,
struct rpc_message msg = {
.rpc_argp = &map,
};
- int error;
- error = rpcb_create_local();
- if (error)
- return error;
if (rpcb_local_clnt4 == NULL)
return -EPROTONOSUPPORT;
* [PATCH 5/5] SUNRPC: remove rpcbind clients destruction on module cleanup
From: Stanislav Kinsbursky @ 2011-08-24 18:34 UTC
To: Trond.Myklebust
Cc: linux-nfs, xemul, neilb, netdev, linux-kernel, bfields, davem
In-Reply-To: <20110824183304.4924.94670.stgit@localhost6.localdomain6>
We don't need this anymore since rpcbind clients are now destroyed during the
last RPC service shutdown.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
---
net/sunrpc/rpcb_clnt.c | 12 ------------
net/sunrpc/sunrpc_syms.c | 3 ---
2 files changed, 0 insertions(+), 15 deletions(-)
diff --git a/net/sunrpc/rpcb_clnt.c b/net/sunrpc/rpcb_clnt.c
index f363efe..94a310d 100644
--- a/net/sunrpc/rpcb_clnt.c
+++ b/net/sunrpc/rpcb_clnt.c
@@ -1098,15 +1098,3 @@ static struct rpc_program rpcb_program = {
.version = rpcb_version,
.stats = &rpcb_stats,
};
-
-/**
- * cleanup_rpcb_clnt - remove xprtsock's sysctls, unregister
- *
- */
-void cleanup_rpcb_clnt(void)
-{
- if (rpcb_local_clnt4)
- rpc_shutdown_client(rpcb_local_clnt4);
- if (rpcb_local_clnt)
- rpc_shutdown_client(rpcb_local_clnt);
-}
diff --git a/net/sunrpc/sunrpc_syms.c b/net/sunrpc/sunrpc_syms.c
index 9d08091..8ec9778 100644
--- a/net/sunrpc/sunrpc_syms.c
+++ b/net/sunrpc/sunrpc_syms.c
@@ -61,8 +61,6 @@ static struct pernet_operations sunrpc_net_ops = {
extern struct cache_detail unix_gid_cache;
-extern void cleanup_rpcb_clnt(void);
-
static int __init
init_sunrpc(void)
{
@@ -102,7 +100,6 @@ out:
static void __exit
cleanup_sunrpc(void)
{
- cleanup_rpcb_clnt();
rpcauth_remove_module();
cleanup_socket_xprt();
svc_cleanup_xprt_sock();
* Re: [BUG] tcp : how many times a frame can possibly be retransmitted ?
From: Alexander Zimmermann @ 2011-08-24 19:03 UTC
To: Eric Dumazet; +Cc: netdev, Jerry Chu, Lukowski Damian, Hannemann Arnd
In-Reply-To: <1314202918.2296.39.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>
Hi Eric,
Am 24.08.2011 um 18:21 schrieb Eric Dumazet:
> On one dev machine running net-next, I just found strange tcp sessions
> that retransmit a frame forever (The other peer disappeared)
not forever...
If I remember correctly, you will stop after 120s.
>
> # ss -emoi dst 10.2.1.1
> State Recv-Q Send-Q Local Address:Port Peer Address:Port
> ESTAB 0 816 10.2.1.2:37930 10.2.1.1:ssh timer:(on,630ms,246) ino:60786 sk:ffff8801189aa400
> mem:(r0,w3776,f320,t0) ts sack ecn cubic wscale:8,6 rto:1680 rtt:16.25/7.5 ato:40 ssthresh:7 send 1.4Mbps rcv_rtt:10 rcv_space:16632
>
>
> You can see the retransmit count : 246
>
> What possibly can be going on ?
>
> What happened to backoff ?
>
> # grep . /proc/sys/net/ipv4/tcp_retries*
> /proc/sys/net/ipv4/tcp_retries1:3
> /proc/sys/net/ipv4/tcp_retries2:15
>
>
>
> extract of tcpdump :
>
> 12:01:02.074244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128024 59389>
> 12:01:03.754243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128192 59389>
> 12:01:05.434245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128360 59389>
> 12:01:07.114243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128528 59389>
> 12:01:08.794248 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128696 59389>
> 12:01:10.474242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128864 59389>
> 12:01:12.154243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129032 59389>
> 12:01:13.834241 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129200 59389>
> 12:01:15.514246 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129368 59389>
> 12:01:17.194244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129536 59389>
> 12:01:18.874248 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129704 59389>
> 12:01:20.554243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129872 59389>
> 12:01:22.234244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130040 59389>
> 12:01:23.914244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130208 59389>
> 12:01:25.594247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130376 59389>
> 12:01:27.274242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130544 59389>
> 12:01:28.954242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130712 59389>
> 12:01:30.634248 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130880 59389>
> 12:01:32.314245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131048 59389>
> 12:01:33.994243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131216 59389>
> 12:01:35.674250 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131384 59389>
> 12:01:37.354244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131552 59389>
> 12:01:39.034245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131720 59389>
> 12:01:40.714245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131888 59389>
> 12:01:42.394245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132056 59389>
> 12:01:44.074242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132224 59389>
> 12:01:45.754249 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132392 59389>
> 12:01:47.434242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132560 59389>
> 12:01:49.114247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132728 59389>
> 12:01:50.794250 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132896 59389>
> 12:01:52.474247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133064 59389>
> 12:01:54.154242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133232 59389>
> 12:01:55.834246 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133400 59389>
> 12:01:57.514243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133568 59389>
> 12:01:59.194247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133736 59389>
> 12:02:00.874250 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133904 59389>
> 12:02:02.554242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134072 59389>
> 12:02:04.234243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134240 59389>
> 12:02:05.914245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134408 59389>
> 12:02:07.594244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134576 59389>
> 12:02:09.274249 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134744 59389>
> 12:02:10.954241 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134912 59389>
> 12:02:12.634249 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16135080 59389>
>
> tcp_retransmit_timer() does the exponential backoff, but something
> resets icsk_rto to a low value ?
>
> Ah, it seems to be because of commit f1ecd5d9e7366609
> (Revert Backoff [v3]: Revert RTO on ICMP destination unreachable)
>
> Since arp resolution (or routing, I dont know yet) fails, an
> internal/loopback ICMP host/network unreachable message is
> generated and handled in tcp_v4_err() :
Yeah, you have a local connectivity disruption. This is one
possible scenario.
>
> icsk_backoff-- and icsk_rto is reset.
>
> I am afraid this can generate a storm (cpu time at very least),
> in case we have many tcp sessions in this state.
Hmm, maybe. I don't know. Arnd or Damian, what do you think about this point?
>
> I guess its time for me to read RFC 6069
If you find a bug, let me know.
Alex
//
// Dipl.-Inform. Alexander Zimmermann
// Department of Computer Science, Informatik 4
// RWTH Aachen University
// Ahornstr. 55, 52056 Aachen, Germany
// phone: (49-241) 80-21422, fax: (49-241) 80-22222
// email: zimmermann@cs.rwth-aachen.de
// web: http://www.umic-mesh.net
//
* Re: [BUG] tcp : how many times a frame can possibly be retransmitted ?
From: Eric Dumazet @ 2011-08-24 19:45 UTC
To: Alexander Zimmermann; +Cc: netdev, Jerry Chu, Lukowski Damian, Hannemann Arnd
In-Reply-To: <3482698A-C35B-4BED-AEEF-EBA135991705@comsys.rwth-aachen.de>
Le mercredi 24 août 2011 à 21:03 +0200, Alexander Zimmermann a écrit :
> Hi Eric,
>
> Am 24.08.2011 um 18:21 schrieb Eric Dumazet:
>
> > On one dev machine running net-next, I just found strange tcp sessions
> > that retransmit a frame forever (The other peer disappeared)
>
> not forever...
> If I remember correctly, you will stop after 120s.
>
Hi Alexander
I just tried one session again, and got a much longer delay than that.
It stops because of a side effect: "icsk_retransmits" is an 8-bit
field.
Every 256 retransmits, it becomes 255+1 -> 0
retransmits_timed_out() immediately returns false.
And backoff increases at this time.
Eventually, we retransmit 256*15 times, process 256*15 ICMP messages.
Thanks
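
A minimal sketch of the arithmetic above, assuming an unsigned 8-bit counter and one counter wrap per backoff step. bump_retransmits() and total_retransmits() are hypothetical helpers, not kernel functions:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for incrementing an 8-bit retransmit counter. */
static uint8_t bump_retransmits(uint8_t count)
{
	/* The sum is computed as int and truncated back to 8 bits,
	 * so 255 + 1 wraps to 0. */
	return (uint8_t)(count + 1);
}

/* Retransmissions sent if the counter has to wrap once per backoff step. */
static unsigned int total_retransmits(unsigned int backoff_steps)
{
	assert(backoff_steps <= 1000);	/* keep the product far from overflow */
	return 256u * backoff_steps;
}
```

Under these assumptions, bump_retransmits(255) yields 0 and total_retransmits(15) yields 3840, which is the 256*15 figure quoted above.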
* Re: [BUG] tcp : how many times a frame can possibly be retransmitted ?
From: Jerry Chu @ 2011-08-24 19:39 UTC
To: Alexander Zimmermann
Cc: Eric Dumazet, netdev, Lukowski Damian, Hannemann Arnd
In-Reply-To: <3482698A-C35B-4BED-AEEF-EBA135991705@comsys.rwth-aachen.de>
Hi Alexander,
On Wed, Aug 24, 2011 at 12:03 PM, Alexander Zimmermann
<alexander.zimmermann@comsys.rwth-aachen.de> wrote:
> Hi Eric,
>
> Am 24.08.2011 um 18:21 schrieb Eric Dumazet:
>
>> On one dev machine running net-next, I just found strange tcp sessions
>> that retransmit a frame forever (The other peer disappeared)
>
> not forever...
> If I remember correctly, you will stop after 120s.
Yup. It looks like this "feature" was introduced in the patch
"Revert Backoff [v3]: Calculate TCP's connection close threshold as a
time value"
by Damian as well, to bound the abort timeout by a time duration rather
than by the number of retries (icsk_retransmits). But as pointed out, if
rto is small, it could mean a lot of retransmissions before one gives up.
Jerry
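
A rough sketch of why a time-based limit allows many more retransmissions when the RTO does not grow. retransmits_within() is a hypothetical helper, and the 120 s budget is only the figure recalled earlier in the thread, used here as an assumption rather than as the kernel's actual threshold:

```c
#include <assert.h>

/*
 * Count the retransmissions that fit into budget_ms when the first
 * interval is rto_ms and each later interval is multiplied by factor
 * (factor 1 models an RTO that keeps being reset, factor 2 models
 * ordinary exponential backoff).
 */
static unsigned int retransmits_within(unsigned long budget_ms,
				       unsigned long rto_ms,
				       unsigned int factor)
{
	unsigned long elapsed = 0;
	unsigned int n = 0;

	assert(factor >= 1);
	if (rto_ms == 0)
		return 0;	/* avoid looping forever on a zero interval */

	while (elapsed + rto_ms <= budget_ms) {
		elapsed += rto_ms;
		rto_ms *= factor;
		n++;
	}
	return n;
}
```

With the 1680 ms RTO from the ss output and an assumed 120000 ms budget, a constant interval allows 71 retransmissions, while doubling the interval each time allows only 6.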
>
>>
>> # ss -emoi dst 10.2.1.1
>> State Recv-Q Send-Q Local Address:Port Peer Address:Port
>> ESTAB 0 816 10.2.1.2:37930 10.2.1.1:ssh timer:(on,630ms,246) ino:60786 sk:ffff8801189aa400
>> mem:(r0,w3776,f320,t0) ts sack ecn cubic wscale:8,6 rto:1680 rtt:16.25/7.5 ato:40 ssthresh:7 send 1.4Mbps rcv_rtt:10 rcv_space:16632
>>
>>
>> You can see the retransmit count : 246
>>
>> What possibly can be going on ?
>>
>> What happened to backoff ?
>>
>> # grep . /proc/sys/net/ipv4/tcp_retries*
>> /proc/sys/net/ipv4/tcp_retries1:3
>> /proc/sys/net/ipv4/tcp_retries2:15
>>
>>
>>
>> extract of tcpdump :
>>
>> 12:01:02.074244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128024 59389>
>> 12:01:03.754243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128192 59389>
>> 12:01:05.434245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128360 59389>
>> 12:01:07.114243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128528 59389>
>> 12:01:08.794248 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128696 59389>
>> 12:01:10.474242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128864 59389>
>> 12:01:12.154243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129032 59389>
>> 12:01:13.834241 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129200 59389>
>> 12:01:15.514246 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129368 59389>
>> 12:01:17.194244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129536 59389>
>> 12:01:18.874248 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129704 59389>
>> 12:01:20.554243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129872 59389>
>> 12:01:22.234244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130040 59389>
>> 12:01:23.914244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130208 59389>
>> 12:01:25.594247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130376 59389>
>> 12:01:27.274242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130544 59389>
>> 12:01:28.954242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130712 59389>
>> 12:01:30.634248 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130880 59389>
>> 12:01:32.314245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131048 59389>
>> 12:01:33.994243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131216 59389>
>> 12:01:35.674250 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131384 59389>
>> 12:01:37.354244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131552 59389>
>> 12:01:39.034245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131720 59389>
>> 12:01:40.714245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131888 59389>
>> 12:01:42.394245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132056 59389>
>> 12:01:44.074242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132224 59389>
>> 12:01:45.754249 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132392 59389>
>> 12:01:47.434242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132560 59389>
>> 12:01:49.114247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132728 59389>
>> 12:01:50.794250 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132896 59389>
>> 12:01:52.474247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133064 59389>
>> 12:01:54.154242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133232 59389>
>> 12:01:55.834246 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133400 59389>
>> 12:01:57.514243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133568 59389>
>> 12:01:59.194247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133736 59389>
>> 12:02:00.874250 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133904 59389>
>> 12:02:02.554242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134072 59389>
>> 12:02:04.234243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134240 59389>
>> 12:02:05.914245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134408 59389>
>> 12:02:07.594244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134576 59389>
>> 12:02:09.274249 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134744 59389>
>> 12:02:10.954241 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134912 59389>
>> 12:02:12.634249 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16135080 59389>
>>
>> tcp_retransmit_timer() does the exponential backoff, but something
>> resets icsk_rto to a low value ?
>>
>> Ah, it seems to be because of commit f1ecd5d9e7366609
>> (Revert Backoff [v3]: Revert RTO on ICMP destination unreachable)
>>
>> Since ARP resolution (or routing, I don't know yet) fails, an
>> internal/loopback ICMP host/network unreachable message is
>> generated and handled in tcp_v4_err() :
>
> Yeah, you have a local connectivity disruption. This is one
> possible scenario.
>
>>
>> icsk_backoff-- and icsk_rto is reset.
>>
>> I am afraid this can generate a storm (cpu time at very least),
>> in case we have many tcp sessions in this state.
>
> Hmm, maybe. I don't know. Arnd or Damian, what do you think about this point?
>
>>
>> I guess it's time for me to read RFC 6069
>
> If you find a bug, let me know.
>
> Alex
>
>
> //
> // Dipl.-Inform. Alexander Zimmermann
> // Department of Computer Science, Informatik 4
> // RWTH Aachen University
> // Ahornstr. 55, 52056 Aachen, Germany
> // phone: (49-241) 80-21422, fax: (49-241) 80-22222
> // email: zimmermann@cs.rwth-aachen.de
> // web: http://www.umic-mesh.net
> //
>
>
^ permalink raw reply
* Re: [Bugme-new] [Bug 40572] New: Intel Gigabit Ethernet 82576 50% packet loss after reboot
From: Alexander Duyck @ 2011-08-24 20:25 UTC (permalink / raw)
To: Andrew Morton; +Cc: netdev, e1000-devel, bugme-daemon, vojcik
In-Reply-To: <20110823143053.832c1aaa.akpm@linux-foundation.org>
On 08/23/2011 02:30 PM, Andrew Morton wrote:
> (switched to email. Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
>
> On Fri, 5 Aug 2011 07:07:05 GMT
> bugzilla-daemon@bugzilla.kernel.org wrote:
>
>> https://bugzilla.kernel.org/show_bug.cgi?id=40572
>>
>> Summary: Intel Gigabit Ethernet 82576 50% packet loss after
>> reboot
>> Product: Drivers
>> Version: 2.5
>> Kernel Version: 3.0
>> Platform: All
>> OS/Version: Linux
>> Tree: Mainline
>> Status: NEW
>> Severity: blocking
>> Priority: P1
>> Component: Network
>> AssignedTo: drivers_network@kernel-bugs.osdl.org
>> ReportedBy: vojcik@gmail.com
>> Regression: No
> I'll change this to "yes".
>
>> Hi,
>>
>> I have a strange problem with an Intel dual-port Gigabit Ethernet card.
>> Problem appears after 3rd - 5th reboot.
>>
>> If you ping or make any network traffic you get 50% packet loss. No error
>> messages in logs.
>> After another reboot, all is OK for the next few reboots.
>>
>> We have eliminated network problems like switches, cables etc. It's software
>> related.
>>
>> It looks like in kernel 2.6.37 we have the same problem but in 2.6.28.6
>> everything looks fine.
>>
>> I attach some files for additional information
This type of issue is typically a sign of a hardware problem. I would
recommend doing an lspci -vvv for the device in both the working and the
non-working cases to see if there is any difference between the two.
One thing we have seen in the past is an issue where the PCIe will not
link at x4 in all cases and will sometimes link at only x1. When this
occurs the device does not have enough PCIe bandwidth to handle heavy
workloads. You might want to try either reseating the network adapter
into the slot or moving it from one PCIe slot to another in the system
as it is possible the PCIe slot it is in may have an issue with one or
more of the PCIe lanes.
Thanks,
Alex
^ permalink raw reply
* [PATCH net-next] rps: support IPIP encapsulation
From: Eric Dumazet @ 2011-08-24 20:41 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Tom Herbert
Skip IPIP header to get proper layer-4 information.
Like GRE tunnels, this only works if rxhash is not already provided by
the device itself (ethtool -K ethX rxhash off), to allow the kernel to
compute a software rxhash.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
A piece of cake ;)
net/core/dev.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/net/core/dev.c b/net/core/dev.c
index a4306f7..b668a3d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2608,6 +2608,8 @@ again:
}
}
break;
+ case IPPROTO_IPIP:
+ goto again;
default:
break;
}
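For readers following along outside the kernel, the effect of the extra case can be sketched in user space. This is only an illustration under stated assumptions, not the code in net/core/dev.c: the function name find_l4, the PROTO_* constants and the nesting bound of 4 are made up for the example, and only IPv4-in-IPv4 is handled. The idea is the same as the "goto again" above: when the outer header says IPIP, step over it and parse the next IPv4 header until a real layer-4 protocol is found.

```c
#include <stddef.h>
#include <stdint.h>

#define PROTO_IPIP 4   /* same value as IPPROTO_IPIP */
#define PROTO_TCP  6
#define PROTO_UDP  17

/*
 * Walk nested IPv4 headers starting at pkt[0].
 * Returns the first non-IPIP protocol number and stores the offset of
 * its header in *l4_off, or returns -1 if the packet is truncated,
 * is not IPv4, or is nested deeper than this sketch allows.
 */
static int find_l4(const uint8_t *pkt, size_t len, size_t *l4_off)
{
    size_t off = 0;
    int depth;

    for (depth = 0; depth < 4; depth++) {      /* hypothetical nesting bound */
        size_t ihl;
        uint8_t proto;

        if (len - off < 20)                    /* minimum IPv4 header */
            return -1;
        if ((pkt[off] >> 4) != 4)              /* version field */
            return -1;
        ihl = (size_t)(pkt[off] & 0x0f) * 4;   /* header length in bytes */
        if (ihl < 20 || len - off < ihl)
            return -1;

        proto = pkt[off + 9];                  /* protocol field */
        off += ihl;
        if (proto != PROTO_IPIP) {
            *l4_off = off;
            return proto;
        }
        /* IPIP: the payload is another IPv4 header, so parse again */
    }
    return -1;
}
```

With a 40-byte buffer holding an outer header (protocol 4) followed by an inner header (protocol 6), the sketch reports TCP at offset 40, which is where a software hash would then read the ports.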
^ permalink raw reply related
* Re: [PATCH 01/75] net: add APIs for manipulating skb page fragments.
From: Ian Campbell @ 2011-08-24 21:09 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
David S. Miller, Eric Dumazet, Michał Mirosław
In-Reply-To: <20110824182113.GF15675@dumpdata.com>
On Wed, 2011-08-24 at 19:21 +0100, Konrad Rzeszutek Wilk wrote:
> On Fri, Aug 19, 2011 at 02:26:33PM +0100, Ian Campbell wrote:
> > The primary aim is to add skb_frag_(ref|unref) in order to remove the use of
> > bare get/put_page on SKB pages fragments and to isolate users from subsequent
> > changes to the skb_frag_t data structure.
> >
> > Also included are helper APIs for passing a paged fragment to kmap and
> > dma_map_page since I was seeing the same pattern a lot. A helper for
> > pci_map_page is omitted due to Michał Mirosław's recommendation that users
> > should transition to pci_map_page instead.
>
> You mean "transition to dma_map_page instead." ?
That's right, oops.
Ian.
^ permalink raw reply
* Re: [PATCH 68/75] hv: netvsc: convert to SKB paged frag API.
From: Ian Campbell @ 2011-08-24 21:10 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk
Cc: netdev, linux-kernel, Hank Janssen, Haiyang Zhang,
Greg Kroah-Hartman, K. Y. Srinivasan, Abhishek Kane, devel
In-Reply-To: <20110824183010.GG15675@dumpdata.com>
On Wed, 2011-08-24 at 14:30 -0400, Konrad Rzeszutek Wilk wrote:
> What is with the 'XXX' ?
Those bits were supposed to be the "net: add support for
per-paged-fragment destructors" patch which I accidentally squashed into
the hv driver.
As I said to Dan they are a reminder referring to the dilemma I
mentioned in the intro mail. Basically that ".p" is ugly because
gl->frags[0].page isn't actually a paged fragment, it's just that this
driver uses skb_frag_t in its internal datastructures.
Ian.
^ permalink raw reply
* Re: [PATCH 01/75] net: add APIs for manipulating skb page fragments.
From: Konrad Rzeszutek Wilk @ 2011-08-24 21:15 UTC (permalink / raw)
To: Ian Campbell
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
David S. Miller, Eric Dumazet, Michał Mirosław
In-Reply-To: <1314220141.17978.693.camel@dagon.hellion.org.uk>
On Wed, Aug 24, 2011 at 10:09:01PM +0100, Ian Campbell wrote:
> On Wed, 2011-08-24 at 19:21 +0100, Konrad Rzeszutek Wilk wrote:
> > On Fri, Aug 19, 2011 at 02:26:33PM +0100, Ian Campbell wrote:
> > > The primary aim is to add skb_frag_(ref|unref) in order to remove the use of
> > > bare get/put_page on SKB pages fragments and to isolate users from subsequent
> > > changes to the skb_frag_t data structure.
> > >
> > > Also included are helper APIs for passing a paged fragment to kmap and
> > > dma_map_page since I was seeing the same pattern a lot. A helper for
> > > pci_map_page is omitted due to Michał Mirosław's recommendation that users
> > > should transition to pci_map_page instead.
> >
> > You mean "transition to dma_map_page instead." ?
>
> That's right, oops.
With a big pot of tea next to me to keep me sharp I took a look at all
75 patches, and besides this comment and the 'XXX' (which I already emailed
you about; it also shows up in "36/75] myri10ge: convert to SKB paged frag API.")
they all looked OK to me.
So you can stick 'Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>'
on all of them if you would like.
^ permalink raw reply
* Re: [BUG] tcp : how many times a frame can possibly be retransmitted ?
From: Ilpo Järvinen @ 2011-08-24 22:44 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, Jerry Chu, Damian Lukowski
In-Reply-To: <1314202918.2296.39.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>
On Wed, 24 Aug 2011, Eric Dumazet wrote:
> On one dev machine running net-next, I just found strange tcp sessions
> that retransmit a frame forever (The other peer disappeared)
>
> # ss -emoi dst 10.2.1.1
> State Recv-Q Send-Q Local Address:Port Peer Address:Port
> ESTAB 0 816 10.2.1.2:37930 10.2.1.1:ssh timer:(on,630ms,246) ino:60786 sk:ffff8801189aa400
> mem:(r0,w3776,f320,t0) ts sack ecn cubic wscale:8,6 rto:1680 rtt:16.25/7.5 ato:40 ssthresh:7 send 1.4Mbps rcv_rtt:10 rcv_space:16632
>
>
> You can see the retransmit count : 246
>
> What possibly can be going on ?
>
> What happened to backoff ?
>
> # grep . /proc/sys/net/ipv4/tcp_retries*
> /proc/sys/net/ipv4/tcp_retries1:3
> /proc/sys/net/ipv4/tcp_retries2:15
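As a rough cross-check of what tcp_retries2 = 15 would normally imply, here is a small user-space model. It assumes plain doubling of the RTO on every timeout, a 200 ms starting RTO and a 120 s cap (the usual Linux minimum and maximum, as far as I know); the function name is made up for the example and this is not the kernel's own computation.

```c
#include <stdint.h>

/*
 * Sum the waiting time of the first transmission plus `retries`
 * retransmissions, doubling the RTO each time and capping it at max_ms.
 */
static uint64_t model_timeout_ms(uint32_t base_ms, uint32_t max_ms,
                                 unsigned retries)
{
    uint64_t total = 0;
    unsigned i;

    for (i = 0; i <= retries; i++) {
        /* clamp the shift so it stays well defined for large counts */
        uint64_t rto = (uint64_t)base_ms << (i < 32 ? i : 32);

        total += rto > max_ms ? max_ms : rto;
    }
    return total;
}
```

Under these assumptions model_timeout_ms(200, 120000, 15) comes to 924600 ms, roughly 15 minutes, so a session with a retransmit count of 246 is far outside what undisturbed backoff would allow.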
>
>
>
> extract of tcpdump :
>
> 12:01:02.074244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128024 59389>
> 12:01:03.754243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128192 59389>
> 12:01:05.434245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128360 59389>
> 12:01:07.114243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128528 59389>
> 12:01:08.794248 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128696 59389>
> 12:01:10.474242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128864 59389>
> 12:01:12.154243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129032 59389>
> 12:01:13.834241 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129200 59389>
> 12:01:15.514246 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129368 59389>
> 12:01:17.194244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129536 59389>
> 12:01:18.874248 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129704 59389>
> 12:01:20.554243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129872 59389>
> 12:01:22.234244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130040 59389>
> 12:01:23.914244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130208 59389>
> 12:01:25.594247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130376 59389>
> 12:01:27.274242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130544 59389>
> 12:01:28.954242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130712 59389>
> 12:01:30.634248 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130880 59389>
> 12:01:32.314245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131048 59389>
> 12:01:33.994243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131216 59389>
> 12:01:35.674250 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131384 59389>
> 12:01:37.354244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131552 59389>
> 12:01:39.034245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131720 59389>
> 12:01:40.714245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131888 59389>
> 12:01:42.394245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132056 59389>
> 12:01:44.074242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132224 59389>
> 12:01:45.754249 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132392 59389>
> 12:01:47.434242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132560 59389>
> 12:01:49.114247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132728 59389>
> 12:01:50.794250 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132896 59389>
> 12:01:52.474247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133064 59389>
> 12:01:54.154242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133232 59389>
> 12:01:55.834246 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133400 59389>
> 12:01:57.514243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133568 59389>
> 12:01:59.194247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133736 59389>
> 12:02:00.874250 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133904 59389>
> 12:02:02.554242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134072 59389>
> 12:02:04.234243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134240 59389>
> 12:02:05.914245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134408 59389>
> 12:02:07.594244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134576 59389>
> 12:02:09.274249 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134744 59389>
> 12:02:10.954241 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134912 59389>
> 12:02:12.634249 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16135080 59389>
>
> tcp_retransmit_timer() does the exponential backoff, but something
> resets icsk_rto to a low value ?
>
> Ah, it seems to be because of commit f1ecd5d9e7366609
> (Revert Backoff [v3]: Revert RTO on ICMP destination unreachable)
>
> Since ARP resolution (or routing, I don't know yet) fails, an
> internal/loopback ICMP host/network unreachable message is
> generated and handled in tcp_v4_err() :
>
> icsk_backoff-- and icsk_rto is reset.
>
> I am afraid this can generate a storm (cpu time at very least),
> in case we have many tcp sessions in this state.
But RTO (even without any backoffs) should be lower bounded to some not so
zeroish value?
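To make the interaction concrete, here is a toy user-space model of the behaviour described above. It is a sketch under assumptions, not kernel code: the struct and function names are invented, the RTO is taken as base << backoff capped at 120 s, a timer expiry does backoff++ and an ICMP unreachable does backoff--, and all of the timer re-arming details in tcp_v4_err() are left out.

```c
#include <stdint.h>

#define MODEL_RTO_MAX_MS 120000u   /* assumed 120 s cap */

struct rto_model {
    uint32_t base_rto_ms;   /* RTO before any backoff is applied */
    unsigned backoff;       /* plays the role of icsk_backoff */
};

/* Effective RTO: base shifted by the backoff count, capped at the maximum. */
static uint32_t model_rto(const struct rto_model *m)
{
    uint64_t rto = (uint64_t)m->base_rto_ms << m->backoff;

    return rto > MODEL_RTO_MAX_MS ? MODEL_RTO_MAX_MS : (uint32_t)rto;
}

/* Retransmit timer fired: back off one more step. */
static void model_timer_expired(struct rto_model *m)
{
    if (m->backoff < 31)
        m->backoff++;
}

/* ICMP unreachable seen for the retransmitted segment: revert one step. */
static void model_icmp_unreachable(struct rto_model *m)
{
    if (m->backoff > 0)
        m->backoff--;
}

/* Run `rounds` timeouts, optionally answering each with an ICMP error. */
static uint32_t model_run(uint32_t base_ms, unsigned rounds, int icmp_each_round)
{
    struct rto_model m = { base_ms, 0 };
    unsigned i;

    for (i = 0; i < rounds; i++) {
        model_timer_expired(&m);
        if (icmp_each_round)
            model_icmp_unreachable(&m);
    }
    return model_rto(&m);
}
```

In this model, if every retransmission is answered by an unreachable message, the RTO stays at the base value (1680 ms with the numbers from the ss output) no matter how many rounds pass, which matches the fixed 1.68 s spacing in the tcpdump. It also never drops below the base RTO, so the per-connection rate is bounded by that value.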
> I guess it's time for me to read RFC 6069
--
i.
^ permalink raw reply
* [RFC] per-containers tcp buffer limitation
From: Glauber Costa @ 2011-08-24 22:54 UTC (permalink / raw)
To: netdev-u79uwXL29TY76Z2rM5mHXA
Cc: Linux Containers, Pavel Emelyanov, David Miller,
ebiederm-aS9lmoZGLiVWk0Htik3J/w
[-- Attachment #1: Type: text/plain, Size: 1832 bytes --]
Hello,
This is a proof of concept of some code I have here to limit tcp send
and receive buffers per-container (in our case). At this phase, I am
more concerned with discussing my approach, so please curse my family no
further than the 3rd generation.
The problem we're trying to attack here is that buffers can grow and
fill non-reclaimable kernel memory. When doing containers, we can't
afford having a malicious container pinning kernel memory at will,
therefore exhausting all the others.
So here a container will be seen in the host system as a group of tasks,
grouped in a cgroup. This cgroup will have files allowing us to specify
global per-cgroup limits on buffers. For that purpose, I created a new
sockets cgroup - didn't really think any other one of the existing would
do here.
As for the network code per-se, I tried to keep the same code that deals
with memory schedule as a basis and make it per-cgroup.
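The per-cgroup accounting in the attached patch walks from the socket's group up through ->parent. A stripped-down user-space sketch of that kind of hierarchical charge is below. The struct and function names are hypothetical and a single limit stands in for the three-entry mem array. Note one deliberate difference: the posted patch appears to keep only the verdict of the socket's own group, while this sketch rejects the charge when any ancestor is over its limit.

```c
#include <stddef.h>

struct grp {
    struct grp *parent;   /* NULL for the root group */
    long limit;           /* hard limit, in pages */
    long allocated;       /* pages currently charged */
};

/*
 * Charge `pages` to the leaf group and all of its ancestors.
 * Returns 1 on success. If any level would exceed its limit, the
 * charge is rolled back everywhere and 0 is returned.
 */
static int hier_charge(struct grp *leaf, long pages)
{
    struct grp *g;
    int ok = 1;

    for (g = leaf; g != NULL; g = g->parent) {
        g->allocated += pages;
        if (g->allocated > g->limit)
            ok = 0;
    }

    if (!ok) {
        /* undo the charge on every level, as the patch does on failure */
        for (g = leaf; g != NULL; g = g->parent)
            g->allocated -= pages;
    }
    return ok;
}
```

With a root limited to 10 pages and a child limited to 100, a charge of 8 to the child succeeds and a further charge of 5 is refused because the root would go over, leaving both counters at 8.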
You will notice that struct proto now takes function pointers to values
controlling memory pressure and will return per-cgroup data instead of
global ones. So the current behavior is maintained: after the first
threshold is hit, we enter memory pressure. After that, allocations are
suppressed.
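The threshold behaviour can be sketched in a few lines of user-space C. This is a simplified model, assuming the three-entry limit array works as in the patch's __sk_mem_schedule_cgrp(): at or below limits[0] pressure is cleared, above limits[1] pressure is entered, above limits[2] the allocation is suppressed. The per-socket minimum-buffer guarantees and the fair-share check are left out, and the names are made up for the example.

```c
#include <stdbool.h>

struct mem_group {
    long limits[3];   /* [0] low mark, [1] pressure mark, [2] hard limit */
    long allocated;   /* pages currently charged to the group */
    bool pressure;    /* memory pressure flag */
};

/* Try to charge `pages`; returns true if the allocation is allowed. */
static bool group_charge(struct mem_group *g, long pages)
{
    g->allocated += pages;

    /* Under the low mark: leave memory pressure and accept. */
    if (g->allocated <= g->limits[0]) {
        g->pressure = false;
        return true;
    }

    /* Above the pressure mark: flag pressure, but keep going. */
    if (g->allocated > g->limits[1])
        g->pressure = true;

    /* Above the hard limit: undo the charge and refuse. */
    if (g->allocated > g->limits[2]) {
        g->allocated -= pages;
        return false;
    }
    return true;
}
```

Because the limits and counters live in the group rather than in globals, the same function serves any number of groups, which is the point of turning the struct proto fields into accessors that take the group as an argument.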
Only tcp code was really touched here. udp had the pointers filled, but
we're not really controlling anything. But the fact that this lives in
generic code, makes it easier to do the same for other protocols in the
future.
For this patch specifically, I am not touching - just provisioning -
rmem and wmem specific knobs. I should also #ifdef a lot of this, but
hey, remember: rfc...
One drawback of this approach I found is that cgroups do not really
work well with modules. A lot of the network code is modularized, so
this would have to be fixed somehow.
Let me know what you think.
[-- Attachment #2: patch-rfc-sndbuf.patch --]
[-- Type: text/plain, Size: 27967 bytes --]
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index ac663c1..744eb2c 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -53,6 +53,8 @@ SUBSYS(freezer)
SUBSYS(net_cls)
#endif
+SUBSYS(sockets)
+
/* */
#ifdef CONFIG_BLK_CGROUP
diff --git a/include/net/sock.h b/include/net/sock.h
index 8e4062f..aae468f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -63,6 +63,33 @@
#include <net/dst.h>
#include <net/checksum.h>
+#include <linux/cgroup.h>
+
+struct sockets_cgrp
+{
+ struct cgroup_subsys_state css;
+ struct sockets_cgrp *parent;
+ int tcp_memory_pressure;
+ int tcp_max_memory;
+ atomic_long_t tcp_memory_allocated;
+ struct percpu_counter tcp_sockets_allocated;
+ long tcp_prot_mem[3];
+
+ atomic_long_t udp_memory_allocated;
+};
+
+static inline struct sockets_cgrp *cgroup_sk(struct cgroup *cgrp)
+{
+ return container_of(cgroup_subsys_state(cgrp, sockets_subsys_id),
+ struct sockets_cgrp, css);
+}
+
+static inline struct sockets_cgrp *task_sk(struct task_struct *tsk)
+{
+ return container_of(task_subsys_state(tsk, sockets_subsys_id),
+ struct sockets_cgrp, css);
+}
+
/*
* This structure really needs to be cleaned up.
* Most of it is for TCP, and not used by any of
@@ -339,6 +366,7 @@ struct sock {
#endif
__u32 sk_mark;
u32 sk_classid;
+ struct sockets_cgrp *sk_cgrp;
void (*sk_state_change)(struct sock *sk);
void (*sk_data_ready)(struct sock *sk, int bytes);
void (*sk_write_space)(struct sock *sk);
@@ -785,19 +813,21 @@ struct proto {
#endif
/* Memory pressure */
- void (*enter_memory_pressure)(struct sock *sk);
- atomic_long_t *memory_allocated; /* Current allocated memory. */
- struct percpu_counter *sockets_allocated; /* Current number of sockets. */
+ void (*enter_memory_pressure)(struct sockets_cgrp *sg);
+ atomic_long_t *(*memory_allocated)(struct sockets_cgrp *sg); /* Current allocated memory. */
+ struct percpu_counter *(*sockets_allocated)(struct sockets_cgrp *sg); /* Current number of sockets. */
+
+ int (*init_cgroup)(struct cgroup *cgrp, struct cgroup_subsys *ss);
/*
* Pressure flag: try to collapse.
* Technical note: it is used by multiple contexts non atomically.
* All the __sk_mem_schedule() is of this nature: accounting
* is strict, actions are advisory and have some latency.
*/
- int *memory_pressure;
- long *sysctl_mem;
- int *sysctl_wmem;
- int *sysctl_rmem;
+ int *(*memory_pressure)(struct sockets_cgrp *sg);
+ long *(*prot_mem)(struct sockets_cgrp *sg);
+ int *(*prot_wmem)(struct sock *sk);
+ int *(*prot_rmem)(struct sock *sk);
int max_header;
bool no_autobind;
@@ -826,6 +856,20 @@ struct proto {
#endif
};
+#define sk_memory_pressure(prot, sg) \
+({ \
+ int *__ret = NULL; \
+ if (prot->memory_pressure) \
+ __ret = prot->memory_pressure(sg); \
+ __ret; \
+})
+
+#define sk_sockets_allocated(prot, sg) ({ struct percpu_counter *__p = prot->sockets_allocated(sg); __p; })
+#define sk_prot_mem(prot, sg) ({ long *__mem = prot->prot_mem(sg); __mem; })
+#define sk_prot_rmem(sk) ({ int *__mem = sk->sk_prot->prot_rmem(sk); __mem; })
+#define sk_prot_wmem(sk) ({ int *__mem = sk->sk_prot->prot_wmem(sk); __mem; })
+#define sk_memory_allocated(prot, sg) ({ atomic_long_t *__mem = prot->memory_allocated(sg); __mem; })
+
extern int proto_register(struct proto *prot, int alloc_slab);
extern void proto_unregister(struct proto *prot);
@@ -1658,10 +1702,11 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp);
static inline struct page *sk_stream_alloc_page(struct sock *sk)
{
struct page *page = NULL;
+ struct sockets_cgrp *sg = sk->sk_cgrp;
page = alloc_pages(sk->sk_allocation, 0);
if (!page) {
- sk->sk_prot->enter_memory_pressure(sk);
+ sk->sk_prot->enter_memory_pressure(sg);
sk_stream_moderate_sndbuf(sk);
}
return page;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 149a415..64318ee 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -255,7 +255,14 @@ extern int sysctl_tcp_thin_dupack;
extern atomic_long_t tcp_memory_allocated;
extern struct percpu_counter tcp_sockets_allocated;
-extern int tcp_memory_pressure;
+
+extern long *tcp_sysctl_mem(struct sockets_cgrp *sg);
+struct percpu_counter *sockets_allocated_tcp(struct sockets_cgrp *sg);
+int *memory_pressure_tcp(struct sockets_cgrp *sg);
+int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss);
+atomic_long_t *memory_allocated_tcp(struct sockets_cgrp *sg);
+int *tcp_sysctl_wmem(struct sock *sk);
+int *tcp_sysctl_rmem(struct sock *sk);
/*
* The next routines deal with comparing 32 bit unsigned ints
@@ -278,6 +285,9 @@ static inline bool tcp_too_many_orphans(struct sock *sk, int shift)
{
struct percpu_counter *ocp = sk->sk_prot->orphan_count;
int orphans = percpu_counter_read_positive(ocp);
+ struct sockets_cgrp *sg = sk->sk_cgrp;
+
+ long *prot_mem = sk_prot_mem(sk->sk_prot, sg);
if (orphans << shift > sysctl_tcp_max_orphans) {
orphans = percpu_counter_sum_positive(ocp);
@@ -286,7 +296,7 @@ static inline bool tcp_too_many_orphans(struct sock *sk, int shift)
}
if (sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
- atomic_long_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])
+ atomic_long_read(&tcp_memory_allocated) > prot_mem[2])
return true;
return false;
}
@@ -999,7 +1009,7 @@ static inline void tcp_openreq_init(struct request_sock *req,
ireq->loc_port = tcp_hdr(skb)->dest;
}
-extern void tcp_enter_memory_pressure(struct sock *sk);
+extern void tcp_enter_memory_pressure(struct sockets_cgrp *sg);
static inline int keepalive_intvl_when(const struct tcp_sock *tp)
{
diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
index 779abb9..52a2258 100644
--- a/include/trace/events/sock.h
+++ b/include/trace/events/sock.h
@@ -31,34 +31,35 @@ TRACE_EVENT(sock_rcvqueue_full,
TRACE_EVENT(sock_exceed_buf_limit,
- TP_PROTO(struct sock *sk, struct proto *prot, long allocated),
+ TP_PROTO(struct sock *sk, struct proto *prot, long allocated,
+ long *prot_mem, int *prot_rmem),
- TP_ARGS(sk, prot, allocated),
+ TP_ARGS(sk, prot, allocated, prot_mem, prot_rmem),
TP_STRUCT__entry(
__array(char, name, 32)
- __field(long *, sysctl_mem)
+ __field(long *, prot_mem)
__field(long, allocated)
- __field(int, sysctl_rmem)
+ __field(int, prot_rmem)
__field(int, rmem_alloc)
),
TP_fast_assign(
strncpy(__entry->name, prot->name, 32);
- __entry->sysctl_mem = prot->sysctl_mem;
+ __entry->prot_mem = prot_mem;
__entry->allocated = allocated;
- __entry->sysctl_rmem = prot->sysctl_rmem[0];
+ __entry->prot_rmem = prot_rmem[0];
__entry->rmem_alloc = atomic_read(&sk->sk_rmem_alloc);
),
TP_printk("proto:%s sysctl_mem=%ld,%ld,%ld allocated=%ld "
"sysctl_rmem=%d rmem_alloc=%d",
__entry->name,
- __entry->sysctl_mem[0],
- __entry->sysctl_mem[1],
- __entry->sysctl_mem[2],
+ __entry->prot_mem[0],
+ __entry->prot_mem[1],
+ __entry->prot_mem[2],
__entry->allocated,
- __entry->sysctl_rmem,
+ __entry->prot_rmem,
__entry->rmem_alloc)
);
diff --git a/net/core/sock.c b/net/core/sock.c
index bc745d0..f38045a 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -111,6 +111,7 @@
#include <linux/init.h>
#include <linux/highmem.h>
#include <linux/user_namespace.h>
+#include <linux/cgroup.h>
#include <asm/uaccess.h>
#include <asm/system.h>
@@ -134,6 +135,55 @@
#include <net/tcp.h>
#endif
+static DEFINE_RWLOCK(proto_list_lock);
+static LIST_HEAD(proto_list);
+
+static int sockets_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct proto *proto;
+ int ret = 0;
+
+ read_lock(&proto_list_lock);
+ list_for_each_entry(proto, &proto_list, node) {
+ if (proto->init_cgroup) {
+ ret |= proto->init_cgroup(cgrp, ss);
+ }
+ }
+ read_unlock(&proto_list_lock);
+
+ return ret;
+}
+
+static void
+sockets_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct sockets_cgrp *sk = cgroup_sk(cgrp);
+
+ kfree(sk);
+}
+
+static struct cgroup_subsys_state *sockets_create(
+ struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+ struct sockets_cgrp *sk = kzalloc(sizeof(*sk), GFP_KERNEL);
+
+ if (!sk)
+ return ERR_PTR(-ENOMEM);
+
+ if (cgrp->parent)
+ sk->parent = cgroup_sk(cgrp->parent);
+
+ return &sk->css;
+}
+
+struct cgroup_subsys sockets_subsys = {
+ .name = "sockets",
+ .create = sockets_create,
+ .destroy = sockets_destroy,
+ .populate = sockets_populate,
+ .subsys_id = sockets_subsys_id,
+};
+
/*
* Each address family might have different locking rules, so we have
* one slock key per address family:
@@ -1114,6 +1164,14 @@ void sock_update_classid(struct sock *sk)
sk->sk_classid = classid;
}
EXPORT_SYMBOL(sock_update_classid);
+
+void sock_update_cgrp(struct sock *sk)
+{
+ rcu_read_lock();
+ sk->sk_cgrp = task_sk(current);
+ rcu_read_unlock();
+}
+
#endif
/**
@@ -1141,6 +1199,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
atomic_set(&sk->sk_wmem_alloc, 1);
sock_update_classid(sk);
+ sock_update_cgrp(sk);
}
return sk;
@@ -1210,6 +1269,7 @@ EXPORT_SYMBOL(sk_release_kernel);
struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
{
struct sock *newsk;
+ struct sockets_cgrp *sg = sk->sk_cgrp;
newsk = sk_prot_alloc(sk->sk_prot, priority, sk->sk_family);
if (newsk != NULL) {
@@ -1289,8 +1349,8 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
sk_set_socket(newsk, NULL);
newsk->sk_wq = NULL;
- if (newsk->sk_prot->sockets_allocated)
- percpu_counter_inc(newsk->sk_prot->sockets_allocated);
+ if (sk_sockets_allocated(sk->sk_prot, sg))
+ percpu_counter_inc(sk_sockets_allocated(sk->sk_prot, sg));
if (sock_flag(newsk, SOCK_TIMESTAMP) ||
sock_flag(newsk, SOCK_TIMESTAMPING_RX_SOFTWARE))
@@ -1666,61 +1726,55 @@ int sk_wait_data(struct sock *sk, long *timeo)
}
EXPORT_SYMBOL(sk_wait_data);
-/**
- * __sk_mem_schedule - increase sk_forward_alloc and memory_allocated
- * @sk: socket
- * @size: memory size to allocate
- * @kind: allocation type
- *
- * If kind is SK_MEM_SEND, it means wmem allocation. Otherwise it means
- * rmem allocation. This function assumes that protocols which have
- * memory_pressure use sk_wmem_queued as write buffer accounting.
- */
-int __sk_mem_schedule(struct sock *sk, int size, int kind)
+int __sk_mem_schedule_cgrp(struct sock *sk, struct sockets_cgrp *sg,
+ int amt, int kind, int first)
{
struct proto *prot = sk->sk_prot;
- int amt = sk_mem_pages(size);
long allocated;
+ long *prot_mem;
+ int *memory_pressure;
- sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
- allocated = atomic_long_add_return(amt, prot->memory_allocated);
+ memory_pressure = sk_memory_pressure(prot, sg);
+ prot_mem = sk_prot_mem(prot, sg);
+
+ allocated = atomic_long_add_return(amt, sk_memory_allocated(prot, sg));
/* Under limit. */
- if (allocated <= prot->sysctl_mem[0]) {
- if (prot->memory_pressure && *prot->memory_pressure)
- *prot->memory_pressure = 0;
+ if (allocated <= prot_mem[0]) {
+ if (memory_pressure && *memory_pressure)
+ *memory_pressure = 0;
return 1;
}
/* Under pressure. */
- if (allocated > prot->sysctl_mem[1])
+ if (allocated > prot_mem[1])
if (prot->enter_memory_pressure)
- prot->enter_memory_pressure(sk);
+ prot->enter_memory_pressure(sg);
/* Over hard limit. */
- if (allocated > prot->sysctl_mem[2])
+ if (allocated > prot_mem[2])
goto suppress_allocation;
/* guarantee minimum buffer size under pressure */
if (kind == SK_MEM_RECV) {
- if (atomic_read(&sk->sk_rmem_alloc) < prot->sysctl_rmem[0])
+ if (atomic_read(&sk->sk_rmem_alloc) < sk_prot_rmem(sk)[0])
return 1;
} else { /* SK_MEM_SEND */
if (sk->sk_type == SOCK_STREAM) {
- if (sk->sk_wmem_queued < prot->sysctl_wmem[0])
+ if (sk->sk_wmem_queued < sk_prot_wmem(sk)[0])
return 1;
} else if (atomic_read(&sk->sk_wmem_alloc) <
- prot->sysctl_wmem[0])
+ sk_prot_wmem(sk)[0])
return 1;
}
- if (prot->memory_pressure) {
+ if (memory_pressure) {
int alloc;
- if (!*prot->memory_pressure)
+ if (!*memory_pressure)
return 1;
- alloc = percpu_counter_read_positive(prot->sockets_allocated);
- if (prot->sysctl_mem[2] > alloc *
+ alloc = percpu_counter_read_positive(sk_sockets_allocated(prot, sg));
+ if (prot_mem[2] > alloc *
sk_mem_pages(sk->sk_wmem_queued +
atomic_read(&sk->sk_rmem_alloc) +
sk->sk_forward_alloc))
@@ -1728,6 +1782,44 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
}
suppress_allocation:
+ if (first)
+ trace_sock_exceed_buf_limit(sk, prot, allocated,
+ prot_mem, sk_prot_rmem(sk));
+ return 0;
+}
+
+/**
+ * __sk_mem_schedule - increase sk_forward_alloc and memory_allocated
+ * @sk: socket
+ * @size: memory size to allocate
+ * @kind: allocation type
+ *
+ * If kind is SK_MEM_SEND, it means wmem allocation. Otherwise it means
+ * rmem allocation. This function assumes that protocols which have
+ * memory_pressure use sk_wmem_queued as write buffer accounting.
+ */
+int __sk_mem_schedule(struct sock *sk, int size, int kind)
+{
+ struct sockets_cgrp *sg;
+ int amt = sk_mem_pages(size);
+ int first = 1;
+ int ret = 0;
+ struct proto *prot = sk->sk_prot;
+
+ sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
+
+ for (sg = sk->sk_cgrp; sg != NULL; sg = sg->parent) {
+ int r;
+ r = __sk_mem_schedule_cgrp(sk, sg, amt, kind, first);
+ if (first)
+ ret = r;
+ first = 0;
+ }
+
+ if (ret > 0)
+ goto out;
+
+ /* Suppress current allocation */
if (kind == SK_MEM_SEND && sk->sk_type == SOCK_STREAM) {
sk_stream_moderate_sndbuf(sk);
@@ -1739,12 +1831,15 @@ suppress_allocation:
return 1;
}
- trace_sock_exceed_buf_limit(sk, prot, allocated);
-
/* Alas. Undo changes. */
sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM;
- atomic_long_sub(amt, prot->memory_allocated);
- return 0;
+
+ for (sg = sk->sk_cgrp; sg != NULL; sg = sg->parent) {
+ atomic_long_sub(amt, sk_memory_allocated(prot, sg));
+ }
+out:
+ return ret;
+
}
EXPORT_SYMBOL(__sk_mem_schedule);
@@ -1755,14 +1850,16 @@ EXPORT_SYMBOL(__sk_mem_schedule);
void __sk_mem_reclaim(struct sock *sk)
{
struct proto *prot = sk->sk_prot;
+ struct sockets_cgrp *sg = sk->sk_cgrp;
+ int *memory_pressure = sk_memory_pressure(prot, sg);
atomic_long_sub(sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT,
- prot->memory_allocated);
+ sk_memory_allocated(prot, sg));
sk->sk_forward_alloc &= SK_MEM_QUANTUM - 1;
- if (prot->memory_pressure && *prot->memory_pressure &&
- (atomic_long_read(prot->memory_allocated) < prot->sysctl_mem[0]))
- *prot->memory_pressure = 0;
+ if (memory_pressure && *memory_pressure &&
+ (atomic_long_read(sk_memory_allocated(prot, sg)) < sk_prot_mem(prot, sg)[0]))
+ *memory_pressure = 0;
}
EXPORT_SYMBOL(__sk_mem_reclaim);
@@ -2254,9 +2351,6 @@ void sk_common_release(struct sock *sk)
}
EXPORT_SYMBOL(sk_common_release);
-static DEFINE_RWLOCK(proto_list_lock);
-static LIST_HEAD(proto_list);
-
#ifdef CONFIG_PROC_FS
#define PROTO_INUSE_NR 64 /* should be enough for the first time */
struct prot_inuse {
@@ -2481,13 +2575,15 @@ static char proto_method_implemented(const void *method)
static void proto_seq_printf(struct seq_file *seq, struct proto *proto)
{
+ struct sockets_cgrp *sg = task_sk(current);
+
seq_printf(seq, "%-9s %4u %6d %6ld %-3s %6u %-3s %-10s "
"%2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c %2c\n",
proto->name,
proto->obj_size,
sock_prot_inuse_get(seq_file_net(seq), proto),
- proto->memory_allocated != NULL ? atomic_long_read(proto->memory_allocated) : -1L,
- proto->memory_pressure != NULL ? *proto->memory_pressure ? "yes" : "no" : "NI",
+ proto->memory_allocated != NULL ? atomic_long_read(sk_memory_allocated(proto, sg)) : -1L,
+ proto->memory_pressure != NULL ? *sk_memory_pressure(proto, sg) ? "yes" : "no" : "NI",
proto->max_header,
proto->slab == NULL ? "no" : "yes",
module_name(proto->owner),
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index b14ec7d..9b380be 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -53,19 +53,21 @@ static int sockstat_seq_show(struct seq_file *seq, void *v)
struct net *net = seq->private;
int orphans, sockets;
+ struct sockets_cgrp *sg = task_sk(current);
+
local_bh_disable();
orphans = percpu_counter_sum_positive(&tcp_orphan_count);
- sockets = percpu_counter_sum_positive(&tcp_sockets_allocated);
+ sockets = percpu_counter_sum_positive(sk_sockets_allocated((&tcp_prot), sg));
local_bh_enable();
socket_seq_show(seq);
seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %ld\n",
sock_prot_inuse_get(net, &tcp_prot), orphans,
tcp_death_row.tw_count, sockets,
- atomic_long_read(&tcp_memory_allocated));
+ atomic_long_read(sk_memory_allocated((&tcp_prot), sg)));
seq_printf(seq, "UDP: inuse %d mem %ld\n",
sock_prot_inuse_get(net, &udp_prot),
- atomic_long_read(&udp_memory_allocated));
+ atomic_long_read(sk_memory_allocated((&udp_prot), sg)));
seq_printf(seq, "UDPLITE: inuse %d\n",
sock_prot_inuse_get(net, &udplite_prot));
seq_printf(seq, "RAW: inuse %d\n",
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 46febca..a4eb7ea 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -293,6 +293,9 @@ EXPORT_SYMBOL(sysctl_tcp_wmem);
atomic_long_t tcp_memory_allocated; /* Current allocated memory. */
EXPORT_SYMBOL(tcp_memory_allocated);
+int tcp_memory_pressure;
+EXPORT_SYMBOL(tcp_memory_pressure);
+
/*
* Current number of TCP sockets.
*/
@@ -314,18 +317,118 @@ struct tcp_splice_state {
* All the __sk_mem_schedule() is of this nature: accounting
* is strict, actions are advisory and have some latency.
*/
-int tcp_memory_pressure __read_mostly;
-EXPORT_SYMBOL(tcp_memory_pressure);
-
-void tcp_enter_memory_pressure(struct sock *sk)
+void tcp_enter_memory_pressure(struct sockets_cgrp *sg)
{
- if (!tcp_memory_pressure) {
- NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMEMORYPRESSURES);
- tcp_memory_pressure = 1;
+ if (!sg->tcp_memory_pressure) {
+// FIXME: how to grab net pointer from cgroup?
+// NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMEMORYPRESSURES);
}
+
+ sg->tcp_memory_pressure = 1;
}
EXPORT_SYMBOL(tcp_enter_memory_pressure);
+long *tcp_sysctl_mem(struct sockets_cgrp *sg)
+{
+ return sg->tcp_prot_mem;
+}
+EXPORT_SYMBOL(tcp_sysctl_mem);
+
+int *tcp_sysctl_rmem(struct sock *sk)
+{
+ return sysctl_tcp_rmem;
+}
+EXPORT_SYMBOL(tcp_sysctl_rmem);
+
+int *tcp_sysctl_wmem(struct sock *sk)
+{
+ return sysctl_tcp_wmem;
+}
+EXPORT_SYMBOL(tcp_sysctl_wmem);
+
+atomic_long_t *memory_allocated_tcp(struct sockets_cgrp *sg)
+{
+ return &(sg->tcp_memory_allocated);
+}
+EXPORT_SYMBOL(memory_allocated_tcp);
+
+static int tcp_write_maxmem(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+ struct sockets_cgrp *sg = cgroup_sk(cgrp);
+
+ if (!cgroup_lock_live_group(cgrp))
+ return -ENODEV;
+
+ /*
+ * We can't allow more memory than our parents. Since this
+ * will be tested for all calls, by induction, there is no need
+ * to test any parent other than our own
+ * */
+ if (sg->parent && (val > sg->parent->tcp_max_memory))
+ val = sg->parent->tcp_max_memory;
+
+ sg->tcp_max_memory = val;
+
+ sg->tcp_prot_mem[0] = val / 4 * 3;
+ sg->tcp_prot_mem[1] = val;
+ sg->tcp_prot_mem[2] = sg->tcp_prot_mem[0] * 2;
+
+ cgroup_unlock();
+
+ return 0;
+}
+
+static u64 tcp_read_maxmem(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct sockets_cgrp *sg = cgroup_sk(cgrp);
+ u64 ret;
+
+ if (!cgroup_lock_live_group(cgrp))
+ return -ENODEV;
+ ret = sg->tcp_max_memory;
+
+ cgroup_unlock();
+ return ret;
+}
+
+static struct cftype tcp_files[] = {
+ {
+ .name = "tcp_maxmem",
+ .write_u64 = tcp_write_maxmem,
+ .read_u64 = tcp_read_maxmem,
+ },
+};
+
+int tcp_init_cgroup(struct cgroup *cgrp, struct cgroup_subsys *ss)
+{
+ struct sockets_cgrp *sg = cgroup_sk(cgrp);
+ sg->tcp_memory_pressure = 0;
+
+ percpu_counter_init(&sg->tcp_sockets_allocated, 0);
+ atomic_long_set(&sg->tcp_memory_allocated, 0);
+
+ sg->tcp_max_memory = sysctl_tcp_mem[1];
+
+ sg->tcp_prot_mem[0] = sysctl_tcp_mem[1] / 4 * 3;
+ sg->tcp_prot_mem[1] = sysctl_tcp_mem[1];
+ sg->tcp_prot_mem[2] = sg->tcp_prot_mem[0] * 2;
+
+ return cgroup_add_files(cgrp, ss, tcp_files, ARRAY_SIZE(tcp_files));
+}
+EXPORT_SYMBOL(tcp_init_cgroup);
+
+int *memory_pressure_tcp(struct sockets_cgrp *sg)
+{
+ return &sg->tcp_memory_pressure;
+}
+EXPORT_SYMBOL(memory_pressure_tcp);
+
+struct percpu_counter *sockets_allocated_tcp(struct sockets_cgrp *sg)
+{
+ return &sg->tcp_sockets_allocated;
+}
+EXPORT_SYMBOL(sockets_allocated_tcp);
+
/* Convert seconds to retransmits based on initial and max timeout */
static u8 secs_to_retrans(int seconds, int timeout, int rto_max)
{
@@ -710,7 +813,7 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp)
}
__kfree_skb(skb);
} else {
- sk->sk_prot->enter_memory_pressure(sk);
+ sk->sk_prot->enter_memory_pressure(sk->sk_cgrp);
sk_stream_moderate_sndbuf(sk);
}
return NULL;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ea0d218..38dac60 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -312,11 +312,12 @@ static int __tcp_grow_window(const struct sock *sk, const struct sk_buff *skb)
static void tcp_grow_window(struct sock *sk, struct sk_buff *skb)
{
struct tcp_sock *tp = tcp_sk(sk);
+ struct sockets_cgrp *sg = sk->sk_cgrp;
/* Check #1 */
if (tp->rcv_ssthresh < tp->window_clamp &&
(int)tp->rcv_ssthresh < tcp_space(sk) &&
- !tcp_memory_pressure) {
+ !sg->tcp_memory_pressure) {
int incr;
/* Check #2. Increase window, if skb with such overhead
@@ -393,15 +394,16 @@ static void tcp_clamp_window(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk);
struct inet_connection_sock *icsk = inet_csk(sk);
+ struct sockets_cgrp *sg = sk->sk_cgrp;
icsk->icsk_ack.quick = 0;
- if (sk->sk_rcvbuf < sysctl_tcp_rmem[2] &&
+ if (sk->sk_rcvbuf < sk_prot_rmem(sk)[2] &&
!(sk->sk_userlocks & SOCK_RCVBUF_LOCK) &&
- !tcp_memory_pressure &&
- atomic_long_read(&tcp_memory_allocated) < sysctl_tcp_mem[0]) {
+ !sg->tcp_memory_pressure &&
+ atomic_long_read(&tcp_memory_allocated) < sk_prot_mem(sk->sk_prot, sg)[0]) {
sk->sk_rcvbuf = min(atomic_read(&sk->sk_rmem_alloc),
- sysctl_tcp_rmem[2]);
+ sk_prot_rmem(sk)[2]);
}
if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
tp->rcv_ssthresh = min(tp->window_clamp, 2U * tp->advmss);
@@ -4799,6 +4801,7 @@ static int tcp_prune_ofo_queue(struct sock *sk)
static int tcp_prune_queue(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk);
+ struct sockets_cgrp *sg = sk->sk_cgrp;
SOCK_DEBUG(sk, "prune_queue: c=%x\n", tp->copied_seq);
@@ -4806,7 +4809,7 @@ static int tcp_prune_queue(struct sock *sk)
if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
tcp_clamp_window(sk);
- else if (tcp_memory_pressure)
+ else if (sg->tcp_memory_pressure)
tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
tcp_collapse_ofo_queue(sk);
@@ -4864,6 +4867,7 @@ void tcp_cwnd_application_limited(struct sock *sk)
static int tcp_should_expand_sndbuf(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk);
+ struct sockets_cgrp *sg = sk->sk_cgrp;
/* If the user specified a specific send buffer setting, do
* not modify it.
@@ -4872,11 +4876,11 @@ static int tcp_should_expand_sndbuf(struct sock *sk)
return 0;
/* If we are under global TCP memory pressure, do not expand. */
- if (tcp_memory_pressure)
+ if (sg->tcp_memory_pressure)
return 0;
/* If we are under soft global TCP memory pressure, do not expand. */
- if (atomic_long_read(&tcp_memory_allocated) >= sysctl_tcp_mem[0])
+ if (atomic_long_read(&tcp_memory_allocated) >= sk_prot_mem(sk->sk_prot, sg)[0])
return 0;
/* If we filled the congestion window, do not expand. */
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 955b8e6..aa6b68c 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2597,13 +2597,14 @@ struct proto tcp_prot = {
.unhash = inet_unhash,
.get_port = inet_csk_get_port,
.enter_memory_pressure = tcp_enter_memory_pressure,
- .sockets_allocated = &tcp_sockets_allocated,
+ .memory_pressure = memory_pressure_tcp,
+ .sockets_allocated = sockets_allocated_tcp,
.orphan_count = &tcp_orphan_count,
- .memory_allocated = &tcp_memory_allocated,
- .memory_pressure = &tcp_memory_pressure,
- .sysctl_mem = sysctl_tcp_mem,
- .sysctl_wmem = sysctl_tcp_wmem,
- .sysctl_rmem = sysctl_tcp_rmem,
+ .memory_allocated = memory_allocated_tcp,
+ .init_cgroup = tcp_init_cgroup,
+ .prot_mem = tcp_sysctl_mem,
+ .prot_wmem = tcp_sysctl_wmem,
+ .prot_rmem = tcp_sysctl_rmem,
.max_header = MAX_TCP_HEADER,
.obj_size = sizeof(struct tcp_sock),
.slab_flags = SLAB_DESTROY_BY_RCU,
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 882e0b0..24f975c 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1905,6 +1905,7 @@ u32 __tcp_select_window(struct sock *sk)
int free_space = tcp_space(sk);
int full_space = min_t(int, tp->window_clamp, tcp_full_space(sk));
int window;
+ struct sockets_cgrp *sg = sk->sk_cgrp;
if (mss > full_space)
mss = full_space;
@@ -1912,7 +1913,7 @@ u32 __tcp_select_window(struct sock *sk)
if (free_space < (full_space >> 1)) {
icsk->icsk_ack.quick = 0;
- if (tcp_memory_pressure)
+ if (sg->tcp_memory_pressure)
tp->rcv_ssthresh = min(tp->rcv_ssthresh,
4U * tp->advmss);
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index ecd44b0..a82e38a 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -213,6 +213,9 @@ static void tcp_delack_timer(unsigned long data)
struct sock *sk = (struct sock *)data;
struct tcp_sock *tp = tcp_sk(sk);
struct inet_connection_sock *icsk = inet_csk(sk);
+ struct sockets_cgrp *sg;
+
+ sg = sk->sk_cgrp;
bh_lock_sock(sk);
if (sock_owned_by_user(sk)) {
@@ -261,7 +264,7 @@ static void tcp_delack_timer(unsigned long data)
}
out:
- if (tcp_memory_pressure)
+ if (sg->tcp_memory_pressure)
sk_mem_reclaim(sk);
out_unlock:
bh_unlock_sock(sk);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 1b5a193..d5025bd 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -120,9 +120,6 @@ EXPORT_SYMBOL(sysctl_udp_rmem_min);
int sysctl_udp_wmem_min __read_mostly;
EXPORT_SYMBOL(sysctl_udp_wmem_min);
-atomic_long_t udp_memory_allocated;
-EXPORT_SYMBOL(udp_memory_allocated);
-
#define MAX_UDP_PORTS 65536
#define PORTS_PER_CHAIN (MAX_UDP_PORTS / UDP_HTABLE_SIZE_MIN)
@@ -1918,6 +1915,29 @@ unsigned int udp_poll(struct file *file, struct socket *sock, poll_table *wait)
}
EXPORT_SYMBOL(udp_poll);
+atomic_long_t *memory_allocated_udp(struct sockets_cgrp *sg)
+{
+ return &sg->udp_memory_allocated;
+}
+EXPORT_SYMBOL(memory_allocated_udp);
+
+long *udp_sysctl_mem(struct sockets_cgrp *sg)
+{
+ return sysctl_udp_mem;
+}
+
+int *udp_sysctl_rmem(struct sock *sk)
+{
+ return &sysctl_udp_rmem_min;
+}
+EXPORT_SYMBOL(udp_sysctl_rmem);
+
+int *udp_sysctl_wmem(struct sock *sk)
+{
+ return &sysctl_udp_wmem_min;
+}
+EXPORT_SYMBOL(udp_sysctl_wmem);
+
struct proto udp_prot = {
.name = "UDP",
.owner = THIS_MODULE,
@@ -1936,10 +1956,10 @@ struct proto udp_prot = {
.unhash = udp_lib_unhash,
.rehash = udp_v4_rehash,
.get_port = udp_v4_get_port,
- .memory_allocated = &udp_memory_allocated,
- .sysctl_mem = sysctl_udp_mem,
- .sysctl_wmem = &sysctl_udp_wmem_min,
- .sysctl_rmem = &sysctl_udp_rmem_min,
+ .memory_allocated = memory_allocated_udp,
+ .prot_mem = udp_sysctl_mem,
+ .prot_wmem = udp_sysctl_wmem,
+ .prot_rmem = udp_sysctl_rmem,
.obj_size = sizeof(struct udp_sock),
.slab_flags = SLAB_DESTROY_BY_RCU,
.h.udp_table = &udp_table,
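The limit handling in tcp_write_maxmem() above can be modelled in user space. This is a minimal sketch under simplifying assumptions: struct grp_model and grp_set_max() are hypothetical names that are not part of the patch, and all locking is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-cgroup state used by the patch. */
struct grp_model {
    struct grp_model *parent;   /* NULL for the root group */
    uint64_t max;               /* models tcp_max_memory */
    long mem[3];                /* models tcp_prot_mem[0..2] */
};

/*
 * Set a group's limit. The value is capped at the direct parent's limit.
 * If every ancestor was capped the same way when it was set, checking
 * only the direct parent is sufficient.
 */
static void grp_set_max(struct grp_model *g, uint64_t val)
{
    if (g->parent && val > g->parent->max)
        val = g->parent->max;

    g->max = val;
    g->mem[0] = (long)(val / 4 * 3);  /* three quarters of the limit */
    g->mem[1] = (long)val;            /* the limit itself */
    g->mem[2] = g->mem[0] * 2;        /* one and a half times the limit */
}
```

With a parent limit of 1000, a child asking for 4000 ends up with 1000, and the three thresholds become 750, 1000 and 1500. In this model, lowering a parent's limit later does not revisit its children; the thread does not say how the patch is meant to handle that case.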
^ permalink raw reply related
* Re: [BUG] tcp : how many times a frame can possibly be retransmitted ?
From: Eric Dumazet @ 2011-08-24 23:00 UTC (permalink / raw)
To: Ilpo Järvinen; +Cc: netdev, Jerry Chu, Damian Lukowski
In-Reply-To: <alpine.DEB.2.00.1108250142030.11551@melkinpaasi.cs.helsinki.fi>
Le jeudi 25 août 2011 à 01:44 +0300, Ilpo Järvinen a écrit :
> On Wed, 24 Aug 2011, Eric Dumazet wrote:
>
> > On one dev machine running net-next, I just found strange tcp sessions
> > that retransmit a frame forever (The other peer disappeared)
> >
> > # ss -emoi dst 10.2.1.1
> > State Recv-Q Send-Q Local Address:Port Peer Address:Port
> > ESTAB 0 816 10.2.1.2:37930 10.2.1.1:ssh timer:(on,630ms,246) ino:60786 sk:ffff8801189aa400
> > mem:(r0,w3776,f320,t0) ts sack ecn cubic wscale:8,6 rto:1680 rtt:16.25/7.5 ato:40 ssthresh:7 send 1.4Mbps rcv_rtt:10 rcv_space:16632
> >
> >
> > You can see the retransmit count : 246
> >
> > What possibly can be going on ?
> >
> > What happened to backoff ?
> >
> > # grep . /proc/sys/net/ipv4/tcp_retries*
> > /proc/sys/net/ipv4/tcp_retries1:3
> > /proc/sys/net/ipv4/tcp_retries2:15
> >
> >
> >
> > extract of tcpdump :
> >
> > 12:01:02.074244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128024 59389>
> > 12:01:03.754243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128192 59389>
> > 12:01:05.434245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128360 59389>
> > 12:01:07.114243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128528 59389>
> > 12:01:08.794248 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128696 59389>
> > 12:01:10.474242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16128864 59389>
> > 12:01:12.154243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129032 59389>
> > 12:01:13.834241 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129200 59389>
> > 12:01:15.514246 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129368 59389>
> > 12:01:17.194244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129536 59389>
> > 12:01:18.874248 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129704 59389>
> > 12:01:20.554243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16129872 59389>
> > 12:01:22.234244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130040 59389>
> > 12:01:23.914244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130208 59389>
> > 12:01:25.594247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130376 59389>
> > 12:01:27.274242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130544 59389>
> > 12:01:28.954242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130712 59389>
> > 12:01:30.634248 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16130880 59389>
> > 12:01:32.314245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131048 59389>
> > 12:01:33.994243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131216 59389>
> > 12:01:35.674250 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131384 59389>
> > 12:01:37.354244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131552 59389>
> > 12:01:39.034245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131720 59389>
> > 12:01:40.714245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16131888 59389>
> > 12:01:42.394245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132056 59389>
> > 12:01:44.074242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132224 59389>
> > 12:01:45.754249 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132392 59389>
> > 12:01:47.434242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132560 59389>
> > 12:01:49.114247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132728 59389>
> > 12:01:50.794250 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16132896 59389>
> > 12:01:52.474247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133064 59389>
> > 12:01:54.154242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133232 59389>
> > 12:01:55.834246 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133400 59389>
> > 12:01:57.514243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133568 59389>
> > 12:01:59.194247 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133736 59389>
> > 12:02:00.874250 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16133904 59389>
> > 12:02:02.554242 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134072 59389>
> > 12:02:04.234243 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134240 59389>
> > 12:02:05.914245 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134408 59389>
> > 12:02:07.594244 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134576 59389>
> > 12:02:09.274249 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134744 59389>
> > 12:02:10.954241 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16134912 59389>
> > 12:02:12.634249 IP 10.2.1.2.37930 > 10.2.1.1.ssh: P 0:144(144) ack 1 win 1002 <nop,nop,timestamp 16135080 59389>
> >
> > tcp_retransmit_timer() does the exponential backoff, but something
> > resets icsk_rto to a low value ?
> >
> > Ah, it seems to be because of commit f1ecd5d9e7366609
> > (Revert Backoff [v3]: Revert RTO on ICMP destination unreachable)
> >
> > Since arp resolution (or routing, I dont know yet) fails, an
> > internal/loopback ICMP host/network unreachable message is
> > generated and handled in tcp_v4_err() :
> >
> > icsk_backoff-- and icsk_rto is reset.
> >
> > I am afraid this can generate a storm (cpu time at very least),
> > in case we have many tcp sessions in this state.
>
> But RTO (even without any backoffs) should be lower bounded to some not so
> zeroish value?
Apparently not.
The only thing that protects us from a flood is that ip_error() uses the
inetpeer cache to rate-limit icmp_send(ICMP_DEST_UNREACH).
This is why we get a retransmit period >= 1 sec.
vi +432 net/ipv4/tcp_ipv4.c
icsk->icsk_backoff--;
inet_csk(sk)->icsk_rto = (tp->srtt ? __tcp_set_rto(tp) :
TCP_TIMEOUT_INIT) << icsk->icsk_backoff;
tcp_bound_rto(sk);
and __tcp_set_rto() uses : return (tp->srtt >> 3) + tp->rttvar;
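The arithmetic quoted above can be tried out in a standalone C model. This is only a sketch, not the kernel code: the function names and the MODEL_* constants are assumptions made for illustration, values are in milliseconds rather than jiffies, and the lower bound in the last function is a hypothetical addition that the quoted path does not have.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative values in milliseconds; assumed, not read from kernel headers. */
#define MODEL_RTO_MIN   200u     /* stand-in for TCP_RTO_MIN */
#define MODEL_RTO_MAX   120000u  /* stand-in for TCP_RTO_MAX */
#define MODEL_RTO_INIT  1000u    /* stand-in for TCP_TIMEOUT_INIT */

/* Same expression as the quoted __tcp_set_rto(): (srtt >> 3) + rttvar. */
static uint32_t model_set_rto(uint32_t srtt, uint32_t rttvar)
{
    return (srtt >> 3) + rttvar;
}

/*
 * RTO for a given backoff count with only an upper bound applied,
 * following the shape of the quoted snippet.
 */
static uint32_t model_rto_upper_only(uint32_t srtt, uint32_t rttvar,
                                     unsigned backoff)
{
    uint32_t base = srtt ? model_set_rto(srtt, rttvar) : MODEL_RTO_INIT;
    uint64_t rto;

    assert(backoff < 32);             /* keep the 64-bit shift in range */
    rto = (uint64_t)base << backoff;
    return rto > MODEL_RTO_MAX ? MODEL_RTO_MAX : (uint32_t)rto;
}

/* Hypothetical variant that also enforces a lower bound. */
static uint32_t model_rto_clamped(uint32_t srtt, uint32_t rttvar,
                                  unsigned backoff)
{
    uint32_t rto = model_rto_upper_only(srtt, rttvar, backoff);

    return rto < MODEL_RTO_MIN ? MODEL_RTO_MIN : rto;
}
```

For example, srtt = 130 (16.25 ms scaled by 8) and rttvar = 30 give a base of 46 ms. Once the backoff count has been driven back to zero, only the variant with the lower bound keeps the timer at 200 ms or more.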
^ permalink raw reply
* Re: [PATCH net-next] rps: support IPIP encapsulation
From: David Miller @ 2011-08-24 23:14 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev, therbert
In-Reply-To: <1314218479.2506.15.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 24 Aug 2011 22:41:19 +0200
> Skip IPIP header to get proper layer-4 information.
>
> Like GRE tunnels, this only works if rxhash is not already provided by
> the device itself (ethtool -K ethX rxhash off), to allow kernel compute
> a software rxhash.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Applied, thanks.
^ permalink raw reply
* [PATCH] ibmveth: Fix leak when recycling skb and hypervisor returns error
From: Anton Blanchard @ 2011-08-24 23:20 UTC (permalink / raw)
To: santil; +Cc: netdev
If h_add_logical_lan_buffer returns an error we need to free
the skb.
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable <stable@kernel.org>
---
Index: linux-net/drivers/net/ethernet/ibm/ibmveth.c
===================================================================
--- linux-net.orig/drivers/net/ethernet/ibm/ibmveth.c 2011-08-24 17:06:56.894207820 +1000
+++ linux-net/drivers/net/ethernet/ibm/ibmveth.c 2011-08-25 08:44:14.212105871 +1000
@@ -395,7 +395,7 @@ static inline struct sk_buff *ibmveth_rx
}
/* recycle the current buffer on the rx queue */
-static void ibmveth_rxq_recycle_buffer(struct ibmveth_adapter *adapter)
+static int ibmveth_rxq_recycle_buffer(struct ibmveth_adapter *adapter)
{
u32 q_index = adapter->rx_queue.index;
u64 correlator = adapter->rx_queue.queue_addr[q_index].correlator;
@@ -403,6 +403,7 @@ static void ibmveth_rxq_recycle_buffer(s
unsigned int index = correlator & 0xffffffffUL;
union ibmveth_buf_desc desc;
unsigned long lpar_rc;
+ int ret = 1;
BUG_ON(pool >= IBMVETH_NUM_BUFF_POOLS);
BUG_ON(index >= adapter->rx_buff_pool[pool].size);
@@ -410,7 +411,7 @@ static void ibmveth_rxq_recycle_buffer(s
if (!adapter->rx_buff_pool[pool].active) {
ibmveth_rxq_harvest_buffer(adapter);
ibmveth_free_buffer_pool(adapter, &adapter->rx_buff_pool[pool]);
- return;
+ goto out;
}
desc.fields.flags_len = IBMVETH_BUF_VALID |
@@ -423,12 +424,16 @@ static void ibmveth_rxq_recycle_buffer(s
netdev_dbg(adapter->netdev, "h_add_logical_lan_buffer failed "
"during recycle rc=%ld", lpar_rc);
ibmveth_remove_buffer_from_pool(adapter, adapter->rx_queue.queue_addr[adapter->rx_queue.index].correlator);
+ ret = 0;
}
if (++adapter->rx_queue.index == adapter->rx_queue.num_slots) {
adapter->rx_queue.index = 0;
adapter->rx_queue.toggle = !adapter->rx_queue.toggle;
}
+
+out:
+ return ret;
}
static void ibmveth_rxq_harvest_buffer(struct ibmveth_adapter *adapter)
@@ -1084,8 +1089,9 @@ restart_poll:
if (rx_flush)
ibmveth_flush_buffer(skb->data,
length + offset);
+ if (!ibmveth_rxq_recycle_buffer(adapter))
+ kfree_skb(skb);
skb = new_skb;
- ibmveth_rxq_recycle_buffer(adapter);
} else {
ibmveth_rxq_harvest_buffer(adapter);
skb_reserve(skb, offset);
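The ownership rule behind this change can be shown with a small user-space model. This is a generic sketch rather than driver code: struct rx_pool, buf_alloc(), buf_free(), recycle_buffer() and rx_recycle_or_free() are hypothetical names, and the hv_ok flag merely simulates whether the hypervisor call succeeded.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

static int live_buffers;   /* buffers allocated and not yet freed */

/* One-slot pool standing in for the buffers the hypervisor holds. */
struct rx_pool {
    void *slot;
};

static void *buf_alloc(void)
{
    void *p = malloc(64);

    if (p)
        live_buffers++;
    return p;
}

static void buf_free(void *p)
{
    if (p) {
        free(p);
        live_buffers--;
    }
}

/*
 * Try to hand a buffer back to the pool. Returns 1 if the pool took
 * ownership, 0 if the (simulated) hypervisor call failed and the caller
 * still owns the buffer.
 */
static int recycle_buffer(struct rx_pool *pool, void *buf, bool hv_ok)
{
    if (!hv_ok)
        return 0;
    pool->slot = buf;
    return 1;
}

/* Caller side: a buffer that could not be recycled must be freed. */
static void rx_recycle_or_free(struct rx_pool *pool, void *buf, bool hv_ok)
{
    if (!recycle_buffer(pool, buf, hv_ok))
        buf_free(buf);
}
```

Returning a status from the recycle step lets the caller decide what to do with a rejected buffer; without it, a failed call would leave the buffer with no owner.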
^ permalink raw reply
* [PATCH] tcp: bound RTO to minimum
From: Hagen Paul Pfeifer @ 2011-08-24 23:41 UTC (permalink / raw)
To: netdev; +Cc: eric.dumazet, Hagen Paul Pfeifer
In-Reply-To: <1314226834.6797.5.camel@edumazet-laptop>
Check whether the calculated RTO is less than TCP_RTO_MIN. If it is, we
adjust the value to TCP_RTO_MIN.
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
---
include/net/tcp.h | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 149a415..9b5f4bf 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -520,6 +520,8 @@ static inline void tcp_bound_rto(const struct sock *sk)
{
if (inet_csk(sk)->icsk_rto > TCP_RTO_MAX)
inet_csk(sk)->icsk_rto = TCP_RTO_MAX;
+ else if (inet_csk(sk)->icsk_rto < TCP_RTO_MIN)
+ inet_csk(sk)->icsk_rto = TCP_RTO_MIN;
}
static inline u32 __tcp_set_rto(const struct tcp_sock *tp)
--
1.7.4.1.57.g0466.dirty
^ permalink raw reply related
* Re: [PATCH] tcp: bound RTO to minimum
From: Hagen Paul Pfeifer @ 2011-08-24 23:43 UTC (permalink / raw)
To: netdev; +Cc: eric.dumazet, Ilpo Järvinen
In-Reply-To: <1314229310-8074-1-git-send-email-hagen@jauu.net>
This should do the trick. Eric, Ilpo?
Hagen
^ permalink raw reply