Netdev List
 help / color / mirror / Atom feed
From: Neil Horman <nhorman@tuxdriver.com>
To: Satoru Moriya <satoru.moriya@hds.com>
Cc: "netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"davem@davemloft.net" <davem@davemloft.net>,
	"dle-develop@lists.sourceforge.net"
	<dle-develop@lists.sourceforge.net>,
	Seiji Aguchi <seiji.aguchi@hds.com>
Subject: Re: [RFC][PATCH] add tracepoint to __sk_mem_schedule
Date: Wed, 15 Jun 2011 07:07:23 -0400	[thread overview]
Message-ID: <20110615110723.GA23380@hmsreliant.think-freely.org> (raw)
In-Reply-To: <65795E11DBF1E645A09CEC7EAEE94B9C3FBC0707@USINDEVS02.corp.hds.com>

On Tue, Jun 14, 2011 at 03:24:14PM -0400, Satoru Moriya wrote:
> Hi,
> 
> kernel drops packets when the amount of memory which is used for socket buffer
> exceeds limitations such as /proc/sys/net/ipv4/udp_mem. But currently we can't
> catch that event and know why packets are dropped. And also it is difficult to
> configure sysctl knob appropriately because we don't know when/why packets
> dropped.
> 
There are several ways to do this already.  Every drop that occurs in the stack
should have a corresponding statistical counter exposed for it, and we also have
a tracepoint in kfree_skb that the dropwatch system monitors to inform us of
dropped packets in a certralized fashion.  Not saying this tracepoint isn't
worthwhile, only that it covers already covered ground.

> This patch adds tracepoint to __sk_mem_schedule(), which is called each time
> the socket memory usage exceeds limitations and kernel drops a packet.
> It allows us to hook in and examine when and why it happens.
> 
> Note that this patch only collects information which is needed for udp
> because it's a RFC patch to show its concept and acutually we need it(*).
> If you guys need to get other parameters, please let me know. I'll add it.
> 
> (*) Reason why we need this tracepoint for UDP
> Transaction data is sent by UDP multicast in finance systems because of its
> low overhead characteristics. UDP itself does not guarantee reliability,
> ordering and data integrity, but the system is designed not to drop any packets
> even when it is high load situation. And in that system if kernel drops packets,
> we need to find a root cause to avoid it next time.
> 
Again, this is why dropwatch exists.  UDP gets into this path from:
__udp_queue_rcv_skb
 ip_queue_rcv_skb
  sock_queue_rcv_skb
   sk_rmem_schedule
    __sk_mem_schedule

If ip_queue_rcv_skb fails we increment the UDP_MIB_RCVBUFERRORS counter as well
as the UDP_MIB_INERRORS counter, and on the kfree_skb call after those
increments, dropwatch will report the frame loss and the fact that it occured in
__udp_queue_rcv_skb

I still think its an interesting tracepoint, just because it might be nice to
know which sockets are expanding their snd/rcv buffer space, but why not modify
the tracepoint so that it accepts the return code of __sk_mem_schedule and call
it from both sk_rmem_schedule and sk_wmem_schedule.   That way you can use the
tracepoint to record both successfull expansion and failed expansions.
Neil
 
> Any comments are welcome.
> 
> Signed-off-by: Satoru Moriya <satoru.moriya@hds.com>
> ---
>  include/trace/events/sock.h |   46 +++++++++++++++++++++++++++++++++++++++++++
>  net/core/net-traces.c       |    1 +
>  net/core/sock.c             |    4 +++
>  3 files changed, 51 insertions(+), 0 deletions(-)
>  create mode 100644 include/trace/events/sock.h
> 
> diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
> new file mode 100644
> index 0000000..409735a
> --- /dev/null
> +++ b/include/trace/events/sock.h
> @@ -0,0 +1,46 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM sock
> +
> +#if !defined(_TRACE_SOCK_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_SOCK_H
> +
> +#include <net/sock.h>
> +#include <linux/tracepoint.h>
> +
> +TRACE_EVENT(sock_exceed_buf_limit,
> +
> +	TP_PROTO(struct sock *sk, struct proto *prot, long allocated),
> +
> +	TP_ARGS(sk, prot, allocated),
> +
> +	TP_STRUCT__entry(
> +		__array(char, name, 32)
> +		__field(long *, sysctl_mem)
> +		__field(long, allocated)
> +		__field(int, sysctl_rmem)
> +		__field(int, rmem_alloc)
> +	),
> +
> +	TP_fast_assign(
> +		strncpy(__entry->name, prot->name, 32);
> +		__entry->sysctl_mem = prot->sysctl_mem;
> +		__entry->allocated = allocated;
> +		__entry->sysctl_rmem = atomic_read(&sk->sk_rmem_alloc);
> +		__entry->rmem_alloc = prot->sysctl_rmem[0];
> +	),
> +
> +	TP_printk("proto:%s sysctl_mem=%ld,%ld,%ld allocated=%ld "
> +		"sysctl_rmem=%d rmem_alloc=%d",
> +		__entry->name,
> +		__entry->sysctl_mem[0],
> +		__entry->sysctl_mem[1],
> +		__entry->sysctl_mem[2],
> +		__entry->allocated,
> +		__entry->sysctl_rmem,
> +		__entry->rmem_alloc)
> +);
> +
> +#endif /* _TRACE_SOCK_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
> diff --git a/net/core/net-traces.c b/net/core/net-traces.c
> index 7f1bb2a..b9756f5 100644
> --- a/net/core/net-traces.c
> +++ b/net/core/net-traces.c
> @@ -28,6 +28,7 @@
>  #include <trace/events/skb.h>
>  #include <trace/events/net.h>
>  #include <trace/events/napi.h>
> +#include <trace/events/sock.h>
>  
>  EXPORT_TRACEPOINT_SYMBOL_GPL(kfree_skb);
>  
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 6e81978..8389032 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -128,6 +128,8 @@
>  
>  #include <linux/filter.h>
>  
> +#include <trace/events/sock.h>
> +
>  #ifdef CONFIG_INET
>  #include <net/tcp.h>
>  #endif
> @@ -1736,6 +1738,8 @@ suppress_allocation:
>  			return 1;
>  	}
>  
> +	trace_sock_exceed_buf_limit(sk, prot, allocated);
> +
>  	/* Alas. Undo changes. */
>  	sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM;
>  	atomic_long_sub(amt, prot->memory_allocated);
> -- 
> 1.7.1
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

  reply	other threads:[~2011-06-15 11:07 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-06-14 19:24 [RFC][PATCH] add tracepoint to __sk_mem_schedule Satoru Moriya
2011-06-15 11:07 ` Neil Horman [this message]
2011-06-15 19:18   ` Satoru Moriya
2011-06-15 20:04     ` Neil Horman
2011-06-15 20:15       ` Satoru Moriya

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110615110723.GA23380@hmsreliant.think-freely.org \
    --to=nhorman@tuxdriver.com \
    --cc=davem@davemloft.net \
    --cc=dle-develop@lists.sourceforge.net \
    --cc=netdev@vger.kernel.org \
    --cc=satoru.moriya@hds.com \
    --cc=seiji.aguchi@hds.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox