public inbox for b.a.t.m.a.n@lists.open-mesh.org
 help / color / mirror / Atom feed
From: "Linus Lüssing" <linus.luessing@web.de>
To: The list for a Better Approach To Mobile Ad-hoc Networking
	<b.a.t.m.a.n@lists.open-mesh.org>
Subject: Re: [B.A.T.M.A.N.] [PATCHv3] batman-adv: Limit queue lengths for batman and broadcast	packets
Date: Sun, 11 Apr 2010 01:56:08 +0200	[thread overview]
Message-ID: <20100410235608.GA12381@Linus-Debian> (raw)
In-Reply-To: <20100407081123.GA2378@pandem0nium>

[-- Attachment #1: Type: text/plain, Size: 15190 bytes --]

Okay, this third version seems to get rid of the nasty OOMs very
well and after stopping the ping-broadcast on Node A in a setup of
A - B - C, the broadcast storm stops nearly immediately.

However, in my VM setup (no matter if qemu or kvm) I'm getting a
soft-lockup on B. After about 10 seconds I'm not able to send any
commands to B via ssh or via a serial console. But if I stop the
ping broadcast flood, B suddenly receives the input I've been
sending minutes before during the broadcast.

The kernel is outputting the following trace after 60 seconds of
soft-lockup:
--------------------------------------------------------------
mesh-node@mesh-node1:~$ uptime
 08:37:11 up 57 min,  2 users,  load average: 4.38, 4.40, 3.30
mesh-node@mesh-node1:~$ uptime
 08:37:12 up 57 min,  2 users,  load average: 4.38, 4.40, 3.30
mesh-node@mesh-node1:~$ uptime[ 3508.988304] BUG: soft lockup - CPU#0 stuck for 61s! [bat_events:3034]
[ 3508.988304] Modules linked in: batman_adv ipv6 loop button psmouse parport_pc parport i2c_piix4 i2c_core snd_pcsp serio_raw snd_pcm snd_timer snd e1000 soundcore snd_page_alloc evdev floppy ext3 jbd mbcache ide_cd_mod cdrom thermal processor fan thermal_sys ide_disk ide_pci_generic piix ide_core ata_generic libata scsi_mod dock [last unloaded: batman_adv]
[ 3508.988304]
[ 3508.988304] Pid: 3034, comm: bat_events Not tainted (2.6.26-2-686 #1)
[ 3508.988304] EIP: 0060:[<d091bc11>] EFLAGS: 00000286 CPU: 0
[ 3508.988304] EIP is at e1000_clean_tx_irq+0xfc/0x263 [e1000]
[ 3508.988304] EAX: 0000009c EBX: 0000009c ECX: 0000f601 EDX: cf23bba0
[ 3508.988304] ESI: cf23bba0 EDI: cfb5d3e0 EBP: cf3019c0 ESP: cf93bef0
[ 3508.988304]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[ 3508.988304] CR0: 8005003b CR2: 081eb008 CR3: 0f3d1000 CR4: 00000690
[ 3508.988304] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 3508.988304] DR6: 00000000 DR7: 00000000
[ 3508.988304]  [<d091e752>] ? e1000_clean+0x3c/0x216 [e1000]
[ 3508.988304]  [<c025a8af>] ? net_rx_action+0x9c/0x1b9
[ 3508.988304]  [<c0126629>] ? __do_softirq+0x66/0xd3
[ 3508.988304]  [<c01266db>] ? do_softirq+0x45/0x53
[ 3508.988304]  [<c0126912>] ? local_bh_enable+0x66/0x71
[ 3508.988304]  [<c025b279>] ? dev_queue_xmit+0x287/0x2be
[ 3508.988304]  [<d09b52bf>] ? send_outstanding_bcast_packet+0x0/0xa5 [batman_adv]
[ 3508.988304]  [<d09b531e>] ? send_outstanding_bcast_packet+0x5f/0xa5 [batman_adv]
[ 3508.988304]  [<c012f03e>] ? run_workqueue+0x74/0xf2
[ 3508.988304]  [<c012f719>] ? worker_thread+0x0/0xbd
[ 3508.988304]  [<c012f7cc>] ? worker_thread+0xb3/0xbd
[ 3508.988304]  [<c01319d4>] ? autoremove_wake_function+0x0/0x2d
[ 3508.988304]  [<c0131913>] ? kthread+0x38/0x5d
[ 3508.988304]  [<c01318db>] ? kthread+0x0/0x5d
[ 3508.988304]  [<c01044f7>] ? kernel_thread_helper+0x7/0x10
[ 3508.988304]  =======================

 08:38:55 up 59 min,  2 users,  load average: 8.59, 5.64, 3.84
mesh-node@mesh-node1:~$
mesh-node@mesh-node1:~$
mesh-node@mesh-node1:~$
--------------------------------------------------------------

I'm not sure yet, whether the issue is just that
send_broadcast_packet() is being called so often (~7000 times/
second) or if it is locking something somewhere for too long or
both.

I tried to investigate a little further what might take most of
the time send_outstanding_packets(). There I noticed, that at this
high burden send_skb_packet() in send_outstanding_packets() needs
most of the time: ~12 times (of ~7000) it needs more then 20ms
(~80ms usually) during one second. So those ~12 strange times are
holding the rcu_lock for 99% of the time on node B. Could this be
the cause for the soft-lockups? what is still confusing me is, if
it's send_skb_packet() and not send_outstanding_packets() itself
that takes so long, why isn't send_skb_packet() showing up in the
trace?

I've been using the following patch for the measurments on top of
rev 1616 + Simon's seqno patch v2 + Simon's limit patch v3:

--------------------------------------------------------------
--- a/batman-adv-kernelland/send.c
+++ b/batman-adv-kernelland/send.c
@@ -431,6 +431,7 @@ int add_bcast_packet_to_list(struct sk_buff *skb)

 void send_outstanding_bcast_packet(struct work_struct *work)
 {
+unsigned long start, timespent;
        struct batman_if *batman_if;
        struct delayed_work *delayed_work =
                container_of(work, struct delayed_work, work);
@@ -449,8 +450,15 @@ void send_outstanding_bcast_packet(struct work_struct *work)
                /* send a copy of the saved skb */
                skb1 = skb_copy(forw_packet->skb, GFP_ATOMIC);
                if (skb1)
+{
+start = jiffies;
                        send_skb_packet(skb1,
                                batman_if, broadcastAddr);
+timespent = jiffies - start;
+if(printk_ratelimit() || timespent > 5) {
+printk("Took %d jiffies (%d ms)\n", timespent, (timespent*1000)/HZ);
+}
+}
        }
        rcu_read_unlock();

--------------------------------------------------------------

The resulting output on node B is the following:
(hmm, okay, in this output it's about 8, not 12 times per second as it
had been in previous tests)
--------------------------------------------------------------
[ 4164.437463] Took 19 jiffies (76 ms)
[ 4164.552022] Took 22 jiffies (88 ms)
[ 4164.654244] Took 20 jiffies (80 ms)
[ 4164.758564] Took 20 jiffies (80 ms)
[ 4164.858450] Took 19 jiffies (76 ms)
[ 4164.962935] Took 22 jiffies (88 ms)
[ 4165.067755] Took 19 jiffies (76 ms)
[ 4165.172166] Took 23 jiffies (92 ms)
[ 4165.276810] Took 19 jiffies (76 ms)
[ 4165.381937] Took 19 jiffies (76 ms)
[ 4165.487952] Took 18 jiffies (72 ms)
[ 4165.591722] Took 19 jiffies (76 ms)
[ 4165.696384] Took 19 jiffies (76 ms)
[ 4165.800138] Took 19 jiffies (76 ms)
[ 4165.903559] Took 19 jiffies (76 ms)
[ 4166.007544] Took 22 jiffies (88 ms)
[ 4166.116471] Took 20 jiffies (80 ms)
[ 4169.372036] __ratelimit: 9633 messages suppressed
[ 4169.372036] Took 1 jiffies (4 ms)
[ 4169.452787] Took 20 jiffies (80 ms)
[ 4169.558822] Took 21 jiffies (84 ms)
[ 4169.662983] Took 20 jiffies (80 ms)
[ 4171.180022] Took 6 jiffies (24 ms)
[ 4171.281675] Took 20 jiffies (80 ms)
[ 4171.387203] Took 20 jiffies (80 ms)
[ 4171.486460] Took 19 jiffies (76 ms)
[ 4174.370508] __ratelimit: 12882 messages suppressed
[ 4174.372021] Took 1 jiffies (4 ms)
[ 4174.460143] Took 21 jiffies (84 ms)
[ 4174.564234] Took 23 jiffies (92 ms)
[ 4174.668185] Took 19 jiffies (76 ms)
[ 4174.772064] Took 19 jiffies (76 ms)
[ 4174.876539] Took 18 jiffies (72 ms)
[ 4174.984752] Took 23 jiffies (92 ms)
[ 4175.088186] Took 20 jiffies (80 ms)
[ 4175.200019] Took 21 jiffies (84 ms)
[ 4175.300615] Took 20 jiffies (80 ms)
[ 4175.404180] Took 20 jiffies (80 ms)
[ 4175.506979] Took 21 jiffies (84 ms)
[ 4175.611148] Took 21 jiffies (84 ms)
[ 4175.712018] Took 19 jiffies (76 ms)
[ 4175.816150] Took 22 jiffies (88 ms)
[ 4175.921259] Took 19 jiffies (76 ms)
[ 4176.026627] Took 20 jiffies (80 ms)
[ 4176.135124] Took 20 jiffies (80 ms)
[ 4176.243757] Took 21 jiffies (84 ms)
[ 4176.349231] Took 21 jiffies (84 ms)
[ 4176.444639] Took 18 jiffies (72 ms)
[ 4176.543197] Took 18 jiffies (72 ms)
[ 4176.643360] Took 18 jiffies (72 ms)
[ 4176.741147] Took 17 jiffies (68 ms)
[ 4176.840095] Took 18 jiffies (72 ms)
[ 4176.938419] Took 18 jiffies (72 ms)
[ 4177.036107] Took 17 jiffies (68 ms)
[ 4177.139641] Took 18 jiffies (72 ms)
--------------------------------------------------------------

Cheers, Linus


On Wed, Apr 07, 2010 at 10:11:23AM +0200, Simon Wunderlich wrote:
> This patch limits the queue lengths of batman and broadcast packets. BATMAN
> packets are held back for aggregation and jittered to avoid interferences.
> Broadcast packets are stored to be sent out multiple times to increase
> the probability to be received by other nodes in lossy environments.
> 
> Especially in extreme cases like broadcast storms, the queues have been seen
> to run full, eating up all the memory and triggering the infamous OOM killer.
> With the queue length limits introduced in this patch, this problem is
> avoided.
> 
> Each queue is limited to 256 entries for now, resulting in 1 MB of maximum
> space available in total for typical setups (assuming one packet including
> overhead does not require more than 2000 byte). This should also be reasonable
> for smaller routers, otherwise the defines can be tweaked later.
> 
> This third version of the patch does not increase the local broadcast 
> sequence number when the queue is already full.
> 
> Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de>
> ---
> Index: a/batman-adv-kernelland/send.c
> ===================================================================
> --- a/batman-adv-kernelland/send.c	(revision 1616)
> +++ a/batman-adv-kernelland/send.c	(working copy)
> @@ -378,18 +378,35 @@
>  			   send_time);
>  }
>  
> -void add_bcast_packet_to_list(struct sk_buff *skb)
> +#define atomic_dec_not_zero(v)          atomic_add_unless((v), -1, 0)
> +/* add a broadcast packet to the queue and setup timers. broadcast packets
> + * are sent multiple times to increase probability for beeing received.
> + *
> + * This function returns NETDEV_TX_OK on success and NETDEV_TX_BUSY on
> + * errors.
> + *
> + * The skb is not consumed, so the caller should make sure that the
> + * skb is freed. */
> +int add_bcast_packet_to_list(struct sk_buff *skb)
>  {
>  	struct forw_packet *forw_packet;
>  
> +	if (!atomic_dec_not_zero(&bcast_queue_left)) {
> +		bat_dbg(DBG_BATMAN, "bcast packet queue full\n");
> +		return NETDEV_TX_BUSY;
> +	}
> +
>  	forw_packet = kmalloc(sizeof(struct forw_packet), GFP_ATOMIC);
> -	if (!forw_packet)
> -		return;
> +	if (!forw_packet) {
> +		atomic_inc(&bcast_queue_left);
> +		return NETDEV_TX_BUSY;
> +	}
>  
>  	skb = skb_copy(skb, GFP_ATOMIC);
>  	if (!skb) {
> +		atomic_inc(&bcast_queue_left);
>  		kfree(forw_packet);
> -		return;
> +		return NETDEV_TX_BUSY;
>  	}
>  
>  	skb_reset_mac_header(skb);
> @@ -401,6 +418,7 @@
>  	forw_packet->num_packets = 0;
>  
>  	_add_bcast_packet_to_list(forw_packet, 1);
> +	return NETDEV_TX_OK;
>  }
>  
>  void send_outstanding_bcast_packet(struct work_struct *work)
> @@ -435,8 +453,10 @@
>  	if ((forw_packet->num_packets < 3) &&
>  	    (atomic_read(&module_state) != MODULE_DEACTIVATING))
>  		_add_bcast_packet_to_list(forw_packet, ((5 * HZ) / 1000));
> -	else
> +	else {
>  		forw_packet_free(forw_packet);
> +		atomic_inc(&bcast_queue_left);
> +	}
>  }
>  
>  void send_outstanding_bat_packet(struct work_struct *work)
> @@ -462,6 +482,10 @@
>  	    (atomic_read(&module_state) != MODULE_DEACTIVATING))
>  		schedule_own_packet(forw_packet->if_incoming);
>  
> +	/* don't count own packet */
> +	if (!forw_packet->own)
> +		atomic_inc(&batman_queue_left);
> +
>  	forw_packet_free(forw_packet);
>  }
>  
> Index: a/batman-adv-kernelland/send.h
> ===================================================================
> --- a/batman-adv-kernelland/send.h	(revision 1616)
> +++ a/batman-adv-kernelland/send.h	(working copy)
> @@ -33,7 +33,7 @@
>  			     struct batman_packet *batman_packet,
>  			     uint8_t directlink, int hna_buff_len,
>  			     struct batman_if *if_outgoing);
> -void add_bcast_packet_to_list(struct sk_buff *skb);
> +int  add_bcast_packet_to_list(struct sk_buff *skb);
>  void send_outstanding_bcast_packet(struct work_struct *work);
>  void send_outstanding_bat_packet(struct work_struct *work);
>  void purge_outstanding_packets(void);
> Index: a/batman-adv-kernelland/soft-interface.c
> ===================================================================
> --- a/batman-adv-kernelland/soft-interface.c	(revision 1616)
> +++ a/batman-adv-kernelland/soft-interface.c	(working copy)
> @@ -227,10 +227,10 @@
>  		/* set broadcast sequence number */
>  		bcast_packet->seqno = htons(bcast_seqno);
>  
> -		bcast_seqno++;
> +		/* broadcast packet. on success, increase seqno. */
> +		if (add_bcast_packet_to_list(skb) == NETDEV_TX_OK)
> +			bcast_seqno++;
>  
> -		/* broadcast packet */
> -		add_bcast_packet_to_list(skb);
>  		/* a copy is stored in the bcast list, therefore removing
>  		 * the original skb. */
>  		kfree_skb(skb);
> Index: a/batman-adv-kernelland/main.c
> ===================================================================
> --- a/batman-adv-kernelland/main.c	(revision 1616)
> +++ a/batman-adv-kernelland/main.c	(working copy)
> @@ -46,6 +46,9 @@
>  
>  atomic_t originator_interval;
>  atomic_t vis_interval;
> +atomic_t bcast_queue_left;
> +atomic_t batman_queue_left;
> +
>  int16_t num_hna;
>  int16_t num_ifs;
>  
> @@ -85,6 +88,8 @@
>  	atomic_set(&originator_interval, 1000);
>  	atomic_set(&vis_interval, 1000);/* TODO: raise this later, this is only
>  					 * for debugging now. */
> +	atomic_set(&bcast_queue_left, BCAST_QUEUE_LEN);
> +	atomic_set(&batman_queue_left, BATMAN_QUEUE_LEN);
>  
>  	/* the name should not be longer than 10 chars - see
>  	 * http://lwn.net/Articles/23634/ */
> Index: a/batman-adv-kernelland/aggregation.c
> ===================================================================
> --- a/batman-adv-kernelland/aggregation.c	(revision 1616)
> +++ a/batman-adv-kernelland/aggregation.c	(working copy)
> @@ -95,6 +95,7 @@
>  	return false;
>  }
>  
> +#define atomic_dec_not_zero(v)          atomic_add_unless((v), -1, 0)
>  /* create a new aggregated packet and add this packet to it */
>  static void new_aggregated_packet(unsigned char *packet_buff,
>  			   int packet_len,
> @@ -106,13 +107,26 @@
>  	struct forw_packet *forw_packet_aggr;
>  	unsigned long flags;
>  
> +	/* own packet should always be scheduled */
> +	if (!own_packet) {
> +		if (!atomic_dec_not_zero(&batman_queue_left)) {
> +			bat_dbg(DBG_BATMAN, "batman packet queue full\n");
> +			return;
> +		}
> +	}
> +
>  	forw_packet_aggr = kmalloc(sizeof(struct forw_packet), GFP_ATOMIC);
> -	if (!forw_packet_aggr)
> +	if (!forw_packet_aggr) {
> +		if (!own_packet)
> +			atomic_inc(&batman_queue_left);
>  		return;
> +	}
>  
>  	forw_packet_aggr->packet_buff = kmalloc(MAX_AGGREGATION_BYTES,
>  						GFP_ATOMIC);
>  	if (!forw_packet_aggr->packet_buff) {
> +		if (!own_packet)
> +			atomic_inc(&batman_queue_left);
>  		kfree(forw_packet_aggr);
>  		return;
>  	}
> Index: a/batman-adv-kernelland/main.h
> ===================================================================
> --- a/batman-adv-kernelland/main.h	(revision 1616)
> +++ a/batman-adv-kernelland/main.h	(working copy)
> @@ -70,6 +70,8 @@
>  #define MODULE_ACTIVE 1
>  #define MODULE_DEACTIVATING 2
>  
> +#define BCAST_QUEUE_LEN		256
> +#define BATMAN_QUEUE_LEN 	256
>  
>  /*
>   * Debug Messages
> @@ -133,6 +135,8 @@
>  
>  extern atomic_t originator_interval;
>  extern atomic_t vis_interval;
> +extern atomic_t bcast_queue_left;
> +extern atomic_t batman_queue_left;
>  extern int16_t num_hna;
>  extern int16_t num_ifs;
>  
> 

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

  reply	other threads:[~2010-04-10 23:56 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-04-05  2:46 [B.A.T.M.A.N.] [PATCH] batman-adv: Limit queue lengths for batman and broadcast packets Simon Wunderlich
2010-04-05 11:21 ` Sven Eckelmann
2010-04-05 14:40   ` Simon Wunderlich
2010-04-05 15:06     ` [B.A.T.M.A.N.] [PATCHv2] " Simon Wunderlich
2010-04-07  8:11       ` [B.A.T.M.A.N.] [PATCHv3] " Simon Wunderlich
2010-04-10 23:56         ` Linus Lüssing [this message]
2010-04-11  0:42           ` Linus Lüssing
2010-04-12 19:15           ` Simon Wunderlich

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100410235608.GA12381@Linus-Debian \
    --to=linus.luessing@web.de \
    --cc=b.a.t.m.a.n@lists.open-mesh.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox