netdev.vger.kernel.org archive mirror
* thousands of classes, e1000 TX unit hang
@ 2008-08-05  7:47 Denys Fedoryshchenko
  2008-08-05  8:06 ` Denys Fedoryshchenko
  2008-08-06  1:13 ` Brandeburg, Jesse
  0 siblings, 2 replies; 14+ messages in thread
From: Denys Fedoryshchenko @ 2008-08-05  7:47 UTC (permalink / raw)
  To: netdev

I wrote a script that looks something like this (to simulate SFQ with the
flow classifier):

# $2 is the ppp interface
echo "qdisc del dev $2 root ">>${TEMP}
echo "qdisc add dev $2 root handle 1: htb ">>${TEMP}
 echo "filter add dev $2 protocol ip pref 16 parent 1: u32 \
	match ip dst 0.0.0.0/0 police rate 8kbit burst 2048kb \
	peakrate 1024Kbit mtu 10000 \
	conform-exceed continue/ok">>${TEMP}

echo "filter add dev $2 protocol ip pref 32 parent 1: handle 1 \
	flow hash keys nfct divisor 128 baseclass 1:2">>${TEMP}

echo "class add dev $2 parent 1: classid 1:1 htb \
	rate ${rate}bit ceil ${rate}Kbit quantum 1514">>${TEMP}

#Cycle to add 128 classes
maxslot=130
for slot in `seq 2 $maxslot`; do
echo "class add dev $2 parent 1:1 classid 1:$slot htb \
	rate 8Kbit ceil 256Kbit quantum 1514">>${TEMP}
echo "qdisc add dev $2 handle $slot: parent 1:$slot bfifo limit 3000">>${TEMP}
done
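(For completeness, a sketch of how the accumulated ${TEMP} file would then
be applied; the original mail does not show this step, and the batch call
assumes an iproute2 new enough to have tc's batch mode:)

TEMP=$(mktemp) || exit 1
# ... the echo lines above are appended once per ppp interface ...
tc -batch "${TEMP}"     # one tc invocation instead of hundreds of execs
rm -f "${TEMP}"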

After adding around 400-450 (ppp) interfaces the server starts to "crack".
There is of course packet loss towards eth0 (even though there are no
filters or shapers on it). Even deleting all the classes becomes a
challenge. After deleting all the root handles on the ppp interfaces it
becomes ok again.


Traffic through the host is 15-20 Mbit/s at that moment. It is a single-CPU
Xeon 3.0 GHz on an SE7520 server motherboard with 1 GB of RAM (more than
512 MB was free at the time of testing).

Kernel is vanilla 2.6.26.1.
Is there anything else I should add to this report?

Error message appearing in dmesg:
[149650.006939] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
[149650.006943]   Tx Queue             <0>
[149650.006944]   TDH                  <a3>
[149650.006945]   TDT                  <a3>
[149650.006947]   next_to_use          <a3>
[149650.006948]   next_to_clean        <f8>
[149650.006949] buffer_info[next_to_clean]
[149650.006951]   time_stamp           <8e69a7c>
[149650.006952]   next_to_watch        <f8>
[149650.006953]   jiffies              <8e6a111>
[149650.006954]   next_to_watch.status <1>
[149655.964100] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
[149655.964104]   Tx Queue             <0>
[149655.964105]   TDH                  <6c>
[149655.964107]   TDT                  <6c>
[149655.964108]   next_to_use          <6c>
[149655.964109]   next_to_clean        <c1>
[149655.964111] buffer_info[next_to_clean]
[149655.964112]   time_stamp           <8e6b198>
[149655.964113]   next_to_watch        <c1>
[149655.964115]   jiffies              <8e6b853>
[149655.964116]   next_to_watch.status <1>
[149666.765110] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
[149666.765110]   Tx Queue             <0>
[149666.765110]   TDH                  <28>
[149666.765110]   TDT                  <28>
[149666.765110]   next_to_use          <28>
[149666.765110]   next_to_clean        <7e>
[149666.765110] buffer_info[next_to_clean]
[149666.765110]   time_stamp           <8e6db6a>
[149666.765110]   next_to_watch        <7e>
[149666.765110]   jiffies              <8e6e27f>
[149666.765110]   next_to_watch.status <1>
[149668.629051] e1000: eth1: e1000_clean_tx_irq: Detected Tx Unit Hang
[149668.629056]   Tx Queue             <0>
[149668.629058]   TDH                  <1b>
[149668.629060]   TDT                  <1b>
[149668.629062]   next_to_use          <1b>
[149668.629064]   next_to_clean        <f1>
[149668.629066] buffer_info[next_to_clean]
[149668.629068]   time_stamp           <8e6e4c3>
[149668.629070]   next_to_watch        <f1>
[149668.629072]   jiffies              <8e6e9c7>
[149668.629074]   next_to_watch.status <1>
[149676.606031] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
[149676.606035]   Tx Queue             <0>
[149676.606037]   TDH                  <9b>
[149676.606038]   TDT                  <9b>
[149676.606039]   next_to_use          <9b>
[149676.606040]   next_to_clean        <f0>
[149676.606042] buffer_info[next_to_clean]
[149676.606043]   time_stamp           <8e7024c>
[149676.606044]   next_to_watch        <f0>
[149676.606046]   jiffies              <8e708eb>
[149676.606047]   next_to_watch.status <1>
[149680.151750] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
[149680.151750]   Tx Queue             <0>
[149680.151750]   TDH                  <84>
[149680.151750]   TDT                  <84>
[149680.151750]   next_to_use          <84>
[149680.151750]   next_to_clean        <d9>
[149680.151750] buffer_info[next_to_clean]
[149680.151750]   time_stamp           <8e7100d>
[149680.151750]   next_to_watch        <d9>
[149680.151750]   jiffies              <8e716c3>
[149680.151750]   next_to_watch.status <1>
[149680.153751] e1000: eth1: e1000_clean_tx_irq: Detected Tx Unit Hang
[149680.153751]   Tx Queue             <0>
[149680.153751]   TDH                  <aa>
[149680.153751]   TDT                  <d2>
[149680.153751]   next_to_use          <d2>
[149680.153751]   next_to_clean        <2d>
[149680.153751] buffer_info[next_to_clean]
[149680.153751]   time_stamp           <8e710db>
[149680.153751]   next_to_watch        <2d>
[149680.153751]   jiffies              <8e716c5>
[149680.153751]   next_to_watch.status <1>
[149702.565549] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
[149702.565549]   Tx Queue             <0>
[149702.565549]   TDH                  <3c>
[149702.565549]   TDT                  <3c>
[149702.565549]   next_to_use          <3c>
[149702.565549]   next_to_clean        <91>
[149702.565549] buffer_info[next_to_clean]
[149702.565549]   time_stamp           <8e7676e>
[149702.565549]   next_to_watch        <91>
[149702.565549]   jiffies              <8e76e48>
[149702.565549]   next_to_watch.status <1>
[149708.020581] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
[149708.020581]   Tx Queue             <0>
[149708.020581]   TDH                  <4c>
[149708.020581]   TDT                  <4c>
[149708.020581]   next_to_use          <4c>
[149708.020581]   next_to_clean        <a1>
[149708.020581] buffer_info[next_to_clean]
[149708.020581]   time_stamp           <8e77cc3>
[149708.020581]   next_to_watch        <a1>
[149708.020581]   jiffies              <8e78394>
[149708.020581]   next_to_watch.status <1>
[149713.864829] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
[149713.864833]   Tx Queue             <0>
[149713.864835]   TDH                  <b0>
[149713.864836]   TDT                  <b0>
[149713.864837]   next_to_use          <b0>
[149713.864839]   next_to_clean        <5>
[149713.864840] buffer_info[next_to_clean]
[149713.864841]   time_stamp           <8e7937b>
[149713.864842]   next_to_watch        <5>
[149713.864844]   jiffies              <8e79a64>
[149713.864845]   next_to_watch.status <1>
[149759.710721] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
[149759.710726]   Tx Queue             <0>
[149759.710729]   TDH                  <88>
[149759.710730]   TDT                  <88>
[149759.710732]   next_to_use          <88>
[149759.710734]   next_to_clean        <dd>
[149759.710736] buffer_info[next_to_clean]
[149759.710738]   time_stamp           <8e8465c>
[149759.710740]   next_to_watch        <dd>
[149759.710742]   jiffies              <8e84d6f>
[149759.710744]   next_to_watch.status <1>
[149759.712712] e1000: eth1: e1000_clean_tx_irq: Detected Tx Unit Hang
[149759.712715]   Tx Queue             <0>
[149759.712717]   TDH                  <84>
[149759.712719]   TDT                  <90>
[149759.712721]   next_to_use          <90>
[149759.712723]   next_to_clean        <e5>
[149759.712725] buffer_info[next_to_clean]
[149759.712726]   time_stamp           <8e84782>
[149759.712728]   next_to_watch        <e5>
[149759.712730]   jiffies              <8e84d71>
[149759.712732]   next_to_watch.status <1>
[149768.334753] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
[149768.334757]   Tx Queue             <0>
[149768.334758]   TDH                  <92>
[149768.334760]   TDT                  <92>
[149768.334761]   next_to_use          <92>
[149768.334762]   next_to_clean        <e7>
[149768.334764] buffer_info[next_to_clean]
[149768.334765]   time_stamp           <8e86829>
[149768.334766]   next_to_watch        <e7>
[149768.334767]   jiffies              <8e86f1c>
[149768.334769]   next_to_watch.status <1>
[149776.537825] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
[149776.537825]   Tx Queue             <0>
[149776.537825]   TDH                  <4e>
[149776.537825]   TDT                  <4e>
[149776.537825]   next_to_use          <4e>
[149776.537825]   next_to_clean        <a3>
[149776.537825] buffer_info[next_to_clean]
[149776.537825]   time_stamp           <8e8882b>
[149776.537825]   next_to_watch        <a3>
[149776.537825]   jiffies              <8e88f21>
[149776.537825]   next_to_watch.status <1>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: thousands of classes, e1000 TX unit hang
  2008-08-05  7:47 thousands of classes, e1000 TX unit hang Denys Fedoryshchenko
@ 2008-08-05  8:06 ` Denys Fedoryshchenko
  2008-08-05 10:05   ` Denys Fedoryshchenko
  2008-08-06  1:13 ` Brandeburg, Jesse
  1 sibling, 1 reply; 14+ messages in thread
From: Denys Fedoryshchenko @ 2008-08-05  8:06 UTC (permalink / raw)
  To: netdev

A little bit more info:

I ran oprofile on another machine (which doesn't suffer as much, but I can
also notice drops on eth0 after adding around 100 interfaces). On the first
machine the clocksource is TSC; on the machine where I read the stats it is
acpi_pm.

CPU: P4 / Xeon with 2 hyper-threads, speed 3200.53 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not 
stopped) with a unit mask of 0x01 (mandatory) count 100000
GLOBAL_POWER_E...|
  samples|      %|
------------------
   973464 75.7644 vmlinux
    97703  7.6042 libc-2.6.1.so
    36166  2.8148 cls_fw
    18290  1.4235 nf_conntrack
    17946  1.3967 busybox
        GLOBAL_POWER_E...|

CPU: P4 / Xeon with 2 hyper-threads, speed 3200.53 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not 
stopped) with a unit mask of 0x01 (mandatory) count 100000
samples  %        symbol name
245545   23.1963  acpi_pm_read
143863   13.5905  __copy_to_user_ll
121269   11.4561  ioread16
58609     5.5367  gen_kill_estimator
40153     3.7932  ioread32
33923     3.2047  ioread8
16491     1.5579  arch_task_cache_init
16067     1.5178  sysenter_past_esp
11604     1.0962  find_get_page
10631     1.0043  est_timer
9038      0.8538  get_page_from_freelist
8681      0.8201  sk_run_filter
8077      0.7630  irq_entries_start
7711      0.7284  schedule
6451      0.6094  copy_to_user
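(Roughly the oprofile session that produces reports like the above, for
anyone reproducing this; the vmlinux path is an assumption:)

opcontrol --vmlinux=/usr/src/linux/vmlinux --start
# ... add/delete the classes to generate the load ...
opcontrol --dump
opreport                              # per-image summary (first table)
opreport -l /usr/src/linux/vmlinux    # per-symbol breakdown (second table)
opcontrol --shutdown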




On Tuesday 05 August 2008, Denys Fedoryshchenko wrote:
> I wrote a script that looks something like this (to simulate SFQ with the
> flow classifier):
>
> <snip>



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: thousands of classes, e1000 TX unit hang
  2008-08-05  8:06 ` Denys Fedoryshchenko
@ 2008-08-05 10:05   ` Denys Fedoryshchenko
  2008-08-05 11:04     ` Jarek Poplawski
  0 siblings, 1 reply; 14+ messages in thread
From: Denys Fedoryshchenko @ 2008-08-05 10:05 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 11579 bytes --]

I found that the packet loss happens when I am deleting/adding classes.
I attach the oprofile result as a file.

On Tuesday 05 August 2008, Denys Fedoryshchenko wrote:
> A little bit more info:
>
> <snip>



[-- Attachment #2: oprofile_result1.txt --]
[-- Type: text/plain, Size: 6113 bytes --]

CPU: P4 / Xeon with 2 hyper-threads, speed 3200.53 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 10000
samples  %        symbol name
348827   33.2260  gen_kill_estimator
124818   11.8890  acpi_pm_read
44388     4.2280  ioread16
35074     3.3408  __copy_to_user_ll
33251     3.1672  rtnl_fill_ifinfo
18708     1.7819  ioread32
18350     1.7478  netif_receive_skb
17045     1.6235  copy_to_user
15829     1.5077  __nla_reserve
15261     1.4536  page_fault
14644     1.3948  rtnl_dump_ifinfo
12348     1.1762  __nla_put
11149     1.0619  dev_queue_xmit
11077     1.0551  ioread8
8327      0.7932  sysenter_past_esp
7279      0.6933  est_timer
6631      0.6316  get_page_from_freelist
4863      0.4632  handle_mm_fault
4817      0.4588  csum_partial
4782      0.4555  strlen
4644      0.4423  find_vma
4422      0.4212  nla_put
4406      0.4197  ip_route_input
4200      0.4001  skb_put

Here are the details from opannotate:
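(Presumably produced by something along these lines; the vmlinux path is
an assumption:)

opannotate --assembly /usr/src/linux/vmlinux > annotated.txt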

c024dc30 <gen_kill_estimator>: /* gen_kill_estimator total: 266925 23.2740 */
               :c024dc30:       push   %ebp
     1 8.7e-05 :c024dc31:       mov    %esp,%ebp
               :c024dc33:       push   %edi
               :c024dc34:       xor    %edi,%edi
               :c024dc36:       push   %esi
               :c024dc37:       push   %ebx
               :c024dc38:       sub    $0x10,%esp
               :c024dc3b:       mov    %eax,0xffffffe8(%ebp)
               :c024dc3e:       mov    %edx,0xffffffe4(%ebp)
               :c024dc41:       movl   $0xc08630bc,0xfffffff0(%ebp)
               :c024dc48:       mov    0xfffffff0(%ebp),%eax
     1 8.7e-05 :c024dc4b:       cmpl   $0x0,(%eax)
    26  0.0023 :c024dc4e:       je     c024dcb6 <gen_kill_estimator+0x86>
     3 2.6e-04 :c024dc50:       mov    0xc(%eax),%ebx
     3 2.6e-04 :c024dc53:       mov    %edi,%eax
               :c024dc55:       shl    $0x5,%eax
               :c024dc58:       add    $0xc08630c8,%eax
     1 8.7e-05 :c024dc5d:       mov    (%ebx),%esi
     4 3.5e-04 :c024dc5f:       mov    %eax,0xffffffec(%ebp)
               :c024dc62:       jmp    c024dcb1 <gen_kill_estimator+0x81>
   782  0.0682 :c024dc64:       mov    0xffffffe4(%ebp),%eax
   390  0.0340 :c024dc67:       cmp    %eax,0xc(%ebx)
 12239  1.0672 :c024dc6a:       jne    c024dcad <gen_kill_estimator+0x7d>
               :c024dc6c:       mov    0xffffffe8(%ebp),%eax
               :c024dc6f:       cmp    %eax,0x8(%ebx)
               :c024dc72:       jne    c024dcad <gen_kill_estimator+0x7d>
               :c024dc74:       mov    $0xc036d620,%eax
               :c024dc79:       call   c02a7817 <_write_lock_bh>
     3 2.6e-04 :c024dc7e:       mov    $0xc036d620,%eax
               :c024dc83:       movl   $0x0,0x8(%ebx)
               :c024dc8a:       call   c02a78ac <_write_unlock_bh>
               :c024dc8f:       mov    0x4(%ebx),%edx
               :c024dc92:       mov    (%ebx),%eax
               :c024dc94:       mov    %edx,0x4(%eax)
               :c024dc97:       mov    %eax,(%edx)
               :c024dc99:       lea    0x2c(%ebx),%eax
               :c024dc9c:       mov    $0xc024dcc8,%edx
               :c024dca1:       movl   $0x200200,0x4(%ebx)
               :c024dca8:       call   c014279c <call_rcu>
  6905  0.6021 :c024dcad:       mov    %esi,%ebx
  1716  0.1496 :c024dcaf:       mov    (%esi),%esi
232351 20.2593 :c024dcb1:       cmp    0xffffffec(%ebp),%ebx
 12492  1.0892 :c024dcb4:       jne    c024dc64 <gen_kill_estimator+0x34>
     2 1.7e-04 :c024dcb6:       inc    %edi
     5 4.4e-04 :c024dcb7:       addl   $0x20,0xfffffff0(%ebp)
               :c024dcbb:       cmp    $0x6,%edi
               :c024dcbe:       jne    c024dc48 <gen_kill_estimator+0x18>
               :c024dcc0:       add    $0x10,%esp
               :c024dcc3:       pop    %ebx
               :c024dcc4:       pop    %esi
     1 8.7e-05 :c024dcc5:       pop    %edi

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: thousands of classes, e1000 TX unit hang
  2008-08-05 10:05   ` Denys Fedoryshchenko
@ 2008-08-05 11:04     ` Jarek Poplawski
  2008-08-05 11:13       ` Denys Fedoryshchenko
  0 siblings, 1 reply; 14+ messages in thread
From: Jarek Poplawski @ 2008-08-05 11:04 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: netdev

On 05-08-2008 12:05, Denys Fedoryshchenko wrote:
> I found that the packet loss happens when I am deleting/adding classes.
> I attach the oprofile result as a file.
...

Deleting estimators (gen_kill_estimator) isn't optimized for a large
number of them, and it's a known issue. Adding classes shouldn't be such
a problem, but maybe you could try to do that before adding the filters
that direct traffic to those classes.

Since you can control the rate with htb, I'm not sure you really need
policing: at least you could check whether removing it changes anything.
And I'm not sure: do these tx hangs happen only when classes are
added/deleted, or at other times too?
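(A sketch of that ordering, reusing the numbers from the original script;
the interface name and the 1024Kbit ceiling are placeholders, and the u32
policing filter is simply left out:)

# classes and leaf qdiscs first
tc qdisc add dev ppp0 root handle 1: htb
tc class add dev ppp0 parent 1: classid 1:1 htb \
	rate 1024Kbit ceil 1024Kbit quantum 1514
for slot in `seq 2 130`; do
	tc class add dev ppp0 parent 1:1 classid 1:$slot htb \
		rate 8Kbit ceil 256Kbit quantum 1514
	tc qdisc add dev ppp0 parent 1:$slot handle $slot: bfifo limit 3000
done
# only now attach the classifier that selects those classes
tc filter add dev ppp0 protocol ip pref 32 parent 1: handle 1 \
	flow hash keys nfct divisor 128 baseclass 1:2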

Jarek P.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: thousands of classes, e1000 TX unit hang
  2008-08-05 11:04     ` Jarek Poplawski
@ 2008-08-05 11:13       ` Denys Fedoryshchenko
  2008-08-05 12:23         ` Jarek Poplawski
  0 siblings, 1 reply; 14+ messages in thread
From: Denys Fedoryshchenko @ 2008-08-05 11:13 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: netdev

On Tuesday 05 August 2008, Jarek Poplawski wrote:
> On 05-08-2008 12:05, Denys Fedoryshchenko wrote:
> > I found that the packet loss happens when I am deleting/adding classes.
> > I attach the oprofile result as a file.
>
> ...
>
> Deleting of estimators (gen_kill_estimator) isn't optimized for
> a large number of them, and it's a known issue. Adding of classes
> shouldn't be such a problem, but maybe you could try to do this
> before adding filters directing to those classes.
>
> Since you can control rate with htb, I'm not sure you really need
> policing: at least you could try if removing this changes anything.
> And I'm not sure: do these tx hangs happen only when classes are
> added/deleted or otherwise too?
>
> Jarek P.

The policer creates a burst for me.
For example, the first 2 Mbytes (+ rate*time, if more precision is needed)
pass at high speed (1 Mbit); then, if the flow keeps using the maximum
bandwidth, it is throttled down to the HTB rate. When I tried to play with
the cburst/burst values in HTB I was not able to achieve the same results.
I can do the same with TBF and its peakrate/burst, but not with HTB.
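(For comparison, a sketch of the TBF form of that burst-then-throttle
behaviour; the figures are illustrative, not taken from the real setup:)

tc qdisc add dev ppp0 root tbf rate 256kbit burst 2mb limit 30kb \
	peakrate 1mbit mtu 1540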

It happens when the root qdisc is deleted (which holds around 130 child
classes). Probably gen_kill_estimator takes all the resources while I am
deleting the root class.
I did a test on a machine with 150 ppp interfaces (Pentium 4 3.2 GHz),
just deleting the root qdisc, and I got huge packet loss. When I am only
adding classes there is no significant packet loss.
It probably isn't right that deleting a qdisc on a ppp interface causes
packet loss on the whole system? Is there a workaround possible until
gen_kill_estimator is rewritten?

Sure, I can try to avoid "mass deleting" classes, but I think many people
will hit this bug, especially newbies who implement a "many class" setup.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: thousands of classes, e1000 TX unit hang
  2008-08-05 11:13       ` Denys Fedoryshchenko
@ 2008-08-05 12:23         ` Jarek Poplawski
  2008-08-05 13:02           ` Denys Fedoryshchenko
  2008-08-05 14:07           ` Denys Fedoryshchenko
  0 siblings, 2 replies; 14+ messages in thread
From: Jarek Poplawski @ 2008-08-05 12:23 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: netdev

On Tue, Aug 05, 2008 at 02:13:58PM +0300, Denys Fedoryshchenko wrote:
> On Tuesday 05 August 2008, Jarek Poplawski wrote:
> > On 05-08-2008 12:05, Denys Fedoryshchenko wrote:
> > > I found, that packetloss happening when i am deleting/adding classes.
> > > I attach result of oprofile as file.
> >
> > ...
> >
> > Deleting of estimators (gen_kill_estimator) isn't optimized for
> > a large number of them, and it's a known issue. Adding of classes
> > shouldn't be such a problem, but maybe you could try to do this
> > before adding filters directing to those classes.
> >
> > Since you can control rate with htb, I'm not sure you really need
> > policing: at least you could try if removing this changes anything.
> > And I'm not sure: do these tx hangs happen only when classes are
> > added/deleted or otherwise too?
> >
> > Jarek P.
> 
> The policer creates a burst for me.
> For example, the first 2 Mbytes (+ rate*time, if more precision is needed)
> pass at high speed (1 Mbit); then, if the flow keeps using the maximum
> bandwidth, it is throttled down to the HTB rate. When I tried to play with
> the cburst/burst values in HTB I was not able to achieve the same results.
> I can do the same with TBF and its peakrate/burst, but not with HTB.

Very interesting. Anyway, tbf doesn't use gen estimators, so you could
test whether it makes a big difference for you.

> 
> It happens when the root qdisc is deleted (which holds around 130 child
> classes). Probably gen_kill_estimator takes all the resources while I am
> deleting the root class.
> I did a test on a machine with 150 ppp interfaces (Pentium 4 3.2 GHz),
> just deleting the root qdisc, and I got huge packet loss. When I am only
> adding classes there is no significant packet loss.
> It probably isn't right that deleting a qdisc on a ppp interface causes
> packet loss on the whole system? Is there a workaround possible until
> gen_kill_estimator is rewritten?
> 
> Sure, I can try to avoid "mass deleting" classes, but I think many people
> will hit this bug, especially newbies who implement a "many class" setup.

Actually, gen_kill_estimator was rewritten already, but for some
reason it wasn't merged. Maybe there aren't many users with such a
number of classes, or they don't delete them; anyway, this subject isn't
reported to the list often (I remember it once). A workaround could
probably be to delete the individual classes (and filters) first, to give
away the lock and soft interrupts for a while, before deleting the root,
but I didn't test this. BTW, you are using quite long queues (3000), so it
would be interesting to make them shorter and check whether they don't add
to the problem (with retransmits).
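(One possible reading of that workaround, as an untested sketch with a
placeholder interface name: tear things down bottom-up in many small tc
calls instead of one root delete.)

DEV=ppp0
tc filter del dev $DEV parent 1: pref 16
tc filter del dev $DEV parent 1: pref 32
for slot in `seq 2 130`; do
	tc qdisc del dev $DEV parent 1:$slot
	tc class del dev $DEV classid 1:$slot
done
tc class del dev $DEV classid 1:1
tc qdisc del dev $DEV root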

Jarek P.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: thousands of classes, e1000 TX unit hang
  2008-08-05 12:23         ` Jarek Poplawski
@ 2008-08-05 13:02           ` Denys Fedoryshchenko
  2008-08-05 16:41             ` Jarek Poplawski
  2008-08-05 21:14             ` Jarek Poplawski
  2008-08-05 14:07           ` Denys Fedoryshchenko
  1 sibling, 2 replies; 14+ messages in thread
From: Denys Fedoryshchenko @ 2008-08-05 13:02 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: netdev

>
> Very interesting. Anyway, tbf doesn't use gen estimators, so you could
> test whether it makes a big difference for you.
I can prepare and provide details, with graphs, by private email, along
with the reasons why I do it this way. With TBF I think I cannot use the
flow classifier; if I am not wrong, I must attach classful disciplines.
I tried to attach just a pfifo, and it failed.

> Actually, gen_kill_estimator was rewritten already, but for some
> reason it wasn't merged. Maybe there isn't so much users with such a
> number of classes or they don't delete them, anyway this subject isn't
> reported often to the list (I remember once). Some workaround could be
> probably deleting individual classes (and filters) to give away a lock
> and soft interrupts for a while), before deleting the root, but I
> didn't test this. BTW, you are using quite long queues (3000), so there
> would be interesting to make them less and check if doesn't add to the
> problem (with retransmits).
>
> Jarek P.
Well, I am the first :-) Many people prefer to buy a Cisco and put a static
shaper on their NAS for each customer. With iproute2 it is possible to give
a MUCH better feeling of using the maximum bandwidth while, at the same
time, not feeling that the maximum has been reached. Since in some
applications I have many more flows than SFQ can handle (and its hashing is
especially inefficient), I need the flow classifier and a lot of classes.
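(For contrast, the stock per-leaf alternative would be the one-liner below,
shown only to illustrate what is being replaced; its built-in flow table is
the part that does not scale here:)

tc qdisc add dev ppp0 parent 1:1 handle 10: sfq perturb 10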

In my experience many users don't know much about the kernel mailing list,
and they are not used to reporting problems. Now I am collecting some
feedback from my friends and trying to reproduce and report the issues here.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: thousands of classes, e1000 TX unit hang
  2008-08-05 12:23         ` Jarek Poplawski
  2008-08-05 13:02           ` Denys Fedoryshchenko
@ 2008-08-05 14:07           ` Denys Fedoryshchenko
  2008-08-05 16:48             ` Jarek Poplawski
  1 sibling, 1 reply; 14+ messages in thread
From: Denys Fedoryshchenko @ 2008-08-05 14:07 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: netdev

On Tuesday 05 August 2008, Jarek Poplawski wrote:
> reported often to the list (I remember once). Some workaround could be
> probably deleting individual classes (and filters) to give away a lock
> and soft interrupts for a while), before deleting the root, but I
> didn't test this. 
Btw, even if I optimize my scripts, when a ppp interface goes down and
disappears all of its classes will still be deleted, and the system locks up
for a short time (with all the related TX hangs, packet loss, etc.).

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: thousands of classes, e1000 TX unit hang
  2008-08-05 13:02           ` Denys Fedoryshchenko
@ 2008-08-05 16:41             ` Jarek Poplawski
  2008-08-05 16:48               ` Denys Fedoryshchenko
  2008-08-05 21:14             ` Jarek Poplawski
  1 sibling, 1 reply; 14+ messages in thread
From: Jarek Poplawski @ 2008-08-05 16:41 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: netdev

On Tue, Aug 05, 2008 at 04:02:05PM +0300, Denys Fedoryshchenko wrote:
> >
> > Very interesting. Anyway tbf doesn't use gen estimators, so you could
> > test if it makes big difference for you.
> I can prepare and provide details, with graphs, by private email, along
> with the reasons why I do it this way. With TBF I think I cannot use the
> flow classifier; if I am not wrong, I must attach classful disciplines.
> I tried to attach just a pfifo, and it failed.

If such a config works for you then why bother? I have some doubts, but
not enough to study the graphs ;)

> > Actually, gen_kill_estimator was rewritten already, but for some
> > reason it wasn't merged. Maybe there isn't so much users with such a
> > number of classes or they don't delete them, anyway this subject isn't
> > reported often to the list (I remember once). Some workaround could be
> > probably deleting individual classes (and filters) to give away a lock
> > and soft interrupts for a while), before deleting the root, but I
> > didn't test this. BTW, you are using quite long queues (3000), so there
> > would be interesting to make them less and check if doesn't add to the
> > problem (with retransmits).
> >
> > Jarek P.
> Well i am first :-) Many people prefer buy Cisco and put static shaper on 
> their NAS for customer. With iproute2 it is possible to make MUCH better 
> feeling of using maximum bandwidth, and at same time not feeling it is 
> reached maximum. Since in some applications i have much more than amount of 
> flows SFQ can provide (and especially inefficient hashing it uses), i need 
> flow classifier and a lot of classes.
> 
> Many users on my experience dont know much about kernel maillist, and they 
> didn't got used to report about problems. Now i am collecting some feedbacks 
> from my friends and trying to reproduce and report here.

Alas, you are the second:
http://www.mail-archive.com/netdev@vger.kernel.org/msg60101.html

If you think this gen_kill_estimator() fix will solve your problems,
you can try to update that patch and resend it as yours or ours; it's
not copyrighted. Of course, there is no guarantee it'll be accepted this
time.

Jarek P.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: thousands of classes, e1000 TX unit hang
  2008-08-05 14:07           ` Denys Fedoryshchenko
@ 2008-08-05 16:48             ` Jarek Poplawski
  2008-08-05 17:18               ` Denys Fedoryshchenko
  0 siblings, 1 reply; 14+ messages in thread
From: Jarek Poplawski @ 2008-08-05 16:48 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: netdev

On Tue, Aug 05, 2008 at 05:07:25PM +0300, Denys Fedoryshchenko wrote:
> On Tuesday 05 August 2008, Jarek Poplawski wrote:
> > reported often to the list (I remember once). Some workaround could be
> > probably deleting individual classes (and filters) to give away a lock
> > and soft interrupts for a while), before deleting the root, but I
> > didn't test this. 
> Btw, even if I optimize my scripts, when a ppp interface goes down and
> disappears all of its classes will still be deleted, and the system locks
> up for a short time (with all the related TX hangs, packet loss, etc.).

Are you sure you can't let pppd run this script when the link goes
down?

Jarek P.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: thousands of classes, e1000 TX unit hang
  2008-08-05 16:41             ` Jarek Poplawski
@ 2008-08-05 16:48               ` Denys Fedoryshchenko
  0 siblings, 0 replies; 14+ messages in thread
From: Denys Fedoryshchenko @ 2008-08-05 16:48 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: netdev

On Tuesday 05 August 2008, Jarek Poplawski wrote:
> >
> > I can prepare provide details with graphs to private email and reasons
> > why i do that. With TBF i think i cannot use flow classifier, if i am not
> > wrong, i must attach classful disciplines. I tried to attach just pfifo -
> > and i fail.
>
> If such a config works for you then why bother? I have some doubts, but
> not enough to study the graphs ;)
>
TBF just makes one linear fifo: if a user is downloading a large file and
he has a 256 Kbit/s limit, his bandwidth and his other apps will suffer.

If I use flow, SFQ, or rules like the ones I showed, it balances each flow
into a small fifo and provides fair bandwidth to every flow. A flow with a
huge load simply has more of its packets dropped :-)
That means even if a customer uses his 256 Kbit/s flat out, his browsing is
still amazingly fast... (this has been checked).
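(A sketch of the per-flow mapping being described, as I understand
cls_flow; the formula is my paraphrase, not quoted from the source:)

#   minor(class) = minor(baseclass) + (hash(conntrack key) % divisor)
# e.g. a flow whose hash happens to be 1000 lands in:
printf '1:%x\n' $(( 2 + 1000 % 128 ))     # -> 1:6a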

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: thousands of classes, e1000 TX unit hang
  2008-08-05 16:48             ` Jarek Poplawski
@ 2008-08-05 17:18               ` Denys Fedoryshchenko
  0 siblings, 0 replies; 14+ messages in thread
From: Denys Fedoryshchenko @ 2008-08-05 17:18 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: netdev

On Tuesday 05 August 2008, Jarek Poplawski wrote:
> Are you sure you can't let pppd run this script when the link goes
> down?
>
> Jarek P.
Probably I can, via /etc/ppp/ip-down; I will try it.
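(A minimal sketch of such a hook; the path and file name are hypothetical,
and pppd passes the interface name as $1:)

#!/bin/sh
# /etc/ppp/ip-down.d/teardown-shaper (hypothetical)
IFACE="$1"
# run the shaper teardown here (filters, then classes, then root)
# instead of letting the kernel destroy everything at device removal
tc qdisc del dev "$IFACE" root 2>/dev/null
exit 0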


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: thousands of classes, e1000 TX unit hang
  2008-08-05 13:02           ` Denys Fedoryshchenko
  2008-08-05 16:41             ` Jarek Poplawski
@ 2008-08-05 21:14             ` Jarek Poplawski
  1 sibling, 0 replies; 14+ messages in thread
From: Jarek Poplawski @ 2008-08-05 21:14 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: netdev

> > BTW, you are using quite long queues (3000), so there

BTW, sorry for my gibberish about long queues. I missed that this is a
"b"fifo.
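(Illustration of the difference; the commands are examples, not from the
real setup:)

tc qdisc add dev ppp0 parent 1:2 handle 2: bfifo limit 3000  # 3000 bytes, ~2 full frames
# versus
tc qdisc add dev ppp0 parent 1:2 handle 2: pfifo limit 3000  # 3000 packets, a genuinely long queue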

Jarek P.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: thousands of classes, e1000 TX unit hang
  2008-08-05  7:47 thousands of classes, e1000 TX unit hang Denys Fedoryshchenko
  2008-08-05  8:06 ` Denys Fedoryshchenko
@ 2008-08-06  1:13 ` Brandeburg, Jesse
  1 sibling, 0 replies; 14+ messages in thread
From: Brandeburg, Jesse @ 2008-08-06  1:13 UTC (permalink / raw)
  To: Denys Fedoryshchenko, netdev

Denys Fedoryshchenko wrote:
> I wrote a script that looks something like this (to simulate SFQ with
> the flow classifier):

<snip>

Just to clarify the e1000 message: this is a "false hang", as indicated
by .status = 1 and TDH == TDT with both still moving, which means the
adapter is still transmitting.

In this case it appears that the system took longer than two seconds to
allow the e1000 driver to clean up the packets it transmitted; in fact the
cleanup appears to be delayed by roughly a constant 700 jiffies.
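(For example, taking the first log entry quoted below and assuming HZ=1000,
the gap between the descriptor's time_stamp and the current jiffies works
out as follows; both HZ and the one-second cleanup threshold are my
assumptions:)

# jiffies <8e6a111> minus time_stamp <8e69a7c>:
printf '%d\n' $(( 0x8e6a111 - 0x8e69a7c ))   # -> 1685 jiffies, ~1.7 s,
                                             # i.e. ~700 past a 1 s threshold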

> Error message appearing in dmesg:
> [149650.006939] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
> [149650.006943]   Tx Queue             <0>
> [149650.006944]   TDH                  <a3>
> [149650.006945]   TDT                  <a3>
> [149650.006947]   next_to_use          <a3>
> [149650.006948]   next_to_clean        <f8>
> [149650.006949] buffer_info[next_to_clean]
> [149650.006951]   time_stamp           <8e69a7c>
> [149650.006952]   next_to_watch        <f8>
> [149650.006953]   jiffies              <8e6a111>
> [149650.006954]   next_to_watch.status <1>
> [149655.964100] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
> [149655.964104]   Tx Queue             <0>
> [149655.964105]   TDH                  <6c>
> [149655.964107]   TDT                  <6c>
> [149655.964108]   next_to_use          <6c>
> [149655.964109]   next_to_clean        <c1>
> [149655.964111] buffer_info[next_to_clean]
> [149655.964112]   time_stamp           <8e6b198>
> [149655.964113]   next_to_watch        <c1>
> [149655.964115]   jiffies              <8e6b853>
> [149655.964116]   next_to_watch.status <1>

So I don't think this is an e1000 problem, as the rest of your thread
appears to confirm: the system gets too busy traversing your tc filters
and can't get around to the e1000 driver's packet cleanup.

Jesse

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2008-08-06  1:13 UTC | newest]

Thread overview: 14+ messages
2008-08-05  7:47 thousands of classes, e1000 TX unit hang Denys Fedoryshchenko
2008-08-05  8:06 ` Denys Fedoryshchenko
2008-08-05 10:05   ` Denys Fedoryshchenko
2008-08-05 11:04     ` Jarek Poplawski
2008-08-05 11:13       ` Denys Fedoryshchenko
2008-08-05 12:23         ` Jarek Poplawski
2008-08-05 13:02           ` Denys Fedoryshchenko
2008-08-05 16:41             ` Jarek Poplawski
2008-08-05 16:48               ` Denys Fedoryshchenko
2008-08-05 21:14             ` Jarek Poplawski
2008-08-05 14:07           ` Denys Fedoryshchenko
2008-08-05 16:48             ` Jarek Poplawski
2008-08-05 17:18               ` Denys Fedoryshchenko
2008-08-06  1:13 ` Brandeburg, Jesse
