xennet: skb rides the rocket messages in domU dmesg


* xennet: skb rides the rocket messages in domU dmesg
@ 2010-05-26 21:21 Mark Hurenkamp
  2010-05-26 22:39 ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 4+ messages in thread
From: Mark Hurenkamp @ 2010-05-26 21:21 UTC
  To: xen-devel

Hi,


On my home server I am running Xen-4.0.1-rc1 with a recent xen/next kernel,
and a pvm domU with the same kernel, and 4 tuners passed through.
Because the mythtv backend domain would sometimes become unstable, I decided
to split up my mythtv backend into 3 separate virtual machines: one master
backend with the database, and 2 slave backends with the tuners.
One of the slave backends has a cx23885-based DVB tuner card; the other slave
backend runs 3 ivtv-based tuners.
To keep consistency with old recording data, and since I would like to have
all recordings in a single volume, I tried to use an NFS mount of the
recordings volume from the dom0 on all backends. This resulted in a very
unstable system, to the point where my most important slave backend became
unusable.
So I tried it the other way: have the slave backends each mount their own
recordings volume as a block device via Xen, and for backwards compatibility
mount the volume which holds the old recordings via NFS on the master backend.

Now I see many "xennet: skb rides the rocket" messages appear in the (pv)
slave backend which exports the recordings volume to the master backend. I
did not see these messages when there was only a single mythtv backend.
(Both the dom0 and the mythtv domUs are based on Ubuntu Lucid server.)
Overall the system seems to perform OK, and the messages are not causing the
system to become unusable or more unstable, so it is not a major issue.

Note that both the master backend, and the slave backend which exports the
volume, are paravirtualised domains. The slave backend has the following
xen config:

kernel = '/boot/vmlinuz-2.6.32m5'
ramdisk = '/boot/initrd.img-2.6.32m5'
extra = 'root=/dev/xvda1 ro console=hvc0 noirqdebug iommu=soft swiotlb=force'
maxmem = '1000'
memory = '500'
device_model='/usr/lib/xen/bin/qemu-dm'
serial='pty'
disk = [
     'phy:/dev/vm/tilnes-lucid,hda,w',
     'phy:/dev/mythtv/recordings,hdb,w',
]
boot='c'
name = 'tilnes'
vif = [ 'mac=aa:20:00:00:01:72, bridge=loc' ]
vfb = [ 'vnc=1,vnclisten=0.0.0.0,vncdisplay=5' ]

usb=1
usbdevice='tablet'
monitor=1
pci = [
     '0000:08:02.0',
     '0000:09:08.0',
     '0000:09:09.0',
     ]
vcpus=8


The message dump I see (this is only one example; my dmesg is full of these):

xennet: skb rides the rocket: 20 frags
Pid: 3237, comm: nfsd Tainted: G      D    2.6.32m5 #9
Call Trace:
<IRQ>  [<ffffffffa005e2d4>] xennet_start_xmit+0x75/0x678 [xen_netfront]
  [<ffffffff8100eff2>] ? check_events+0x12/0x20
  [<ffffffff8136bd3a>] ? rcu_read_unlock+0x0/0x1e
  [<ffffffff8100efdf>] ? xen_restore_fl_direct_end+0x0/0x1
  [<ffffffff81076983>] ? lock_release+0x1e0/0x1ed
  [<ffffffff8136ee0e>] dev_hard_start_xmit+0x236/0x2e1
  [<ffffffff8138114e>] sch_direct_xmit+0x68/0x16f
  [<ffffffff8136f240>] dev_queue_xmit+0x274/0x3de
  [<ffffffff8136f130>] ? dev_queue_xmit+0x164/0x3de
  [<ffffffff8139cf30>] ? dst_output+0x0/0xd
  [<ffffffff8139e16d>] ip_finish_output2+0x1df/0x222
  [<ffffffff8139e218>] ip_finish_output+0x68/0x6a
  [<ffffffff8139e503>] ip_output+0x9c/0xa0
  [<ffffffff8139e5a2>] ip_local_out+0x20/0x24
  [<ffffffff8139ebfe>] ip_queue_xmit+0x309/0x37a
  [<ffffffff8100e871>] ? xen_force_evtchn_callback+0xd/0xf
  [<ffffffff8100eff2>] ? check_events+0x12/0x20
  [<ffffffff8100e871>] ? xen_force_evtchn_callback+0xd/0xf
  [<ffffffff8100eff2>] ? check_events+0x12/0x20
  [<ffffffff813b012a>] tcp_transmit_skb+0x648/0x686
  [<ffffffff813b2654>] tcp_write_xmit+0x808/0x8f7
  [<ffffffff8100eff2>] ? check_events+0x12/0x20
  [<ffffffff813af7e2>] ? tcp_established_options+0x2e/0xa9
  [<ffffffff813b279e>] __tcp_push_pending_frames+0x2a/0x58
  [<ffffffff813ac124>] tcp_data_snd_check+0x24/0xea
  [<ffffffff813ae464>] tcp_rcv_established+0xdd/0x6d4
  [<ffffffff8100eff2>] ? check_events+0x12/0x20
  [<ffffffff813b535b>] tcp_v4_do_rcv+0x1ba/0x375
  [<ffffffff8100efdf>] ? xen_restore_fl_direct_end+0x0/0x1
  [<ffffffff813b62d1>] ? tcp_v4_rcv+0x2b3/0x6b7
  [<ffffffff813b6474>] tcp_v4_rcv+0x456/0x6b7
  [<ffffffff8139ad27>] ? ip_local_deliver_finish+0x0/0x235
  [<ffffffff8139ae9b>] ip_local_deliver_finish+0x174/0x235
  [<ffffffff8139ad6b>] ? ip_local_deliver_finish+0x44/0x235
  [<ffffffff8139afce>] ip_local_deliver+0x72/0x7c
  [<ffffffff8139a89d>] ip_rcv_finish+0x3cd/0x3fb
  [<ffffffff8139ab84>] ip_rcv+0x2b9/0x2f9
  [<ffffffff813ec764>] ? packet_rcv_spkt+0xd6/0xe1
  [<ffffffff8136e065>] netif_receive_skb+0x445/0x46f
  [<ffffffff810c0b92>] ? free_hot_page+0x3a/0x3f
  [<ffffffffa005f41d>] xennet_poll+0xaf4/0xc7b [xen_netfront]
  [<ffffffff8136e7ac>] net_rx_action+0xab/0x1df
  [<ffffffff81076983>] ? lock_release+0x1e0/0x1ed
  [<ffffffff81053842>] __do_softirq+0xe0/0x1a2
  [<ffffffff8109e984>] ? handle_level_irq+0xd1/0xda
  [<ffffffff8126a152>] ? __xen_evtchn_do_upcall+0x12e/0x163
  [<ffffffff81012cac>] call_softirq+0x1c/0x30
  [<ffffffff8101428d>] do_softirq+0x41/0x81
  [<ffffffff81053694>] irq_exit+0x36/0x78
  [<ffffffff8126a646>] xen_evtchn_do_upcall+0x37/0x47
  [<ffffffff81012cfe>] xen_do_hypervisor_callback+0x1e/0x30
<EOI>  [<ffffffff8111dc4a>] ? __bio_add_page+0xee/0x212
  [<ffffffff8111df9b>] ? bio_alloc+0x10/0x1f
  [<ffffffff811217e1>] ? mpage_alloc+0x25/0x7d
  [<ffffffff8111dd9f>] ? bio_add_page+0x31/0x33
  [<ffffffff81121dde>] ? do_mpage_readpage+0x3d3/0x488
  [<ffffffff810bba3b>] ? add_to_page_cache_locked+0xcc/0x108
  [<ffffffff81121fbb>] ? mpage_readpages+0xcb/0x10f
  [<ffffffff81159aee>] ? ext3_get_block+0x0/0xf9
  [<ffffffff81159aee>] ? ext3_get_block+0x0/0xf9
  [<ffffffff81157bbc>] ? ext3_readpages+0x18/0x1a
  [<ffffffff810c387b>] ? __do_page_cache_readahead+0x140/0x1cd
  [<ffffffff8100eff2>] ? check_events+0x12/0x20
  [<ffffffff8100efdf>] ? xen_restore_fl_direct_end+0x0/0x1
  [<ffffffff810c3924>] ? ra_submit+0x1c/0x20
  [<ffffffff810c3d1c>] ? ondemand_readahead+0x1de/0x1f1
  [<ffffffff810c3dc3>] ? page_cache_sync_readahead+0x17/0x1c
  [<ffffffff81118790>] ? __generic_file_splice_read+0xf0/0x41a
  [<ffffffff8100eff2>] ? check_events+0x12/0x20
  [<ffffffff811c0c07>] ? rcu_read_unlock+0x0/0x1e
  [<ffffffff8100efdf>] ? xen_restore_fl_direct_end+0x0/0x1
  [<ffffffff81076983>] ? lock_release+0x1e0/0x1ed
  [<ffffffff811c0c23>] ? rcu_read_unlock+0x1c/0x1e
  [<ffffffff811c16e2>] ? avc_has_perm_noaudit+0x3b5/0x3c7
  [<ffffffff810ec8eb>] ? check_object+0x170/0x1a9
  [<ffffffff8100e871>] ? xen_force_evtchn_callback+0xd/0xf
  [<ffffffff8111770d>] ? spd_release_page+0x0/0x14
  [<ffffffff811c4d62>] ? selinux_file_permission+0x57/0xae
  [<ffffffff81118afe>] ? generic_file_splice_read+0x44/0x72
  [<ffffffff8100eff2>] ? check_events+0x12/0x20
  [<ffffffff811171d4>] ? do_splice_to+0x6c/0x79
  [<ffffffff8100eff2>] ? check_events+0x12/0x20
  [<ffffffff811178ed>] ? splice_direct_to_actor+0xc2/0x1a1
  [<ffffffff8100efdf>] ? xen_restore_fl_direct_end+0x0/0x1
  [<ffffffffa01a0b53>] ? nfsd_direct_splice_actor+0x0/0x12 [nfsd]
  [<ffffffffa01a0a44>] ? nfsd_vfs_read+0x276/0x385 [nfsd]
  [<ffffffffa01a115a>] ? nfsd_read+0xa1/0xbf [nfsd]
  [<ffffffffa00de128>] ? svc_xprt_enqueue+0x22b/0x238 [sunrpc]
  [<ffffffffa01a7cbf>] ? nfsd3_proc_read+0xe2/0x121 [nfsd]
  [<ffffffffa00d6551>] ? cache_put+0x2d/0x2f [sunrpc]
  [<ffffffffa019c36f>] ? nfsd_dispatch+0xec/0x1c7 [nfsd]
  [<ffffffffa00d2e99>] ? svc_process+0x436/0x637 [sunrpc]
  [<ffffffffa01a4418>] ? exp_readlock+0x10/0x12 [nfsd]
  [<ffffffffa019c8c0>] ? nfsd+0xf3/0x13e [nfsd]
  [<ffffffffa019c7cd>] ? nfsd+0x0/0x13e [nfsd]
  [<ffffffff8106601d>] ? kthread+0x7a/0x82
  [<ffffffff81012baa>] ? child_rip+0xa/0x20
  [<ffffffff81011ce6>] ? int_ret_from_sys_call+0x7/0x1b
  [<ffffffff81012526>] ? retint_restore_args+0x5/0x6
  [<ffffffff81012ba0>] ? child_rip+0x0/0x20


* Re: xennet: skb rides the rocket messages in domU dmesg
  2010-05-26 21:21 xennet: skb rides the rocket messages in domU dmesg Mark Hurenkamp
@ 2010-05-26 22:39 ` Jeremy Fitzhardinge
  2010-05-29 21:43   ` Mark Hurenkamp
  0 siblings, 1 reply; 4+ messages in thread
From: Jeremy Fitzhardinge @ 2010-05-26 22:39 UTC
  To: Mark Hurenkamp; +Cc: xen-devel

On 05/26/2010 02:21 PM, Mark Hurenkamp wrote:
> Hi,
>
>
> On my home server I am running Xen-4.0.1-rc1 with a recent xen/next kernel,

Are you actually using the "xen/next" branch?  I recommend you use
xen/stable-2.6.32.x, since that's tracking all the other bugfixes going
into Linux 2.6.32.

> and a pvm domU with the same kernel, and 4 tuners passed through.
> Because the mythtv backend domain would sometimes become unstable, I
> decided to split up my mythtv backend into 3 separate virtual machines:
> one master backend with the database, and 2 slave backends with the
> tuners. One of the slave backends has a cx23885-based DVB tuner card; the
> other slave backend runs 3 ivtv-based tuners.
> To keep consistency with old recording data, and since I would like to
> have all recordings in a single volume, I tried to use an NFS mount of
> the recordings volume from the dom0 on all backends. This resulted in a
> very unstable system, to the point where my most important slave backend
> became unusable.

Unstable how?

> So I tried it the other way: have the slave backends each mount their own
> recordings volume as a block device via Xen, and for backwards
> compatibility mount the volume which holds the old recordings via NFS on
> the master backend.
>
> Now I see many "xennet: skb rides the rocket" messages appear in the (pv)
> slave backend which exports the recordings volume to the master backend.
> I did not see these messages when there was only a single mythtv backend.
> (Both the dom0 and the mythtv domUs are based on Ubuntu Lucid server.)
> Overall the system seems to perform OK, and the messages are not causing
> the system to become unusable or more unstable, so it is not a major issue.

That appears to mean that you're getting single packets which are larger
than 18 pages (72k).  I'm not quite sure how that's possible, since
I thought the datagram limit is 64k.
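
For reference, the arithmetic behind the 72k figure can be sketched as
below. This is only an illustration, not the netfront source: the 4 KiB page
size, the fragment limit of 18, and the idea that the transmit path counts
the pages spanned by the linear header area plus one slot per page fragment
are all assumptions made for the example, and pages_spanned and slots_needed
are hypothetical helper names. Under those assumptions a packet carrying
less than 64k of data could still need 20 slots if its header area crosses
a page boundary.

```python
# Sketch only: constants and the counting rule are assumptions, not taken
# from the netfront driver source.
PAGE_SIZE = 4096        # assumption: 4 KiB pages
MAX_SKB_FRAGS = 18      # assumption: matches the "18 pages (72k)" figure


def pages_spanned(offset: int, length: int) -> int:
    """Number of pages touched by a buffer that starts `offset` bytes
    into a page and is `length` bytes long."""
    if length == 0:
        return 0
    return (offset % PAGE_SIZE + length + PAGE_SIZE - 1) // PAGE_SIZE


def slots_needed(head_offset: int, head_len: int, nr_frags: int) -> int:
    """Hypothetical slot count: pages of the linear header area plus one
    slot per page fragment."""
    return pages_spanned(head_offset, head_len) + nr_frags


if __name__ == "__main__":
    # 18 pages of 4 KiB each: 73728 bytes, i.e. 72k.
    print(MAX_SKB_FRAGS * PAGE_SIZE)
    # A 200-byte header starting 4000 bytes into a page straddles two
    # pages; with 18 page fragments that gives 20 slots in this model.
    print(slots_needed(head_offset=4000, head_len=200, nr_frags=18))
```

The point of the sketch is that the count is in pages touched, not bytes,
so the 64k datagram limit and the fragment limit are not directly
comparable.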

Are you using nfs over udp or tcp?  (I think tcp, from your stack trace.)

Does turning off tso/gso with ethtool make a difference?

    J

>
> Note that both the master backend, and the slave backend which exports
> the volume, are paravirtualised domains. The slave backend has the
> following xen config:
>
> kernel = '/boot/vmlinuz-2.6.32m5'
> ramdisk = '/boot/initrd.img-2.6.32m5'
> extra = 'root=/dev/xvda1 ro console=hvc0 noirqdebug iommu=soft swiotlb=force'
> maxmem = '1000'
> memory = '500'
> device_model='/usr/lib/xen/bin/qemu-dm'
> serial='pty'
> disk = [
>     'phy:/dev/vm/tilnes-lucid,hda,w',
>     'phy:/dev/mythtv/recordings,hdb,w',
> ]
> boot='c'
> name = 'tilnes'
> vif = [ 'mac=aa:20:00:00:01:72, bridge=loc' ]
> vfb = [ 'vnc=1,vnclisten=0.0.0.0,vncdisplay=5' ]
>
> usb=1
> usbdevice='tablet'
> monitor=1
> pci = [
>     '0000:08:02.0',
>     '0000:09:08.0',
>     '0000:09:09.0',
>     ]
> vcpus=8
>
>
> The message dump I see (this is only one example; my dmesg is full of these):
>
> xennet: skb rides the rocket: 20 frags
> Pid: 3237, comm: nfsd Tainted: G      D    2.6.32m5 #9
> Call Trace:
> <IRQ>  [<ffffffffa005e2d4>] xennet_start_xmit+0x75/0x678 [xen_netfront]
>  [<ffffffff8100eff2>] ? check_events+0x12/0x20
>  [<ffffffff8136bd3a>] ? rcu_read_unlock+0x0/0x1e
>  [<ffffffff8100efdf>] ? xen_restore_fl_direct_end+0x0/0x1
>  [<ffffffff81076983>] ? lock_release+0x1e0/0x1ed
>  [<ffffffff8136ee0e>] dev_hard_start_xmit+0x236/0x2e1
>  [<ffffffff8138114e>] sch_direct_xmit+0x68/0x16f
>  [<ffffffff8136f240>] dev_queue_xmit+0x274/0x3de
>  [<ffffffff8136f130>] ? dev_queue_xmit+0x164/0x3de
>  [<ffffffff8139cf30>] ? dst_output+0x0/0xd
>  [<ffffffff8139e16d>] ip_finish_output2+0x1df/0x222
>  [<ffffffff8139e218>] ip_finish_output+0x68/0x6a
>  [<ffffffff8139e503>] ip_output+0x9c/0xa0
>  [<ffffffff8139e5a2>] ip_local_out+0x20/0x24
>  [<ffffffff8139ebfe>] ip_queue_xmit+0x309/0x37a
>  [<ffffffff8100e871>] ? xen_force_evtchn_callback+0xd/0xf
>  [<ffffffff8100eff2>] ? check_events+0x12/0x20
>  [<ffffffff8100e871>] ? xen_force_evtchn_callback+0xd/0xf
>  [<ffffffff8100eff2>] ? check_events+0x12/0x20
>  [<ffffffff813b012a>] tcp_transmit_skb+0x648/0x686
>  [<ffffffff813b2654>] tcp_write_xmit+0x808/0x8f7
>  [<ffffffff8100eff2>] ? check_events+0x12/0x20
>  [<ffffffff813af7e2>] ? tcp_established_options+0x2e/0xa9
>  [<ffffffff813b279e>] __tcp_push_pending_frames+0x2a/0x58
>  [<ffffffff813ac124>] tcp_data_snd_check+0x24/0xea
>  [<ffffffff813ae464>] tcp_rcv_established+0xdd/0x6d4
>  [<ffffffff8100eff2>] ? check_events+0x12/0x20
>  [<ffffffff813b535b>] tcp_v4_do_rcv+0x1ba/0x375
>  [<ffffffff8100efdf>] ? xen_restore_fl_direct_end+0x0/0x1
>  [<ffffffff813b62d1>] ? tcp_v4_rcv+0x2b3/0x6b7
>  [<ffffffff813b6474>] tcp_v4_rcv+0x456/0x6b7
>  [<ffffffff8139ad27>] ? ip_local_deliver_finish+0x0/0x235
>  [<ffffffff8139ae9b>] ip_local_deliver_finish+0x174/0x235
>  [<ffffffff8139ad6b>] ? ip_local_deliver_finish+0x44/0x235
>  [<ffffffff8139afce>] ip_local_deliver+0x72/0x7c
>  [<ffffffff8139a89d>] ip_rcv_finish+0x3cd/0x3fb
>  [<ffffffff8139ab84>] ip_rcv+0x2b9/0x2f9
>  [<ffffffff813ec764>] ? packet_rcv_spkt+0xd6/0xe1
>  [<ffffffff8136e065>] netif_receive_skb+0x445/0x46f
>  [<ffffffff810c0b92>] ? free_hot_page+0x3a/0x3f
>  [<ffffffffa005f41d>] xennet_poll+0xaf4/0xc7b [xen_netfront]
>  [<ffffffff8136e7ac>] net_rx_action+0xab/0x1df
>  [<ffffffff81076983>] ? lock_release+0x1e0/0x1ed
>  [<ffffffff81053842>] __do_softirq+0xe0/0x1a2
>  [<ffffffff8109e984>] ? handle_level_irq+0xd1/0xda
>  [<ffffffff8126a152>] ? __xen_evtchn_do_upcall+0x12e/0x163
>  [<ffffffff81012cac>] call_softirq+0x1c/0x30
>  [<ffffffff8101428d>] do_softirq+0x41/0x81
>  [<ffffffff81053694>] irq_exit+0x36/0x78
>  [<ffffffff8126a646>] xen_evtchn_do_upcall+0x37/0x47
>  [<ffffffff81012cfe>] xen_do_hypervisor_callback+0x1e/0x30
> <EOI>  [<ffffffff8111dc4a>] ? __bio_add_page+0xee/0x212
>  [<ffffffff8111df9b>] ? bio_alloc+0x10/0x1f
>  [<ffffffff811217e1>] ? mpage_alloc+0x25/0x7d
>  [<ffffffff8111dd9f>] ? bio_add_page+0x31/0x33
>  [<ffffffff81121dde>] ? do_mpage_readpage+0x3d3/0x488
>  [<ffffffff810bba3b>] ? add_to_page_cache_locked+0xcc/0x108
>  [<ffffffff81121fbb>] ? mpage_readpages+0xcb/0x10f
>  [<ffffffff81159aee>] ? ext3_get_block+0x0/0xf9
>  [<ffffffff81159aee>] ? ext3_get_block+0x0/0xf9
>  [<ffffffff81157bbc>] ? ext3_readpages+0x18/0x1a
>  [<ffffffff810c387b>] ? __do_page_cache_readahead+0x140/0x1cd
>  [<ffffffff8100eff2>] ? check_events+0x12/0x20
>  [<ffffffff8100efdf>] ? xen_restore_fl_direct_end+0x0/0x1
>  [<ffffffff810c3924>] ? ra_submit+0x1c/0x20
>  [<ffffffff810c3d1c>] ? ondemand_readahead+0x1de/0x1f1
>  [<ffffffff810c3dc3>] ? page_cache_sync_readahead+0x17/0x1c
>  [<ffffffff81118790>] ? __generic_file_splice_read+0xf0/0x41a
>  [<ffffffff8100eff2>] ? check_events+0x12/0x20
>  [<ffffffff811c0c07>] ? rcu_read_unlock+0x0/0x1e
>  [<ffffffff8100efdf>] ? xen_restore_fl_direct_end+0x0/0x1
>  [<ffffffff81076983>] ? lock_release+0x1e0/0x1ed
>  [<ffffffff811c0c23>] ? rcu_read_unlock+0x1c/0x1e
>  [<ffffffff811c16e2>] ? avc_has_perm_noaudit+0x3b5/0x3c7
>  [<ffffffff810ec8eb>] ? check_object+0x170/0x1a9
>  [<ffffffff8100e871>] ? xen_force_evtchn_callback+0xd/0xf
>  [<ffffffff8111770d>] ? spd_release_page+0x0/0x14
>  [<ffffffff811c4d62>] ? selinux_file_permission+0x57/0xae
>  [<ffffffff81118afe>] ? generic_file_splice_read+0x44/0x72
>  [<ffffffff8100eff2>] ? check_events+0x12/0x20
>  [<ffffffff811171d4>] ? do_splice_to+0x6c/0x79
>  [<ffffffff8100eff2>] ? check_events+0x12/0x20
>  [<ffffffff811178ed>] ? splice_direct_to_actor+0xc2/0x1a1
>  [<ffffffff8100efdf>] ? xen_restore_fl_direct_end+0x0/0x1
>  [<ffffffffa01a0b53>] ? nfsd_direct_splice_actor+0x0/0x12 [nfsd]
>  [<ffffffffa01a0a44>] ? nfsd_vfs_read+0x276/0x385 [nfsd]
>  [<ffffffffa01a115a>] ? nfsd_read+0xa1/0xbf [nfsd]
>  [<ffffffffa00de128>] ? svc_xprt_enqueue+0x22b/0x238 [sunrpc]
>  [<ffffffffa01a7cbf>] ? nfsd3_proc_read+0xe2/0x121 [nfsd]
>  [<ffffffffa00d6551>] ? cache_put+0x2d/0x2f [sunrpc]
>  [<ffffffffa019c36f>] ? nfsd_dispatch+0xec/0x1c7 [nfsd]
>  [<ffffffffa00d2e99>] ? svc_process+0x436/0x637 [sunrpc]
>  [<ffffffffa01a4418>] ? exp_readlock+0x10/0x12 [nfsd]
>  [<ffffffffa019c8c0>] ? nfsd+0xf3/0x13e [nfsd]
>  [<ffffffffa019c7cd>] ? nfsd+0x0/0x13e [nfsd]
>  [<ffffffff8106601d>] ? kthread+0x7a/0x82
>  [<ffffffff81012baa>] ? child_rip+0xa/0x20
>  [<ffffffff81011ce6>] ? int_ret_from_sys_call+0x7/0x1b
>  [<ffffffff81012526>] ? retint_restore_args+0x5/0x6
>  [<ffffffff81012ba0>] ? child_rip+0x0/0x20
>
>


* Re: xennet: skb rides the rocket messages in domU dmesg
  2010-05-26 22:39 ` Jeremy Fitzhardinge
@ 2010-05-29 21:43   ` Mark Hurenkamp
  2010-06-01 16:42     ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 4+ messages in thread
From: Mark Hurenkamp @ 2010-05-29 21:43 UTC
  To: Jeremy Fitzhardinge; +Cc: xen-devel

Hi

> Are you actually using the "xen/next" branch?  I recommend you use
> xen/stable-2.6.32.x, since that's tracking all the other bugfixes going
> into Linux 2.6.32.
>    
I was using xen/next since some of the features I use were not
in xen/stable at the time. I built a new xen/stable-2.6.32.x yesterday,
which does seem to work fine, so I guess I can follow that branch
now.

>> To keep consistency with old recording data, and since I would like to
>> have all recordings in a single volume, I tried to use an NFS mount of
>> the recordings volume from the dom0 on all backends. This resulted in a
>> very unstable system, to the point where my most important slave backend
>> became unusable.
>>      
> Unstable how?
>    
The mythtv backends were not able to reliably record shows on an
NFS-mounted filesystem. The ivtv driver would complain about the application
not reading fast enough. This made the backends unusable.

> That appears to mean that you're getting single packets which are larger
> than 18 pages (72k).  I'm not quite sure how that's possible, since
> I thought the datagram limit is 64k.
>
> Are you using nfs over udp or tcp?  (I think tcp, from your stack trace.)
>
> Does turning off tso/gso with ethtool make a difference?
>    
OK, I tried this on the running system, and it did seem to improve
things, but I would still see some (other) messages.
After a reboot, with the new xen/stable-2.6.32.13.x based kernel
and with tso and gso switched off with ethtool, these messages are
now completely gone (the system has been up for about a day now).
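
For anyone wanting to repeat this, the ethtool invocation can be scripted
roughly as follows. This is a minimal sketch: the interface name eth0 is an
assumption (check which interface carries the NFS traffic in the domU), the
helper names are made up for the example, and actually running the command
needs root. Settings made this way do not survive a reboot, so they have to
be reapplied at boot.

```python
import subprocess


def offload_off_command(iface, features=("tso", "gso")):
    """Build the argv for `ethtool -K <iface> <feature> off ...`."""
    cmd = ["ethtool", "-K", iface]
    for feature in features:
        cmd += [feature, "off"]
    return cmd


def disable_offloads(iface, dry_run=True):
    """Return the command; run it as well unless dry_run is set.

    With dry_run=False this needs root and an existing interface.
    """
    cmd = offload_off_command(iface)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # "eth0" is an assumed interface name, not taken from the thread.
    print(" ".join(disable_offloads("eth0", dry_run=True)))
```

The current state can be checked afterwards with `ethtool -k eth0`
(lower-case k), again assuming that interface name.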

I do notice something else though (might have been there before,
but now it is the only message in domU dmesg), just after starting
nfs during boot of the domU:

BUG: unable to handle kernel paging request at 00000002dcf32198
IP: [<ffffffff811cf09a>] bitmap_scnprintf+0x5c/0xb6
PGD a777067 PUD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci-0/pci0000:08/0000:08:02.0/local_cpus
CPU 0
Modules linked in: nfsd exportfs nfs lockd fscache nfs_acl auth_rpcgss 
autofs4 ipv6 wm8775 tea5767 cx25840 tuner_simple sunrpc tuner_types 
tda9887 tda8290 tuner msp3400 saa7127 saa7115 ivtv i2c_algo_bit cx2341x 
v4l2_common videodev v4l1_compat xen_fbfront v4l2_compat_ioctl32 
fb_sys_fops tveeprom sysimgblt joydev i2c_core sysfillrect xen_kbdfront 
syscopyarea xen_netfront raid10 raid456 async_raid6_recov async_pq 
raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear
Pid: 3468, comm: irqbalance Not tainted 2.6.32.13m7.1 #1
RIP: e030:[<ffffffff811cf09a>]  [<ffffffff811cf09a>] bitmap_scnprintf+0x5c/0xb6
RSP: e02b:ffff88001cbd9e18  EFLAGS: 00010246
RAX: ffffffff81527f2b RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000ffe RDI: 0000000000000000
RBP: ffff88001cbd9e48 R08: 0000000000000010 R09: 0000000000000001
R10: 0000000000000357 R11: dead000000200200 R12: 0000000000000000
R13: 0000000000000ffe R14: 00000002dcf32198 R15: ffff880002bbd000
FS:  00007fc142b6d720(0000) GS:ffff8800046e0000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000002dcf32198 CR3: 000000001ca58000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process irqbalance (pid: 3468, threadinfo ffff88001cbd8000, task ffff88001ded2920)
Stack:
  0000000000000200 ffff880002bbd000 ffff88001cbd9f58 ffff880002eeb858
<0> ffff88001ce8ed10 ffffffff81616230 ffff88001cbd9e68 ffffffff811dd333
<0> ffff880002eeb878 ffffffff81606368 ffff88001cbd9e98 ffffffff81273574
Call Trace:
  [<ffffffff811dd333>] local_cpus_show+0x44/0x57
  [<ffffffff81273574>] dev_attr_show+0x22/0x49
  [<ffffffff810a4e8e>] ? __get_free_pages+0x9/0x46
  [<ffffffff8112fbc2>] sysfs_read_file+0xb4/0x139
  [<ffffffff810da927>] vfs_read+0xa6/0x103
  [<ffffffff810daa3a>] sys_read+0x45/0x69
  [<ffffffff81011b02>] system_call_fastpath+0x16/0x1b
Code: e0 48 c7 c0 2b 7f 52 81 41 83 ec 20 31 db eb 60 44 89 e2 44 89 e1 
48 63 fb 83 e1 3f c1 fa 06 41 b9 01 00 00 00 48 63 d2 44 89 ee <49> 8b 
14 d6 29 de 48 d3 ea 49 8d 3c 3f 44 88 c1 41 83 ec 20 49
RIP  [<ffffffff811cf09a>] bitmap_scnprintf+0x5c/0xb6
  RSP <ffff88001cbd9e18>
CR2: 00000002dcf32198
---[ end trace 5f520ed1e48e5394 ]---


During boot of dom0 I see the following when it is starting my domU
(seems to be more of a warning):
BUG: MAX_LOCK_DEPTH too low!
turning off the locking correctness validator.
Pid: 5861, comm: qemu-dm Not tainted 2.6.32.13m7.1 #1
Call Trace:
  [<ffffffff8106a625>] __lock_acquire+0x431/0x459
  [<ffffffff810b029d>] ? vma_prio_tree_remove+0x27/0xda
  [<ffffffff8106a6b1>] lock_acquire+0x64/0x81
  [<ffffffff810b939d>] ? mm_take_all_locks+0xe5/0x11c
  [<ffffffff813cdb70>] _spin_lock_nest_lock+0x31/0x66
  [<ffffffff810b939d>] ? mm_take_all_locks+0xe5/0x11c
  [<ffffffff813ccc0e>] ? mutex_lock_nested+0x34/0x39
  [<ffffffff810b939d>] mm_take_all_locks+0xe5/0x11c
  [<ffffffff810cbcbc>] ? do_mmu_notifier_register+0x56/0x113
  [<ffffffff810cbcc4>] do_mmu_notifier_register+0x5e/0x113
  [<ffffffff810cbd94>] mmu_notifier_register+0xe/0x10
  [<ffffffff8123acdb>] gntdev_open+0x8f/0xcc
  [<ffffffff81257dc2>] misc_open+0x188/0x21e
  [<ffffffff810dd1f6>] chrdev_open+0x164/0x185
  [<ffffffff810dd092>] ? chrdev_open+0x0/0x185
  [<ffffffff810d8bd5>] __dentry_open+0x149/0x27f
  [<ffffffff810d8dd1>] nameidata_to_filp+0x3d/0x4e
  [<ffffffff810e59ed>] do_filp_open+0x4ee/0x9e9
  [<ffffffff8100e871>] ? xen_force_evtchn_callback+0xd/0xf
  [<ffffffff8100eff2>] ? check_events+0x12/0x20
  [<ffffffff811d0637>] ? _raw_spin_unlock+0x8f/0x98
  [<ffffffff813cdb3a>] ? _spin_unlock+0x26/0x2b
  [<ffffffff810eedf2>] ? alloc_fd+0x111/0x123
  [<ffffffff810d89a3>] do_sys_open+0x5e/0x10a
  [<ffffffff810d8a78>] sys_open+0x1b/0x1d
  [<ffffffff81011b02>] system_call_fastpath+0x16/0x1b


Probably not related: I see the following message in my dom0 from time
to time, and if it appears at the 'wrong' moment, it causes my system to
become completely unusable as soon as a process needs disk access.

ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata4.00: BMDMA stat 0x64
ata4.00: failed command: READ DMA
ata4.00: cmd c8/00:08:99:13:5c/00:00:00:00:00/ef tag 0 dma 4096 in
          res 51/40:00:a0:13:5c/00:00:00:00:00/ef Emask 0x9 (media error)
ata4.00: status: { DRDY ERR }
ata4.00: error: { UNC }
ata4.00: configured for UDMA/133
ata4.01: configured for UDMA/133
ata4: EH complete

I am not sure if this is related, though. It could be just a bad disk (it
always seems to involve the same disk), so I'm going to replace the
disk and see if that makes a difference.


Regards,
Mark


* Re: xennet: skb rides the rocket messages in domU dmesg
  2010-05-29 21:43   ` Mark Hurenkamp
@ 2010-06-01 16:42     ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 4+ messages in thread
From: Jeremy Fitzhardinge @ 2010-06-01 16:42 UTC
  To: Mark Hurenkamp; +Cc: xen-devel

On 05/29/2010 02:43 PM, Mark Hurenkamp wrote:
>> That appears to mean that you're getting single packets which are larger
>> than 18 pages (72k).  I'm not quite sure how that's possible, since
>> I thought the datagram limit is 64k.
>>
>> Are you using nfs over udp or tcp?  (I think tcp, from your stack
>> trace.)
>>
>> Does turning off tso/gso with ethtool make a difference?
>>    
> OK, I tried this on the running system, and it did seem to improve
> things, but I would still see some (other) messages.
> After a reboot, with the new xen/stable-2.6.32.13.x based kernel
> and with tso and gso switched off with ethtool, these messages are
> now completely gone (the system has been up for about a day now).

Hm.  I don't think disabling them should be necessary, but the only
downside in doing so is slightly higher per-packet processing cost.

>
> I do notice something else though (might have been there before,
> but now it is the only message in domU dmesg), just after starting
> nfs during boot of the domU:
>
> BUG: unable to handle kernel paging request at 00000002dcf32198
> IP: [<ffffffff811cf09a>] bitmap_scnprintf+0x5c/0xb6
> PGD a777067 PUD 0
> Oops: 0000 [#1] SMP
> last sysfs file: /sys/devices/pci-0/pci0000:08/0000:08:02.0/local_cpus

What device is 0000:08:02.0?

> CPU 0
> Modules linked in: nfsd exportfs nfs lockd fscache nfs_acl auth_rpcgss
> autofs4 ipv6 wm8775 tea5767 cx25840 tuner_simple sunrpc tuner_types
> tda9887 tda8290 tuner msp3400 saa7127 saa7115 ivtv i2c_algo_bit
> cx2341x v4l2_common videodev v4l1_compat xen_fbfront
> v4l2_compat_ioctl32 fb_sys_fops tveeprom sysimgblt joydev i2c_core
> sysfillrect xen_kbdfront syscopyarea xen_netfront raid10 raid456
> async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy
> async_tx raid1 raid0 multipath linear
> Pid: 3468, comm: irqbalance Not tainted 2.6.32.13m7.1 #1
> RIP: e030:[<ffffffff811cf09a>]  [<ffffffff811cf09a>]
> bitmap_scnprintf+0x5c/0xb6
> RSP: e02b:ffff88001cbd9e18  EFLAGS: 00010246
> RAX: ffffffff81527f2b RBX: 0000000000000000 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 0000000000000ffe RDI: 0000000000000000
> RBP: ffff88001cbd9e48 R08: 0000000000000010 R09: 0000000000000001
> R10: 0000000000000357 R11: dead000000200200 R12: 0000000000000000
> R13: 0000000000000ffe R14: 00000002dcf32198 R15: ffff880002bbd000
> FS:  00007fc142b6d720(0000) GS:ffff8800046e0000(0000)
> knlGS:0000000000000000
> CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00000002dcf32198 CR3: 000000001ca58000 CR4: 0000000000002660
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process irqbalance (pid: 3468, threadinfo ffff88001cbd8000, task
> ffff88001ded2920)
> Stack:
>  0000000000000200 ffff880002bbd000 ffff88001cbd9f58 ffff880002eeb858
> <0> ffff88001ce8ed10 ffffffff81616230 ffff88001cbd9e68 ffffffff811dd333
> <0> ffff880002eeb878 ffffffff81606368 ffff88001cbd9e98 ffffffff81273574
> Call Trace:
>  [<ffffffff811dd333>] local_cpus_show+0x44/0x57
>  [<ffffffff81273574>] dev_attr_show+0x22/0x49
>  [<ffffffff810a4e8e>] ? __get_free_pages+0x9/0x46
>  [<ffffffff8112fbc2>] sysfs_read_file+0xb4/0x139
>  [<ffffffff810da927>] vfs_read+0xa6/0x103
>  [<ffffffff810daa3a>] sys_read+0x45/0x69
>  [<ffffffff81011b02>] system_call_fastpath+0x16/0x1b
> Code: e0 48 c7 c0 2b 7f 52 81 41 83 ec 20 31 db eb 60 44 89 e2 44 89
> e1 48 63 fb 83 e1 3f c1 fa 06 41 b9 01 00 00 00 48 63 d2 44 89 ee <49>
> 8b 14 d6 29 de 48 d3 ea 49 8d 3c 3f 44 88 c1 41 83 ec 20 49
> RIP  [<ffffffff811cf09a>] bitmap_scnprintf+0x5c/0xb6
>  RSP <ffff88001cbd9e18>
> CR2: 00000002dcf32198
> ---[ end trace 5f520ed1e48e5394 ]---
>
>
> During boot of dom0 I see the following when it is starting my domU
> (seems to be more of a warning):
> BUG: MAX_LOCK_DEPTH too low!
> turning off the locking correctness validator.

Interesting.  That looks like a bug in the core kernel's mmu notifier
machinery that we're using, but the only side-effect is that it will
disable lockdep checking.

> Pid: 5861, comm: qemu-dm Not tainted 2.6.32.13m7.1 #1
> Call Trace:
>  [<ffffffff8106a625>] __lock_acquire+0x431/0x459
>  [<ffffffff810b029d>] ? vma_prio_tree_remove+0x27/0xda
>  [<ffffffff8106a6b1>] lock_acquire+0x64/0x81
>  [<ffffffff810b939d>] ? mm_take_all_locks+0xe5/0x11c
>  [<ffffffff813cdb70>] _spin_lock_nest_lock+0x31/0x66
>  [<ffffffff810b939d>] ? mm_take_all_locks+0xe5/0x11c
>  [<ffffffff813ccc0e>] ? mutex_lock_nested+0x34/0x39
>  [<ffffffff810b939d>] mm_take_all_locks+0xe5/0x11c
>  [<ffffffff810cbcbc>] ? do_mmu_notifier_register+0x56/0x113
>  [<ffffffff810cbcc4>] do_mmu_notifier_register+0x5e/0x113
>  [<ffffffff810cbd94>] mmu_notifier_register+0xe/0x10
>  [<ffffffff8123acdb>] gntdev_open+0x8f/0xcc
>  [<ffffffff81257dc2>] misc_open+0x188/0x21e
>  [<ffffffff810dd1f6>] chrdev_open+0x164/0x185
>  [<ffffffff810dd092>] ? chrdev_open+0x0/0x185
>  [<ffffffff810d8bd5>] __dentry_open+0x149/0x27f
>  [<ffffffff810d8dd1>] nameidata_to_filp+0x3d/0x4e
>  [<ffffffff810e59ed>] do_filp_open+0x4ee/0x9e9
>  [<ffffffff8100e871>] ? xen_force_evtchn_callback+0xd/0xf
>  [<ffffffff8100eff2>] ? check_events+0x12/0x20
>  [<ffffffff811d0637>] ? _raw_spin_unlock+0x8f/0x98
>  [<ffffffff813cdb3a>] ? _spin_unlock+0x26/0x2b
>  [<ffffffff810eedf2>] ? alloc_fd+0x111/0x123
>  [<ffffffff810d89a3>] do_sys_open+0x5e/0x10a
>  [<ffffffff810d8a78>] sys_open+0x1b/0x1d
>  [<ffffffff81011b02>] system_call_fastpath+0x16/0x1b
>
>
> Probably not related: I see the following message in my dom0 from time
> to time, and if it appears at the 'wrong' moment, it causes my system
> to become completely unusable as soon as a process needs disk access.
>
> ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> ata4.00: BMDMA stat 0x64
> ata4.00: failed command: READ DMA
> ata4.00: cmd c8/00:08:99:13:5c/00:00:00:00:00/ef tag 0 dma 4096 in
>          res 51/40:00:a0:13:5c/00:00:00:00:00/ef Emask 0x9 (media error)
> ata4.00: status: { DRDY ERR }
> ata4.00: error: { UNC }
> ata4.00: configured for UDMA/133
> ata4.01: configured for UDMA/133
> ata4: EH complete
>
> I am not sure if this is related, though. It could be just a bad disk (it
> always seems to involve the same disk), so I'm going to replace the
> disk and see if that makes a difference.

That looks like a real disk error - it's getting uncorrectable read errors.
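
As a rough illustration of how the "res" line above reads, here is a small
decoder sketch. It rests on assumptions: that the field order in the log is
status/error:nsect:lbal:lbam:lbah/<five high-order bytes>/device, that the
failed command (READ DMA, c8) uses 28-bit LBA addressing, and that the bit
names in the two tables (only a subset) are the usual ATA ones. decode_res
is a hypothetical helper name.

```python
# Sketch only: field layout and bit tables are assumptions, see lead-in.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def flag_names(value, table):
    """Names of the bits from `table` that are set in `value`."""
    return [name for bit, name in table.items() if value & bit]


def decode_res(taskfile):
    """Decode a 'res' taskfile string such as
    '51/40:00:a0:13:5c/00:00:00:00:00/ef'."""
    status_s, low_s, _high_s, device_s = taskfile.split("/")
    status = int(status_s, 16)
    error, _nsect, lbal, lbam, lbah = (int(x, 16) for x in low_s.split(":"))
    device = int(device_s, 16)
    # LBA28: low nibble of the device register holds address bits 27..24.
    lba28 = ((device & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal
    return {
        "status_flags": flag_names(status, STATUS_BITS),
        "error_flags": flag_names(error, ERROR_BITS),
        "lba28": lba28,
    }


if __name__ == "__main__":
    result = decode_res("51/40:00:a0:13:5c/00:00:00:00:00/ef")
    print(result["status_flags"], result["error_flags"], hex(result["lba28"]))
```

If the layout assumption holds, the decoded flags agree with the
"{ DRDY ERR }" and "{ UNC }" lines in the log, and the sector address falls
inside the 8-sector read that the "cmd" line requested.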

    J
