From mboxrd@z Thu Jan 1 00:00:00 1970 From: Christoffer Dall Subject: Dom0 crash with apache bench (ab) Date: Tue, 28 Jul 2015 15:09:31 +0200 Message-ID: Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============6739562362146219345==" Return-path: List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: xen-devel@lists.xen.org List-Id: xen-devel@lists.xenproject.org --===============6739562362146219345== Content-Type: multipart/alternative; boundary=001a113552ee4783c9051bef2efb --001a113552ee4783c9051bef2efb Content-Type: text/plain; charset=UTF-8 Hi, I've been doing some performance comparisons lately, and wanted to measure the performance overhead of using Xen with apache bench, but unfortunately the Dom0 kernel crashes when hitting it with ab from a remote machine. Most other workloads seem to be stable; however, I do see similar crashes when hitting Dom0 mysql with a mysql benchmark with a high level of parallelism. I use a 10G Mellanox MX354A Dual port FDR CX3 adapter for networking on a Dell PowerEdge R320 system with a Xeon E5-2450 and 16 GB of RAM. Interestingly, we had a similar-looking issue on arm64 recently, but that was fixed with an APM-specific fix to the hypervisor, so I am guessing this is unrelated, see: http://lists.xenproject.org/archives/html/xen-devel/2015-03/msg02731.html and the fix: http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=50dcb3de603927db2fd87ba09e29c817415aaa44 I have tried with several Linux versions, v3.13, v3.18, v4.0-rc4, and v4.1, same issue. I have tried with Xen 4.5-0 release, and the Ubuntu packaged Xen 4.4 release, same issue. Examples of crash: http://pastebin.ubuntu.com/11953498/ http://pastebin.ubuntu.com/11953443/ Running DomU with a bridge and running ab against apache running in a DomU also causes the system to crash. 
Note: The server also has an embedded 1G Broadcom NIC (although not suitable for testing due to it being 1G and on a control network), and using that for the test does not cause a system crash, so this points to some difficulties with the Mellanox device and Xen. Any ideas or advice are greatly appreciated, thanks. -Christoffer --001a113552ee4783c9051bef2efb Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
--001a113552ee4783c9051bef2efb-- --===============6739562362146219345== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel --===============6739562362146219345==-- From mboxrd@z Thu Jan 1 00:00:00 1970 From: Konrad Rzeszutek Wilk Subject: Re: Dom0 crash with apache bench (ab) Date: Tue, 28 Jul 2015 10:50:22 -0400 Message-ID: <20150728145022.GE26623@x230.dumpdata.com> References: Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Return-path: Content-Disposition: inline In-Reply-To: List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Christoffer Dall Cc: xen-devel@lists.xen.org List-Id: xen-devel@lists.xenproject.org On Tue, Jul 28, 2015 at 03:09:31PM +0200, Christoffer Dall wrote: > Hi, > > I've been doing some performance comparisons lately, and wanted to compare > the performance overhead of using Xen with apache bench, but unfortunately > the Dom0 kernel crashes when hitting it with ab from a remote machine. > Most other workloads seem to be stable, however, I do see similar crashes > if hitting Dom0 mysql with a mysql benchmark with a high level of > parallelism. > > I use a 10G Mellanox MX354A Dual port FDR CX3 adapter for networking on a > Dell PowerEdge R320 system with a Xeon E5-2450 and 16 GB of RAM. 
> > Interestingly, we had a similarly looking issue on arm64 recently, but that > was fixed with an APM-soecific fix to the hypervisor, so I am guessing this > is unrelated, see: > http://lists.xenproject.org/archives/html/xen-devel/2015-03/msg02731.html > and the fix: > http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=50dcb3de603927db2fd87ba09e29c817415aaa44 > > I have tried with several Linux versions, v3.13, v3.18, v4.0-rc4, and v4.1, > same issue. I have tried with Xen 4.5-0 release, and the Ubuntu packaged > Xen 4.4 release, same issue. > > Examples of crash: > http://pastebin.ubuntu.com/11953498/ > http://pastebin.ubuntu.com/11953443/ 4.0-rc4? Have you tried 4.1? > > Running DomU with a bridge and running ab against apache running in a DomU > also causes the system to crash. > > Note: The server also has an embedded 1G Broadcom NIC (although not > suitable for testing due to it being 1G and on a control network), and > using that for the test does not cause a system crash, so this points to > some difficulties with the Mellanox device and Xen. > > Any ideas or advice is greatly appreciated, thanks. 
> > -Christoffer > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ian Campbell Subject: Re: Dom0 crash with apache bench (ab) Date: Tue, 28 Jul 2015 15:55:19 +0100 Message-ID: <1438095319.11600.165.camel@citrix.com> References: <20150728145022.GE26623@x230.dumpdata.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20150728145022.GE26623@x230.dumpdata.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Konrad Rzeszutek Wilk , Christoffer Dall Cc: xen-devel@lists.xen.org List-Id: xen-devel@lists.xenproject.org On Tue, 2015-07-28 at 10:50 -0400, Konrad Rzeszutek Wilk wrote: > On Tue, Jul 28, 2015 at 03:09:31PM +0200, Christoffer Dall wrote: > > Hi, > > > > I've been doing some performance comparisons lately, and wanted to > > compare > > the performance overhead of using Xen with apache bench, but > > unfortunately > > the Dom0 kernel crashes when hitting it with ab from a remote machine. > > Most other workloads seem to be stable, however, I do see similar > > crashes > > if hitting Dom0 mysql with a mysql benchmark with a high level of > > parallelism. > > > > I use a 10G Mellanox MX354A Dual port FDR CX3 adapter for networking on > > a > > Dell PowerEdge R320 system with a Xeon E5-2450 and 16 GB of RAM. 
> > > > Interestingly, we had a similarly looking issue on arm64 recently, but > > that > > was fixed with an APM-soecific fix to the hypervisor, so I am guessing > > this > > is unrelated, see: > > http://lists.xenproject.org/archives/html/xen-devel/2015 > > -03/msg02731.html > > and the fix: > > http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=50dcb3de603927db2fd > > 87ba09e29c817415aaa44 > > > > I have tried with several Linux versions, v3.13, v3.18, v4.0-rc4, and > > v4.1, > > same issue. I have tried with Xen 4.5-0 release, and the Ubuntu > > packaged > > Xen 4.4 release, same issue. > > > > Examples of crash: > > http://pastebin.ubuntu.com/11953498/ > > http://pastebin.ubuntu.com/11953443/ > > 4.0-rc4? > > Have you tried 4.1? According to the previous paragraph, yes he has. > > > > > Running DomU with a bridge and running ab against apache running in a > > DomU > > also causes the system to crash. > > > > Note: The server also has an embedded 1G Broadcom NIC (although not > > suitable for testing due to it being 1G and on a control network), and > > using that for the test does not cause a system crash, so this points > > to > > some difficulties with the Mellanox device and Xen. > > > > Any ideas or advice is greatly appreciated, thanks. 
> > > > -Christoffer > > > _______________________________________________ > > Xen-devel mailing list > > Xen-devel@lists.xen.org > > http://lists.xen.org/xen-devel > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel From mboxrd@z Thu Jan 1 00:00:00 1970 From: Christoffer Dall Subject: Re: Dom0 crash with apache bench (ab) Date: Tue, 28 Jul 2015 17:00:50 +0200 Message-ID: References: <20150728145022.GE26623@x230.dumpdata.com> <1438095319.11600.165.camel@citrix.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============6860915096488304807==" Return-path: In-Reply-To: <1438095319.11600.165.camel@citrix.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Ian Campbell Cc: xen-devel@lists.xen.org List-Id: xen-devel@lists.xenproject.org --===============6860915096488304807== Content-Type: multipart/alternative; boundary=001a113592566e62e3051bf0bc4c --001a113592566e62e3051bf0bc4c Content-Type: text/plain; charset=UTF-8 On Tue, Jul 28, 2015 at 4:55 PM, Ian Campbell wrote: > On Tue, 2015-07-28 at 10:50 -0400, Konrad Rzeszutek Wilk wrote: > > On Tue, Jul 28, 2015 at 03:09:31PM +0200, Christoffer Dall wrote: > > > Hi, > > > > > > I've been doing some performance comparisons lately, and wanted to > > > compare > > > the performance overhead of using Xen with apache bench, but > > > unfortunately > > > the Dom0 kernel crashes when hitting it with ab from a remote machine. > > > Most other workloads seem to be stable, however, I do see similar > > > crashes > > > if hitting Dom0 mysql with a mysql benchmark with a high level of > > > parallelism. > > > > > > I use a 10G Mellanox MX354A Dual port FDR CX3 adapter for networking on > > > a > > > Dell PowerEdge R320 system with a Xeon E5-2450 and 16 GB of RAM. 
> > > > > > Interestingly, we had a similarly looking issue on arm64 recently, but > > > that > > > was fixed with an APM-soecific fix to the hypervisor, so I am guessing > > > this > > > is unrelated, see: > > > http://lists.xenproject.org/archives/html/xen-devel/2015 > > > -03/msg02731.html > > > and the fix: > > > > http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=50dcb3de603927db2fd > > > 87ba09e29c817415aaa44 > > > > > > I have tried with several Linux versions, v3.13, v3.18, v4.0-rc4, and > > > v4.1, > > > same issue. I have tried with Xen 4.5-0 release, and the Ubuntu > > > packaged > > > Xen 4.4 release, same issue. > > > > > > Examples of crash: > > > http://pastebin.ubuntu.com/11953498/ > > > http://pastebin.ubuntu.com/11953443/ > > > > 4.0-rc4? > > > > Have you tried 4.1? > > According to the previous paragraph, yes he has. > > yes, I have. Just for clarify, I used 4.0-rc4 because that's a branch which contained arm64 PCI support and has been used for other measurements, so this was simply my 'working tree'. Thanks, -Christoffer --001a113592566e62e3051bf0bc4c Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable


--001a113592566e62e3051bf0bc4c-- --===============6860915096488304807== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel --===============6860915096488304807==-- From mboxrd@z Thu Jan 1 00:00:00 1970 From: Stefano Stabellini Subject: Re: Dom0 crash with apache bench (ab) Date: Fri, 31 Jul 2015 11:24:59 +0100 Message-ID: References: <20150728145022.GE26623@x230.dumpdata.com> <1438095319.11600.165.camel@citrix.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="1342847746-109243297-1438338305=:11337" Return-path: In-Reply-To: List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Christoffer Dall Cc: David Vrabel , Wei Liu , Ian Campbell , xen-devel@lists.xen.org List-Id: xen-devel@lists.xenproject.org --1342847746-109243297-1438338305=:11337 Content-Type: text/plain; charset="UTF-8" Content-Length: 4107 Content-Transfer-Encoding: quoted-printable This is a Linux Dom0 crash on x86 (Dell PowerEdge R320, Xeon E5-2450), CC'ing relevant people. 
As you can see from the links below the crash is:

[ 253.619326] Call Trace:
[ 253.619330]
[ 253.619332] [] ? skb_copy_ubufs+0xa5/0x230
[ 253.619347] [] __netif_receive_skb_core+0x6f5/0x940
[ 253.619353] [] __netif_receive_skb+0x18/0x60
[ 253.619360] [] netif_receive_skb_internal+0x28/0x90
[ 253.619366] [] napi_gro_frags+0x125/0x1a0
[ 253.619378] [] mlx4_en_process_rx_cq+0x753/0xb50 [mlx4_en]
[ 253.619387] [] mlx4_en_poll_rx_cq+0x97/0x160 [mlx4_en]
[ 253.619393] [] net_rx_action+0x13d/0x2f0
[ 253.619400] [] __do_softirq+0xda/0x1f0
[ 253.619406] [] irq_exit+0x9d/0xb0
[ 253.619412] [] xen_evtchn_do_upcall+0x35/0x50
[ 253.619420] [] xen_do_hypervisor_callback+0x1e/0x40
[ 253.619423]
[ 253.619426] [] ? shrink_dcache_for_umount+0x90/0x90
[ 253.619437] [] ? d_alloc_pseudo+0x9/0x10
[ 253.619443] [] ? sock_alloc_file+0x4d/0x120
[ 253.619448] [] ? SYSC_accept4+0xb8/0x200
[ 253.619454] [] ? SyS_epoll_wait+0x87/0xe0
[ 253.619459] [] ? SyS_accept4+0x9/0x10
[ 253.619465] [] ? system_call_fastpath+0x16/0x1b
[ 253.619469] Code: 4e 48 83 c4 08 5b 5d c3 66 0f 1f 44 00 00 e8 6b fc ff ff eb e1 90 90 90 90 90 90 90 90 90 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 48 a5 89 d1 f3 a4 c3 20 4c 8b 06 4c 8b 4e 08 4c 8b 56 10 4c
[ 253.619513] RIP [] __memcpy+0xd/0x110
[ 253.619520] RSP
[ 253.619524] ---[ end trace ba5d35a466b03856 ]---

On Tue, 28 Jul 2015, Christoffer Dall wrote:
> On Tue, Jul 28, 2015 at 4:55 PM, Ian Campbell wrote:
> On Tue, 2015-07-28 at 10:50 -0400, Konrad Rzeszutek Wilk wrote:
> > On Tue, Jul 28, 2015 at 03:09:31PM +0200, Christoffer Dall wrote:
> > > Hi,
> > >
> > > I've been doing some performance comparisons lately, and wanted to
> > > compare
> > > the performance overhead of using Xen with apache bench, but
> > > unfortunately
> > > the Dom0 kernel crashes when hitting it with ab from a remote machine.
> > > Most other workloads seem to be stable, however, I do see similar
> > > crashes
> > > if hitting Dom0 mysql with a mysql benchmark with a high level of
> > > parallelism.
> > >
> > > I use a 10G Mellanox MX354A Dual port FDR CX3 adapter for networking on
> > > a
> > > Dell PowerEdge R320 system with a Xeon E5-2450 and 16 GB of RAM.
> > >
> > > Interestingly, we had a similarly looking issue on arm64 recently, but
> > > that
> > > was fixed with an APM-soecific fix to the hypervisor, so I am guessing
> > > this
> > > is unrelated, see:
> > > http://lists.xenproject.org/archives/html/xen-devel/2015
> > > -03/msg02731.html
> > > and the fix:
> > > http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=50dcb3de603927db2fd
> > > 87ba09e29c817415aaa44
> > >
> > > I have tried with several Linux versions, v3.13, v3.18, v4.0-rc4, and
> > > v4.1,
> > > same issue. I have tried with Xen 4.5-0 release, and the Ubuntu
> > > packaged
> > > Xen 4.4 release, same issue.
> > >
> > > Examples of crash:
> > > http://pastebin.ubuntu.com/11953498/
> > > http://pastebin.ubuntu.com/11953443/
> >
> > 4.0-rc4?
> >
> > Have you tried 4.1?
>
> According to the previous paragraph, yes he has.
>
> yes, I have. Just for clarify, I used 4.0-rc4 because that's a branch which contained arm64 PCI support and has
> been used for other measurements, so this was simply my 'working tree'.
--1342847746-109243297-1438338305=:11337 Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel --1342847746-109243297-1438338305=:11337-- From mboxrd@z Thu Jan 1 00:00:00 1970 From: David Vrabel Subject: Re: Dom0 crash with apache bench (ab) Date: Fri, 31 Jul 2015 11:28:10 +0100 Message-ID: <55BB4DBA.2040909@citrix.com> References: <20150728145022.GE26623@x230.dumpdata.com> <1438095319.11600.165.camel@citrix.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Stefano Stabellini , Christoffer Dall Cc: Wei Liu , Ian Campbell , xen-devel@lists.xen.org List-Id: xen-devel@lists.xenproject.org On 31/07/15 11:24, Stefano Stabellini wrote: > This is a Linux Dom0 crash on x86 (Dell PowerEdge R320, Xeon E5-2450), > CC'ing relevant people. As you can see from the links below the crash > is: > > [ 253.619326] Call Trace: > [ 253.619330] > [ 253.619332] [] ? skb_copy_ubufs+0xa5/0x230 > [ 253.619347] [] __netif_receive_skb_core+0x6f5/0x940 > [ 253.619353] [] __netif_receive_skb+0x18/0x60 > [ 253.619360] [] netif_receive_skb_internal+0x28/0x90 > [ 253.619366] [] napi_gro_frags+0x125/0x1a0 > [ 253.619378] [] mlx4_en_process_rx_cq+0x753/0xb50 [mlx4_en] > [ 253.619387] [] mlx4_en_poll_rx_cq+0x97/0x160 [mlx4_en] What makes you think this is Xen-specific? I suggest raising this with the mlx4 maintainers. David > [ 253.619393] [] net_rx_action+0x13d/0x2f0 > [ 253.619400] [] __do_softirq+0xda/0x1f0 > [ 253.619406] [] irq_exit+0x9d/0xb0 > [ 253.619412] [] xen_evtchn_do_upcall+0x35/0x50 > [ 253.619420] [] xen_do_hypervisor_callback+0x1e/0x40 > [ 253.619423] > [ 253.619426] [] ? 
shrink_dcache_for_umount+0x90/0x90 > [ 253.619437] [] ? d_alloc_pseudo+0x9/0x10 > [ 253.619443] [] ? sock_alloc_file+0x4d/0x120 > [ 253.619448] [] ? SYSC_accept4+0xb8/0x200 > [ 253.619454] [] ? SyS_epoll_wait+0x87/0xe0 > [ 253.619459] [] ? SyS_accept4+0x9/0x10 > [ 253.619465] [] ? system_call_fastpath+0x16/0x1b > [ 253.619469] Code: 4e 48 83 c4 08 5b 5d c3 66 0f 1f 44 00 00 e8 6b fc > ff ff eb e1 90 90 90 90 90 90 90 90 90 48 89 f8 48 89 d1 48 c1 e9 03 83 > e2 07 > 48 a5 89 d1 f3 a4 c3 20 4c 8b 06 4c 8b 4e 08 4c 8b 56 10 4c > [ 253.619513] RIP [] __memcpy+0xd/0x110 > [ 253.619520] RSP > [ 253.619524] ---[ end trace ba5d35a466b03856 ]--- > > On Tue, 28 Jul 2015, Christoffer Dall wrote: >> On Tue, Jul 28, 2015 at 4:55 PM, Ian Campbell wrote: >> On Tue, 2015-07-28 at 10:50 -0400, Konrad Rzeszutek Wilk wrote: >> > On Tue, Jul 28, 2015 at 03:09:31PM +0200, Christoffer Dall wrote: >> > > Hi, >> > > >> > > I've been doing some performance comparisons lately, and wanted to >> > > compare >> > > the performance overhead of using Xen with apache bench, but >> > > unfortunately >> > > the Dom0 kernel crashes when hitting it with ab from a remote machine. >> > > Most other workloads seem to be stable, however, I do see similar >> > > crashes >> > > if hitting Dom0 mysql with a mysql benchmark with a high level of >> > > parallelism. >> > > >> > > I use a 10G Mellanox MX354A Dual port FDR CX3 adapter for networking on >> > > a >> > > Dell PowerEdge R320 system with a Xeon E5-2450 and 16 GB of RAM. 
>> > > >> > > Interestingly, we had a similarly looking issue on arm64 recently, but >> > > that >> > > was fixed with an APM-soecific fix to the hypervisor, so I am guessing >> > > this >> > > is unrelated, see: >> > > http://lists.xenproject.org/archives/html/xen-devel/2015 >> > > -03/msg02731.html >> > > and the fix: >> > > http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=50dcb3de603927db2fd >> > > 87ba09e29c817415aaa44 >> > > >> > > I have tried with several Linux versions, v3.13, v3.18, v4.0-rc4, and >> > > v4.1, >> > > same issue. I have tried with Xen 4.5-0 release, and the Ubuntu >> > > packaged >> > > Xen 4.4 release, same issue. >> > > >> > > Examples of crash: >> > > http://pastebin.ubuntu.com/11953498/ >> > > http://pastebin.ubuntu.com/11953443/ >> > >> > 4.0-rc4? >> > >> > Have you tried 4.1? >> >> According to the previous paragraph, yes he has. >> >> yes, I have. Just for clarify, I used 4.0-rc4 because that's a branch which contained arm64 PCI support and has >> been used for other measurements, so this was simply my 'working tree'. 
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Christoffer Dall Subject: Re: Dom0 crash with apache bench (ab) Date: Fri, 31 Jul 2015 15:17:56 +0200 Message-ID: References: <20150728145022.GE26623@x230.dumpdata.com> <1438095319.11600.165.camel@citrix.com> <55BB4DBA.2040909@citrix.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============7600058446644444964==" Return-path: In-Reply-To: <55BB4DBA.2040909@citrix.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: David Vrabel Cc: xen-devel@lists.xen.org, Wei Liu , Ian Campbell , Stefano Stabellini List-Id: xen-devel@lists.xenproject.org --===============7600058446644444964== Content-Type: multipart/alternative; boundary=001a1134f258f33362051c2ba536 --001a1134f258f33362051c2ba536 Content-Type: text/plain; charset=UTF-8 On Fri, Jul 31, 2015 at 12:28 PM, David Vrabel wrote: > On 31/07/15 11:24, Stefano Stabellini wrote: > > This is a Linux Dom0 crash on x86 (Dell PowerEdge R320, Xeon E5-2450), > > CC'ing relevant people. As you can see from the links below the crash > > is: > > > > [ 253.619326] Call Trace: > > [ 253.619330] > > [ 253.619332] [] ? skb_copy_ubufs+0xa5/0x230 > > [ 253.619347] [] __netif_receive_skb_core+0x6f5/0x940 > > [ 253.619353] [] __netif_receive_skb+0x18/0x60 > > [ 253.619360] [] netif_receive_skb_internal+0x28/0x90 > > [ 253.619366] [] napi_gro_frags+0x125/0x1a0 > > [ 253.619378] [] mlx4_en_process_rx_cq+0x753/0xb50 > [mlx4_en] > > [ 253.619387] [] mlx4_en_poll_rx_cq+0x97/0x160 > [mlx4_en] > > What makes you think this is Xen specific? I suggest raising this the > the mlx4 maintainers. > > Linux native and KVM guests (same hw, same kernel version+config) run just fine under the same workload. Thanks, -Christoffer --001a1134f258f33362051c2ba536 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable


--001a1134f258f33362051c2ba536-- --===============7600058446644444964== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel --===============7600058446644444964==-- From mboxrd@z Thu Jan 1 00:00:00 1970 From: Christoffer Dall Subject: Re: Dom0 crash with apache bench (ab) Date: Mon, 14 Sep 2015 14:40:08 +0200 Message-ID: <20150914124008.GA17195@cbox> References: <20150728145022.GE26623@x230.dumpdata.com> <1438095319.11600.165.camel@citrix.com> <55BB4DBA.2040909@citrix.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Return-path: Content-Disposition: inline In-Reply-To: List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Christoffer Dall Cc: Ian Campbell , xen-devel@lists.xen.org, Wei Liu , David Vrabel , Stefano Stabellini List-Id: xen-devel@lists.xenproject.org On Fri, Jul 31, 2015 at 03:17:56PM +0200, Christoffer Dall wrote: > On Fri, Jul 31, 2015 at 12:28 PM, David Vrabel > wrote: > > > On 31/07/15 11:24, Stefano Stabellini wrote: > > > This is a Linux Dom0 crash on x86 (Dell PowerEdge R320, Xeon E5-2450), > > > CC'ing relevant people. As you can see from the links below the crash > > > is: > > > > > > [ 253.619326] Call Trace: > > > [ 253.619330] > > > [ 253.619332] [] ? skb_copy_ubufs+0xa5/0x230 > > > [ 253.619347] [] __netif_receive_skb_core+0x6f5/0x940 > > > [ 253.619353] [] __netif_receive_skb+0x18/0x60 > > > [ 253.619360] [] netif_receive_skb_internal+0x28/0x90 > > > [ 253.619366] [] napi_gro_frags+0x125/0x1a0 > > > [ 253.619378] [] mlx4_en_process_rx_cq+0x753/0xb50 > > [mlx4_en] > > > [ 253.619387] [] mlx4_en_poll_rx_cq+0x97/0x160 > > [mlx4_en] > > > > What makes you think this is Xen specific? 
I suggest raising this the > > the mlx4 maintainers. > > > > > Linux native and KVM guests (same hw, same kernel version+config) run just > fine under the same workload. > Ping? From the fact that bare-metal and KVM work fine with this hardware I still think it's reasonable to assume that it's a Xen issue and not an mlx4 issue. Is this completely flawed? Thanks, -Christoffer From mboxrd@z Thu Jan 1 00:00:00 1970 From: Konrad Rzeszutek Wilk Subject: Re: Dom0 crash with apache bench (ab) Date: Mon, 14 Sep 2015 11:11:08 -0400 Message-ID: <20150914144957.GC19667@x230> References: <20150728145022.GE26623@x230.dumpdata.com> <1438095319.11600.165.camel@citrix.com> <55BB4DBA.2040909@citrix.com> <20150914124008.GA17195@cbox> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Return-path: Content-Disposition: inline In-Reply-To: <20150914124008.GA17195@cbox> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Christoffer Dall Cc: Wei Liu , Ian Campbell , Stefano Stabellini , xen-devel@lists.xen.org, Christoffer Dall , David Vrabel List-Id: xen-devel@lists.xenproject.org On Mon, Sep 14, 2015 at 02:40:08PM +0200, Christoffer Dall wrote: > On Fri, Jul 31, 2015 at 03:17:56PM +0200, Christoffer Dall wrote: > > On Fri, Jul 31, 2015 at 12:28 PM, David Vrabel > > wrote: > > > > > On 31/07/15 11:24, Stefano Stabellini wrote: > > > > This is a Linux Dom0 crash on x86 (Dell PowerEdge R320, Xeon E5-2450), > > > > CC'ing relevant people. As you can see from the links below the crash > > > > is: > > > > > > > > [ 253.619326] Call Trace: > > > > [ 253.619330] > > > > [ 253.619332] [] ? 
skb_copy_ubufs+0xa5/0x230 > > > > [ 253.619347] [] __netif_receive_skb_core+0x6f5/0x940 > > > > [ 253.619353] [] __netif_receive_skb+0x18/0x60 > > > > [ 253.619360] [] netif_receive_skb_internal+0x28/0x90 > > > > [ 253.619366] [] napi_gro_frags+0x125/0x1a0 > > > > [ 253.619378] [] mlx4_en_process_rx_cq+0x753/0xb50 > > > [mlx4_en] > > > > [ 253.619387] [] mlx4_en_poll_rx_cq+0x97/0x160 > > > [mlx4_en] > > > > > > What makes you think this is Xen specific? I suggest raising this the > > > the mlx4 maintainers. > > > > > > > > Linux native and KVM guests (same hw, same kernel version+config) run just > > fine under the same workload. > > > Ping? > > >From the fact that bare-metal and KVM works fine with this hardware I > still think it's reasonable to assume that it's a Xen issue and not a > mlx4 issue. > > Is this completely flawed? I have a feeling it is an mlx4 issue but you don't easily reproduce it under baremetal. Is there any way you could boot baremetal with 'iommu=soft swiotlb=force' to see if you can reproduce it under those conditions? thanks! 
> > Thanks, > -Christoffer > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ian Campbell Subject: Re: Dom0 crash with apache bench (ab) Date: Mon, 14 Sep 2015 16:20:24 +0100 Message-ID: <1442244024.3549.300.camel@citrix.com> References: <20150728145022.GE26623@x230.dumpdata.com> <1438095319.11600.165.camel@citrix.com> <55BB4DBA.2040909@citrix.com> <20150914124008.GA17195@cbox> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20150914124008.GA17195@cbox> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Christoffer Dall , Christoffer Dall Cc: xen-devel@lists.xen.org, Wei Liu , David Vrabel , Stefano Stabellini List-Id: xen-devel@lists.xenproject.org On Mon, 2015-09-14 at 14:40 +0200, Christoffer Dall wrote: > On Fri, Jul 31, 2015 at 03:17:56PM +0200, Christoffer Dall wrote: > > On Fri, Jul 31, 2015 at 12:28 PM, David Vrabel > > > > wrote: > > > > > On 31/07/15 11:24, Stefano Stabellini wrote: > > > > This is a Linux Dom0 crash on x86 (Dell PowerEdge R320, Xeon E5 > > > > -2450), > > > > CC'ing relevant people. As you can see from the links below the > > > > crash > > > > is: > > > > > > > > [ 253.619326] Call Trace: > > > > [ 253.619330] > > > > [ 253.619332] [] ? skb_copy_ubufs+0xa5/0x230 > > > > [ 253.619347] [] > > > > __netif_receive_skb_core+0x6f5/0x940 > > > > [ 253.619353] [] __netif_receive_skb+0x18/0x60 > > > > [ 253.619360] [] > > > > netif_receive_skb_internal+0x28/0x90 > > > > [ 253.619366] [] napi_gro_frags+0x125/0x1a0 > > > > [ 253.619378] [] > > > > mlx4_en_process_rx_cq+0x753/0xb50 > > > [mlx4_en] > > > > [ 253.619387] [] mlx4_en_poll_rx_cq+0x97/0x160 > > > [mlx4_en] > > > > > > What makes you think this is Xen specific? 
I suggest raising this > > > the > > > the mlx4 maintainers. > > > > > > > > Linux native and KVM guests (same hw, same kernel version+config) run > > just > > fine under the same workload. > > > Ping? > > From the fact that bare-metal and KVM works fine with this hardware I > still think it's reasonable to assume that it's a Xen issue and not a > mlx4 issue. > > Is this completely flawed?

My (somewhat educated) guess is that this is to do with the difference between (pseudo-)physical addresses and machine (AKA real-physical) addresses when running under Xen.

The way this often shows up is in drivers which do not make correct use of the kernel's DMA APIs but which happen to work on native x86 because physical==bus address on x86.

Sometimes booting natively with 'iommu=soft swiotlb=force' can expose these sorts of issues.

You are running 64-bit so I don't think the recent "config: Enable NEED_DMA_MAP_STATE by default when SWIOTLB is selected" is likely to be relevant (it's already unconditionally on for 64-bit).

The trace appears to be on rx from a physical NIC, so there shouldn't be any magic Xen stuff (granted pages etc.) getting into that path at all. If it were tx then maybe it might be an issue with foreign pages. In any case I think you are able to repro with just dom0, i.e. never having started a domU, is that right?

Ian.
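[Editorial sketch] The physical-versus-machine address distinction described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyPlatform`, `dma_map`, `careless_bus_addr`, `careful_bus_addr` and the frame numbers are all hypothetical names and values made up for illustration, not kernel or Xen APIs, and nothing here shows that mlx4 actually contains such a bug (the thread only hypothesizes it).

```python
PAGE_SIZE = 4096  # 4 KiB pages, as on x86


class ToyPlatform:
    """Hypothetical model of how a platform turns a kernel-visible
    (pseudo-)physical address into the bus address a device must use."""

    def __init__(self, p2m=None):
        # p2m maps pseudo-physical frame numbers to machine frame numbers.
        # None models native x86 with no remapping: physical == bus.
        self.p2m = p2m

    def dma_map(self, phys_addr):
        """Stand-in for going through a DMA mapping layer."""
        pfn, offset = divmod(phys_addr, PAGE_SIZE)
        mfn = pfn if self.p2m is None else self.p2m[pfn]
        return mfn * PAGE_SIZE + offset


def careless_bus_addr(platform, phys_addr):
    # The mistake being illustrated: the physical address is handed
    # to the device as-is, without asking the platform to translate it.
    return phys_addr


def careful_bus_addr(platform, phys_addr):
    # Always ask the platform for the address the device should use.
    return platform.dma_map(phys_addr)


native = ToyPlatform()
xen_dom0 = ToyPlatform(p2m={0: 7, 1: 3, 2: 9})  # made-up frame numbers

# A buffer at offset 0x80 inside pseudo-physical frame 1.
buf = 1 * PAGE_SIZE + 0x80

for name, platform in (("native", native), ("xen dom0", xen_dom0)):
    careless = careless_bus_addr(platform, buf)
    careful = careful_bus_addr(platform, buf)
    print(f"{name}: careless={careless:#x} careful={careful:#x} "
          f"match={careless == careful}")
```

In this model the two approaches agree on the identity-mapped "native" platform and disagree once frames are remapped, which is how such a mistake could stay hidden on bare metal. Forcing bounce buffers with 'swiotlb=force' on bare metal presumably has a comparable effect, since the address given to the device then no longer equals the buffer's own physical address.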
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Christoffer Dall Subject: Re: Dom0 crash with apache bench (ab) Date: Mon, 14 Sep 2015 18:16:18 +0200 Message-ID: References: <20150728145022.GE26623@x230.dumpdata.com> <1438095319.11600.165.camel@citrix.com> <55BB4DBA.2040909@citrix.com> <20150914124008.GA17195@cbox> <1442244024.3549.300.camel@citrix.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============2073014019314485857==" Return-path: In-Reply-To: <1442244024.3549.300.camel@citrix.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Ian Campbell Cc: xen-devel@lists.xen.org, Wei Liu , David Vrabel , Stefano Stabellini List-Id: xen-devel@lists.xenproject.org --===============2073014019314485857== Content-Type: multipart/alternative; boundary=001a1135b85eb4d309051fb762f6 --001a1135b85eb4d309051fb762f6 Content-Type: text/plain; charset=UTF-8 On Mon, Sep 14, 2015 at 5:20 PM, Ian Campbell wrote: > On Mon, 2015-09-14 at 14:40 +0200, Christoffer Dall wrote: > > On Fri, Jul 31, 2015 at 03:17:56PM +0200, Christoffer Dall wrote: > > > On Fri, Jul 31, 2015 at 12:28 PM, David Vrabel < > david.vrabel@citrix.com > > > > > > > wrote: > > > > > > > On 31/07/15 11:24, Stefano Stabellini wrote: > > > > > This is a Linux Dom0 crash on x86 (Dell PowerEdge R320, Xeon E5 > > > > > -2450), > > > > > CC'ing relevant people. As you can see from the links below the > > > > > crash > > > > > is: > > > > > > > > > > [ 253.619326] Call Trace: > > > > > [ 253.619330] > > > > > [ 253.619332] [] ? 
skb_copy_ubufs+0xa5/0x230 > > > > > [ 253.619347] [] > > > > > __netif_receive_skb_core+0x6f5/0x940 > > > > > [ 253.619353] [] __netif_receive_skb+0x18/0x60 > > > > > [ 253.619360] [] > > > > > netif_receive_skb_internal+0x28/0x90 > > > > > [ 253.619366] [] napi_gro_frags+0x125/0x1a0 > > > > > [ 253.619378] [] > > > > > mlx4_en_process_rx_cq+0x753/0xb50 > > > > [mlx4_en] > > > > > [ 253.619387] [] mlx4_en_poll_rx_cq+0x97/0x160 > > > > [mlx4_en] > > > > > > > > What makes you think this is Xen specific? I suggest raising this > > > > the > > > > the mlx4 maintainers. > > > > > > > > > > > Linux native and KVM guests (same hw, same kernel version+config) run > > > just > > > fine under the same workload. > > > > > Ping? > > > > From the fact that bare-metal and KVM works fine with this hardware I > > still think it's reasonable to assume that it's a Xen issue and not a > > mlx4 issue. > > > > Is this completely flawed? > > My (somewhat educated) guess is that this is to do with the difference > between (pseudo-)physical addresses and machine (AKA real-physical) > addresses when running under Xen. > > The way this often shows up is in drivers which do not make correct use of > the kernels DMA APIs but which happen to work on native x86 because > physical==bus address on x86. > > Sometimes booting natively with 'iommu=soft swiotlb=force' can expose these > sorts of issues. > I'll give this a try. > > You are running 64-bit so I don't think the recent "config: Enable > NEED_DMA_MAP_STATE by default when SWIOTLB is selected" is likely to be > relevant (it's already unconditionally on for 64-bit). > > The trace appears to be on rx from a physical nic, there shouldn't be any > magic Xen stuff (granted pages etc) getting themselves into that path at > all. If it were tx then maybe it might be an issue with foreign pages. In > any case I think you are able to repro with just dom0, i.e. never having > started a domU, is that right? 
> As far as I remember and as far as I can interpret my own e-mail, yes. Thanks for the feedback, I'll try the suggested approaches and also try using v4.3-rc1 and take it up with the mlx4 maintainers if I still see the issue. -Christoffer --001a1135b85eb4d309051fb762f6 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable


--001a1135b85eb4d309051fb762f6-- --===============2073014019314485857== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel --===============2073014019314485857==-- From mboxrd@z Thu Jan 1 00:00:00 1970 From: Christoffer Dall Subject: Re: Dom0 crash with apache bench (ab) Date: Mon, 28 Sep 2015 22:53:33 +0200 Message-ID: References: <20150728145022.GE26623@x230.dumpdata.com> <1438095319.11600.165.camel@citrix.com> <55BB4DBA.2040909@citrix.com> <20150914124008.GA17195@cbox> <1442244024.3549.300.camel@citrix.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============8619645868240721185==" Return-path: In-Reply-To: <1442244024.3549.300.camel@citrix.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Ian Campbell Cc: xen-devel@lists.xen.org, Wei Liu , David Vrabel , Stefano Stabellini List-Id: xen-devel@lists.xenproject.org --===============8619645868240721185== Content-Type: multipart/alternative; boundary=001a1149021ef6bff30520d4e3eb --001a1149021ef6bff30520d4e3eb Content-Type: text/plain; charset=UTF-8 On Mon, Sep 14, 2015 at 5:20 PM, Ian Campbell wrote: > On Mon, 2015-09-14 at 14:40 +0200, Christoffer Dall wrote: > > On Fri, Jul 31, 2015 at 03:17:56PM +0200, Christoffer Dall wrote: > > > On Fri, Jul 31, 2015 at 12:28 PM, David Vrabel < > david.vrabel@citrix.com > > > > > > > wrote: > > > > > > > On 31/07/15 11:24, Stefano Stabellini wrote: > > > > > This is a Linux Dom0 crash on x86 (Dell PowerEdge R320, Xeon E5 > > > > > -2450), > > > > > CC'ing relevant people. As you can see from the links below the > > > > > crash > > > > > is: > > > > > > > > > > [ 253.619326] Call Trace: > > > > > [ 253.619330] > > > > > [ 253.619332] [] ? 
skb_copy_ubufs+0xa5/0x230 > > > > > [ 253.619347] [] > > > > > __netif_receive_skb_core+0x6f5/0x940 > > > > > [ 253.619353] [] __netif_receive_skb+0x18/0x60 > > > > > [ 253.619360] [] > > > > > netif_receive_skb_internal+0x28/0x90 > > > > > [ 253.619366] [] napi_gro_frags+0x125/0x1a0 > > > > > [ 253.619378] [] > > > > > mlx4_en_process_rx_cq+0x753/0xb50 > > > > [mlx4_en] > > > > > [ 253.619387] [] mlx4_en_poll_rx_cq+0x97/0x160 > > > > [mlx4_en] > > > > > > > > What makes you think this is Xen specific? I suggest raising this > > > > the > > > > the mlx4 maintainers. > > > > > > > > > > > Linux native and KVM guests (same hw, same kernel version+config) run > > > just > > > fine under the same workload. > > > > > Ping? > > > > From the fact that bare-metal and KVM works fine with this hardware I > > still think it's reasonable to assume that it's a Xen issue and not a > > mlx4 issue. > > > > Is this completely flawed? > > My (somewhat educated) guess is that this is to do with the difference > between (pseudo-)physical addresses and machine (AKA real-physical) > addresses when running under Xen. > > The way this often shows up is in drivers which do not make correct use of > the kernels DMA APIs but which happen to work on native x86 because > physical==bus address on x86. > > Sometimes booting natively with 'iommu=soft swiotlb=force' can expose these > sorts of issues. > Indeed it does, on both v4.0 and v4.3-rc2. > > You are running 64-bit so I don't think the recent "config: Enable > NEED_DMA_MAP_STATE by default when SWIOTLB is selected" is likely to be > relevant (it's already unconditionally on for 64-bit). > > The trace appears to be on rx from a physical nic, there shouldn't be any > magic Xen stuff (granted pages etc) getting themselves into that path at > all. If it were tx then maybe it might be an issue with foreign pages. In > any case I think you are able to repro with just dom0, i.e. never having > started a domU, is that right? 
> > Yes, I can reproduce on Dom0. I will send this to the Mellanox people. Thanks, -Christoffer --001a1149021ef6bff30520d4e3eb Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable



--001a1149021ef6bff30520d4e3eb-- --===============8619645868240721185== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel --===============8619645868240721185==-- From mboxrd@z Thu Jan 1 00:00:00 1970 From: Konrad Rzeszutek Wilk Subject: Re: Dom0 crash with apache bench (ab) Date: Wed, 30 Sep 2015 11:12:13 -0400 Message-ID: <20150930151213.GC30549@localhost.localdomain> References: <20150728145022.GE26623@x230.dumpdata.com> <1438095319.11600.165.camel@citrix.com> <55BB4DBA.2040909@citrix.com> <20150914124008.GA17195@cbox> <1442244024.3549.300.camel@citrix.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Return-path: Content-Disposition: inline In-Reply-To: List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Christoffer Dall Cc: David Vrabel , Wei Liu , Ian Campbell , Stefano Stabellini , xen-devel@lists.xen.org List-Id: xen-devel@lists.xenproject.org On Mon, Sep 28, 2015 at 10:53:33PM +0200, Christoffer Dall wrote: > On Mon, Sep 14, 2015 at 5:20 PM, Ian Campbell > wrote: > > > On Mon, 2015-09-14 at 14:40 +0200, Christoffer Dall wrote: > > > On Fri, Jul 31, 2015 at 03:17:56PM +0200, Christoffer Dall wrote: > > > > On Fri, Jul 31, 2015 at 12:28 PM, David Vrabel < > > david.vrabel@citrix.com > > > > > > > > > wrote: > > > > > > > > > On 31/07/15 11:24, Stefano Stabellini wrote: > > > > > > This is a Linux Dom0 crash on x86 (Dell PowerEdge R320, Xeon E5 > > > > > > -2450), > > > > > > CC'ing relevant people. As you can see from the links below the > > > > > > crash > > > > > > is: > > > > > > > > > > > > [ 253.619326] Call Trace: > > > > > > [ 253.619330] > > > > > > [ 253.619332] [] ? 
skb_copy_ubufs+0xa5/0x230 > > > > > > [ 253.619347] [] > > > > > > __netif_receive_skb_core+0x6f5/0x940 > > > > > > [ 253.619353] [] __netif_receive_skb+0x18/0x60 > > > > > > [ 253.619360] [] > > > > > > netif_receive_skb_internal+0x28/0x90 > > > > > > [ 253.619366] [] napi_gro_frags+0x125/0x1a0 > > > > > > [ 253.619378] [] > > > > > > mlx4_en_process_rx_cq+0x753/0xb50 > > > > > [mlx4_en] > > > > > > [ 253.619387] [] mlx4_en_poll_rx_cq+0x97/0x160 > > > > > [mlx4_en] > > > > > > > > > > What makes you think this is Xen specific? I suggest raising this > > > > > the > > > > > the mlx4 maintainers. > > > > > > > > > > > > > > Linux native and KVM guests (same hw, same kernel version+config) run > > > > just > > > > fine under the same workload. > > > > > > > Ping? > > > > > > From the fact that bare-metal and KVM works fine with this hardware I > > > still think it's reasonable to assume that it's a Xen issue and not a > > > mlx4 issue. > > > > > > Is this completely flawed? > > > > My (somewhat educated) guess is that this is to do with the difference > > between (pseudo-)physical addresses and machine (AKA real-physical) > > addresses when running under Xen. > > > > The way this often shows up is in drivers which do not make correct use of > > the kernels DMA APIs but which happen to work on native x86 because > > physical==bus address on x86. > > > > Sometimes booting natively with 'iommu=soft swiotlb=force' can expose these > > sorts of issues. > > > > Indeed it does, on both v4.0 and v4.3-rc2. Yeeey! > > > > > > You are running 64-bit so I don't think the recent "config: Enable > > NEED_DMA_MAP_STATE by default when SWIOTLB is selected" is likely to be > > relevant (it's already unconditionally on for 64-bit). > > > > The trace appears to be on rx from a physical nic, there shouldn't be any > > magic Xen stuff (granted pages etc) getting themselves into that path at > > all. If it were tx then maybe it might be an issue with foreign pages. 
In > > any case I think you are able to repro with just dom0, i.e. never having > > started a domU, is that right? > > > > > Yes, I can reproduce on Dom0. > > I will send this to the Mellanox people. Thank you :-) Though please do keep us (or at least me) on CC; this is an interesting bug. > > Thanks, > -Christoffer