From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: Martin Cerveny <M.Cerveny@computer.org>
Cc: Gordan Bobic <gordan@bobich.net>, xen-devel@lists.xen.org
Subject: Re: Some trouble to use NVIDIA CUDA with Xen
Date: Thu, 15 Aug 2013 09:19:04 -0400
Message-ID: <20130815131904.GB3545@konrad-lan.dumpdata.com>
In-Reply-To: <alpine.GSO.2.00.1308142336360.17540@dmz.c-home.cz>
On Thu, Aug 15, 2013 at 12:21:41AM +0200, Martin Cerveny wrote:
> Hello.
>
> Partial SUCCESS!
>
> On Tue, 13 Aug 2013, Konrad Rzeszutek Wilk wrote:
> >>>>>>NVRM: PAT configuration unsupported.
> >>>Right, so there are couple of patches that can enable that back.
> >>>
> >>>You need to revert these two:
> >>>8eaffa67b43e99ae581622c5133e20b0f48bcef1
> >>>c79c49826270b8b0061b2fca840fc3f013c8a78a
> >>>
> >>>And apply this patch:
> >>>
> >>>https://lkml.org/lkml/2012/2/10/229
> >>>
> >>>That should re-enable PAT. Try that and please report back.
> >>
> >>I applied the patch to 3.9.11-200.PAT.fc18.x86_64 (3.10 is not
> >>working due to incompatibilities with nvidia driver source code).
> >
> >Did you revert the other two git commits?
>
> Yes, double-checked (the combined patch, for rpmbuild/SOURCE/, is in
> the attachment; the rpmbuild patch too).
Looks correct.
>
> # rdmsr 0x277
> 50100070406
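
The rdmsr value above can be decoded by hand: IA32_PAT (MSR 0x277) holds eight one-byte entries, PA0 in the lowest byte, and each entry's low three bits select a memory type. Below is a minimal sketch of that decoding in C. The function names are made up for illustration and are not taken from the kernel or the nvidia driver; the -1 "not found" return value is likewise a choice of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Memory-type encodings used in IA32_PAT entries (Intel SDM). */
enum pat_type {
    PAT_UC       = 0x00,  /* uncacheable */
    PAT_WC       = 0x01,  /* write-combining */
    PAT_WT       = 0x04,  /* write-through */
    PAT_WP       = 0x05,  /* write-protected */
    PAT_WB       = 0x06,  /* write-back */
    PAT_UC_MINUS = 0x07   /* UC- */
};

/* Entry i lives in byte i of the MSR; only the low 3 bits are defined. */
unsigned pat_entry(uint64_t pat_msr, unsigned index)
{
    assert(index < 8);
    return (unsigned)((pat_msr >> (8 * index)) & 0x07);
}

/* Index of the first WC entry, or -1 if no entry is WC (hypothetical helper). */
int pat_find_wc_index(uint64_t pat_msr)
{
    for (unsigned i = 0; i < 8; i++)
        if (pat_entry(pat_msr, i) == PAT_WC)
            return (int)i;
    return -1;
}

/* Usage: the value reported by "rdmsr 0x277" in this thread. */
void pat_self_check(void)
{
    assert(pat_find_wc_index(0x50100070406ULL) == 4);
}
```

Read this way, 0x50100070406 decodes as PA0=WB, PA1=WT, PA2=UC-, PA3=UC, PA4=WC, PA5=WP, PA6=UC, PA7=UC, so WC sits at index 4 rather than index 1.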
>
> I looked at the nvidia source code.
>
> The error is on the nvidia side:
>
> snip from /usr/src/nvidia-319.37/nv-pat.c
> =================
> ....
> #if defined(HAVE_NV_XEN) && defined(CONFIG_XEN) && defined(CONFIG_PARAVIRT)
>     if (PAT_WC_index == 4)
>         return NV_PAT_MODE_KERNEL;
> #endif
>
>     if (PAT_WC_index == 1)
>         return NV_PAT_MODE_KERNEL;
>     else if (PAT_WC_index != 0xf)
>     {
>         nv_printf(NV_DBG_ERRORS,
>             "NVRM: PAT configuration unsupported.\n");
>         return NV_PAT_MODE_DISABLED;
>     }
> ....
> ===================
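
To make the effect of the preprocessor guard in the excerpt concrete, here is a small standalone model of the same branch structure. It is only a sketch: the enum and function names are hypothetical, a boolean parameter stands in for whether the HAVE_NV_XEN block was compiled in, and the path taken when PAT_WC_index equals 0xf is not shown in the excerpt, so it is modelled as an explicit "not shown" result rather than guessed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; only the branch structure mirrors the excerpt. */
enum pat_mode_sketch {
    SKETCH_MODE_KERNEL,    /* stands in for NV_PAT_MODE_KERNEL */
    SKETCH_MODE_DISABLED,  /* stands in for NV_PAT_MODE_DISABLED */
    SKETCH_MODE_NOT_SHOWN  /* path elided ("....") in the excerpt */
};

enum pat_mode_sketch
pat_mode_for_wc_index(unsigned wc_index, bool xen_branch_compiled)
{
    /* Models the #if defined(HAVE_NV_XEN) && ... block. */
    if (xen_branch_compiled && wc_index == 4)
        return SKETCH_MODE_KERNEL;

    if (wc_index == 1)
        return SKETCH_MODE_KERNEL;
    else if (wc_index != 0xf)
        return SKETCH_MODE_DISABLED;  /* "PAT configuration unsupported." */

    return SKETCH_MODE_NOT_SHOWN;
}

/* Usage: WC at index 4 is accepted only when the Xen block is compiled in. */
void pat_mode_self_check(void)
{
    assert(pat_mode_for_wc_index(4, false) == SKETCH_MODE_DISABLED);
    assert(pat_mode_for_wc_index(4, true) == SKETCH_MODE_KERNEL);
}
```

With WC at index 4 and the guarded block compiled out, the model falls through to the "unsupported" branch, which matches the NVRM message reported earlier in the thread.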
>
> HAVE_NV_XEN is NOT defined.
>
> HAVE_NV_XEN is defined only if "nv-xen.h" is present (tested in
> /usr/src/nvidia-319.37/conftest.h), and it seems to have been removed
> from the distributed source (around nvidia driver versions 19x.x.x).
>
> OK, I downloaded an older version of "nv-xen.h" from the net to
Do you know what it contains? Perhaps there are some oddities in there?
> /usr/src/nvidia-319.37/nv-xen.h and recompiled the driver
> ("cd /usr/src/nvidia-319.37; make clean module; rmmod nvidia;
> cp nvidia.ko /lib/modules/3.9.11-200.PAT.fc18.x86_64/extra;
> modprobe nvidia").
>
> The error "NVRM: PAT configuration unsupported." is no longer shown (as expected).
>
> Most CUDA demo programs WORK without error!!!
Nice.
>
> But some programs hang PCIe and the kernel:
>
> [55799.433278] BUG: Bad rss-counter state mm:ffff8800723e0000 idx:1 val:21
> [55800.139090] abrt-handle-eve[10175]: segfault at 18 ip 0000003f20ebb6d3 sp 00007fffa7e6ef00 error 4 in libc-2.16.so[3f20e00000+1ad000]
> [55800.375196] BUG: Bad rss-counter state mm:ffff8800723e2680 idx:1 val:5
> [55845.124636] BUG: Bad rss-counter state mm:ffff8800723e0000 idx:1 val:8
> [55962.186275] BUG: Bad rss-counter state mm:ffff880074a27800 idx:0 val:5
> [55962.192811] BUG: Bad rss-counter state mm:ffff880074a27800 idx:1 val:795
> [55962.262019] traps: abrt-handle-eve[10287] general protection ip:3f20ebb7a6 sp:7fffbd613410 error:0 in libc-2.16.so[3f20e00000+1ad000]
> [55962.394789] BUG: Bad rss-counter state mm:ffff8800723e0380 idx:1 val:13
That and the errors above imply that the nvidia driver is not doing
a good job of converting the WC pages back to WB, so when they
go back to the general pool of memory they still have the WC bit
set. Which is really, really bad.
I presume there was some code that did the 'mark_WC' and then
'unmark_WC' (or mark_WB), or perhaps set_pages_wc and set_pages_wb.
(The set_pages_wb and set_pages_wc fix is the one in the pageattr.c file.
You could also add a printk to the code there to make sure that
it is indeed working correctly - or use this little module:
http://xenbits.xen.org/gitweb/?p=xentesttools/bootstrap.git;a=blob;f=root_image/drivers/wb_to_wc/wb_to_wc.c;h=cd2439ac103150229f14f732a9a7a271ca6f397e;hb=HEAD
to double check that it is working correctly).
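
The failure mode described here, a page going back to the general pool with its WC attribute still set, can be illustrated with a toy userspace model. This is not kernel code and does not call set_pages_wc or set_pages_wb; every name below (toy_pool, toy_alloc, toy_free) is invented for the illustration, and the "attribute" is just a field in a struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum cache_attr { ATTR_WB = 0, ATTR_WC = 1 };

#define POOL_PAGES 4

struct toy_page {
    enum cache_attr attr;  /* caching attribute currently on the page */
    bool in_use;
};

/* A zero-initialized pool has every page free and marked WB. */
struct toy_pool {
    struct toy_page pages[POOL_PAGES];
};

/* Hand out the first free page, or NULL if the pool is exhausted. */
struct toy_page *toy_alloc(struct toy_pool *pool)
{
    for (size_t i = 0; i < POOL_PAGES; i++) {
        if (!pool->pages[i].in_use) {
            pool->pages[i].in_use = true;
            return &pool->pages[i];
        }
    }
    return NULL;
}

/* Return a page to the pool; restore_wb == false models the buggy path. */
void toy_free(struct toy_page *page, bool restore_wb)
{
    if (restore_wb)
        page->attr = ATTR_WB;
    page->in_use = false;
}

void toy_self_check(void)
{
    struct toy_pool pool = {0};

    struct toy_page *gpu_buf = toy_alloc(&pool);
    gpu_buf->attr = ATTR_WC;        /* driver marks its buffer WC */
    toy_free(gpu_buf, false);       /* freed without converting back */

    struct toy_page *other = toy_alloc(&pool);  /* unrelated user */
    assert(other == gpu_buf);
    assert(other->attr == ATTR_WC); /* stale WC attribute leaked */
}
```

In the model the next, unrelated allocation receives the same page and inherits the stale WC attribute; passing restore_wb as true on the free path is the analogue of converting the page back to WB before releasing it.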
> [55981.779246] NVRM: GPU at 0000:02:00: GPU-fe328712-3546-53fe-149d-3d78e7aa64d5
> [55981.786391] NVRM: Xid (0000:02:00): 38, 0001 00000000 00000000 00000000 00000000 00000000
> [55982.407300] NVRM: GPU at 0000:02:00.0 has fallen off the bus.
Ha!
> [55982.425810] NVRM: os_pci_init_handle: invalid context!
> ....