From: Zhu Yanjun <yanjun.zhu@linux.dev>
To: "Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com>,
Yi Zhang <yi.zhang@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>,
"Daisuke Matsuda (Fujitsu)" <matsuda-daisuke@fujitsu.com>,
Zhu Yanjun <yanjun.zhu@intel.com>,
"leon@kernel.org" <leon@kernel.org>,
"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
"zyjzyj2000@gmail.com" <zyjzyj2000@gmail.com>,
Bart Van Assche <bvanassche@acm.org>
Subject: Re: [PATCH 1/1] RDMA/rxe: Fix blktests srp lead kernel panic with 64k page size
Date: Tue, 31 Oct 2023 09:36:46 +0800 [thread overview]
Message-ID: <4a19ae90-cc0f-4ace-862c-1251b6d98c3b@linux.dev> (raw)
In-Reply-To: <8f705223-6fde-4b29-880b-570349f40db8@fujitsu.com>
在 2023/10/26 17:05, Zhijian Li (Fujitsu) 写道:
> The root cause is that
>
> rxe:rxe_set_page() gets wrong when mr.page_size != PAGE_SIZE where it only stores the *page to xarray.
> So the offset will get lost.
It is interesting root cause.
page_size is 0x1000, PAGE_SIZE 0x10000.
Zhu Yanjun
>
> For example,
> store process:
> page_size = 0x1000;
> PAGE_SIZE = 0x10000;
> va0 = 0xffff000020651000;
> page_offset = 0 = va & (page_size - 1);
> page = va_to_page(va);
> xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL);
>
> load_process:
> page = xa_load(&mr->page_list, index);
> page_va = kmap_local_page(page) --> it must be a PAGE_SIZE align value, assume it as 0xffff000020650000
> va1 = page_va + page_offset = 0xffff000020650000 + 0 = 0xffff000020650000;
>
> Obviously, *va0 != va1*, page_offset get lost.
>
>
> How to fix:
> - revert 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c")
> - don't allow ulp registering mr.page_size != PAGE_SIZE ?
>
>
>
> Thanks
> Zhijian
>
>
> On 24/10/2023 17:13, Li Zhijian wrote:
>>
>>
>> On 24/10/2023 16:15, Zhijian Li (Fujitsu) wrote:
>>>
>>>
>>> On 23/10/2023 18:45, Yi Zhang wrote:
>>>> On Mon, Oct 23, 2023 at 11:52 AM Zhijian Li (Fujitsu)
>>>> <lizhijian@fujitsu.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 20/10/2023 22:01, Jason Gunthorpe wrote:
>>>>>> On Fri, Oct 20, 2023 at 03:47:05AM +0000, Zhijian Li (Fujitsu) wrote:
>>>>>>> CC Bart
>>>>>>>
>>>>>>> On 13/10/2023 20:01, Daisuke Matsuda (Fujitsu) wrote:
>>>>>>>> On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote:
>>>>>>>>> From: Zhu Yanjun<yanjun.zhu@linux.dev>
>>>>>>>>>
>>>>>>>>> The page_size of mr is set in infiniband core originally. In the commit
>>>>>>>>> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the
>>>>>>>>> page_size is also set. Sometime this will cause conflict.
>>>>>>>> I appreciate your prompt action, but I do not think this commit deals with
>>>>>>>> the root cause. I agree that the problem lies in rxe driver, but what is wrong
>>>>>>>> with assigning actual page size to ibmr.page_size?
>>>>>>>>
>>>>>>>> IMO, the problem comes from the device attribute of rxe driver, which is used
>>>>>>>> in ulp/srp layer to calculate the page_size.
>>>>>>>> =====
>>>>>>>> static int srp_add_one(struct ib_device *device)
>>>>>>>> {
>>>>>>>> struct srp_device *srp_dev;
>>>>>>>> struct ib_device_attr *attr = &device->attrs;
>>>>>>>> <...>
>>>>>>>> /*
>>>>>>>> * Use the smallest page size supported by the HCA, down to a
>>>>>>>> * minimum of 4096 bytes. We're unlikely to build large sglists
>>>>>>>> * out of smaller entries.
>>>>>>>> */
>>>>>>>> mr_page_shift = max(12, ffs(attr->page_size_cap) - 1);
>>>>>>>
>>>>>>>
>>>>>>> You light me up.
>>>>>>> RXE provides attr.page_size_cap(RXE_PAGE_SIZE_CAP) which means it can support 4K-2G page size
>>>>>>
>>>>>> That doesn't seem right even in concept.>
>>>>>> I think the multi-size support in the new xarray code does not work
>>>>>> right, just looking at it makes me think it does not work right. It
>>>>>> looks like it can do less than PAGE_SIZE but more than PAGE_SIZE will
>>>>>> explode because kmap_local_page() does only 4K.
>>>>>>
>>>>>> If RXE_PAGE_SIZE_CAP == PAGE_SIZE will everything work?
>>>>>>
>>>>>
>>>>> Yeah, this should work(even though i only verified hardcoding mr_page_shift to PAGE_SHIFT).
>>>>
>>>> Hi Zhijian
>>>>
>>>> Did you try blktests nvme/rdma use_rxe on your environment, it still
>>>> failed on my side.
>>>>
>>>
>>> Thanks for your testing.
>>> I just did that, it also failed in my environment. After a glance debug, I found there are
>>> other places still registered 4K MR. I will dig into it later.
>>
>>
>> nvme intend to register 4K mr, but it should work IMO, at least the RXE have handled it correctly.
>>
>>
>> 1293 static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
>> 1294 struct nvme_rdma_request *req, struct nvme_command *c,
>> 1295 int count)
>> 1296 {
>> 1297 struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
>> 1298 int nr;
>> 1299
>> 1300 req->mr = ib_mr_pool_get(queue->qp, &queue->qp->rdma_mrs);
>> 1301 if (WARN_ON_ONCE(!req->mr))
>> 1302 return -EAGAIN;
>> 1303
>> 1304 /*
>> 1305 * Align the MR to a 4K page size to match the ctrl page size and
>> 1306 * the block virtual boundary.
>> 1307 */
>> 1308 nr = ib_map_mr_sg(req->mr, req->data_sgl.sg_table.sgl, count, NULL,
>> 1309 SZ_4K);
>>
>>
>> Anyway, i will go ahead. if you have any thought, please let me know.
>>
>>
>>
>>>
>>> Thanks
>>> Zhijian
>>>
>>>
>>>
>>>
>>>> # use_rxe=1 nvme_trtype=rdma ./check nvme/003
>>>> nvme/003 (test if we're sending keep-alives to a discovery controller) [failed]
>>>> runtime 12.179s ... 11.941s
>>>> --- tests/nvme/003.out 2023-10-22 10:54:43.041749537 -0400
>>>> +++ /root/blktests/results/nodev/nvme/003.out.bad 2023-10-23
>>>> 05:52:27.882759168 -0400
>>>> @@ -1,3 +1,3 @@
>>>> Running nvme/003
>>>> -NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 1 controller(s)
>>>> +NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 0 controller(s)
>>>> Test complete
>>>>
>>>> [ 7033.431910] rdma_rxe: loaded
>>>> [ 7033.456341] run blktests nvme/003 at 2023-10-23 05:52:15
>>>> [ 7033.502306] (null): rxe_set_mtu: Set mtu to 1024
>>>> [ 7033.510969] infiniband enP2p1s0v0_rxe: set active
>>>> [ 7033.510980] infiniband enP2p1s0v0_rxe: added enP2p1s0v0
>>>> [ 7033.549301] loop0: detected capacity change from 0 to 2097152
>>>> [ 7033.556966] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>>>> [ 7033.566711] nvmet_rdma: enabling port 0 (10.19.240.81:4420)
>>>> [ 7033.588605] nvmet: connect attempt for invalid controller ID 0x808
>>>> [ 7033.594909] nvme nvme0: Connect Invalid Data Parameter, cntlid: 65535
>>>> [ 7033.601504] nvme nvme0: failed to connect queue: 0 ret=16770
>>>> [ 7046.317861] rdma_rxe: unloaded
>>>>
>>>>
>>>>>
>>>>>>>> import ctypes
>>>>>>>> libc = ctypes.cdll.LoadLibrary('libc.so.6')
>>>>>>>> hex(65536)
>>>>> '0x10000'
>>>>>>>> libc.ffs(0x10000) - 1
>>>>> 16
>>>>>>>> 1 << 16
>>>>> 65536
>>>>>
>>>>> so
>>>>> mr_page_shift = max(12, ffs(attr->page_size_cap) - 1) = max(12, 16) = 16;
>>>>>
>>>>>
>>>>> So I think Daisuke's patch should work as well.
>>>>>
>>>>> https://lore.kernel.org/linux-rdma/OS3PR01MB98652B2EC2E85DAEC6DDE484E5D2A@OS3PR01MB9865.jpnprd01.prod.outlook.com/T/#md133060414f0ba6a3dbaf7b4ad2374c8a347cfd1
>>>>>
>>>>>
>>>>>> Jason
>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Yi Zhang
next prev parent reply other threads:[~2023-10-31 1:37 UTC|newest]
Thread overview: 32+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-10-13 1:18 [PATCH 1/1] RDMA/rxe: Fix blktests srp lead kernel panic with 64k page size Zhu Yanjun
2023-10-13 12:01 ` Daisuke Matsuda (Fujitsu)
2023-10-13 12:28 ` Zhu Yanjun
2023-10-13 13:01 ` Daisuke Matsuda (Fujitsu)
2023-10-13 13:44 ` Rain River
2023-10-16 6:07 ` Daisuke Matsuda (Fujitsu)
2023-10-18 8:34 ` Zhu Yanjun
2023-10-20 3:47 ` Zhijian Li (Fujitsu)
2023-10-20 6:54 ` Zhijian Li (Fujitsu)
2023-10-20 16:21 ` Bart Van Assche
2023-10-23 0:58 ` Zhijian Li (Fujitsu)
2023-10-20 14:01 ` Jason Gunthorpe
2023-10-23 3:52 ` Zhijian Li (Fujitsu)
2023-10-23 6:08 ` Zhu Yanjun
2023-10-23 10:45 ` Yi Zhang
2023-10-24 8:15 ` Zhijian Li (Fujitsu)
2023-10-24 9:13 ` Zhijian Li (Fujitsu)
2023-10-26 9:05 ` Zhijian Li (Fujitsu)
2023-10-26 11:42 ` Jason Gunthorpe
2023-10-26 12:59 ` Zhu Yanjun
2023-10-26 23:23 ` Jason Gunthorpe
2023-10-27 1:36 ` Zhu Yanjun
2023-10-27 4:01 ` Zhu Yanjun
2023-10-27 11:51 ` Jason Gunthorpe
2023-10-26 13:28 ` Bart Van Assche
2023-10-26 13:43 ` Jason Gunthorpe
2023-10-26 21:47 ` Bart Van Assche
2023-10-27 1:26 ` Daisuke Matsuda (Fujitsu)
2023-10-27 1:39 ` Zhu Yanjun
2023-10-27 5:43 ` Zhijian Li (Fujitsu)
2023-10-31 1:36 ` Zhu Yanjun [this message]
[not found] ` <CAEz=LcuLCe7bhUohh6BcHdJ1_ocJdZq=eu07vWb3Md5_ZOGDBg@mail.gmail.com>
[not found] ` <CAEz=LcuQ6fFpHqBPT1oTUgKABAHFJqYDC-AHidE-+n6OtzmCPQ@mail.gmail.com>
2023-10-31 8:14 ` Greg Sword
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4a19ae90-cc0f-4ace-862c-1251b6d98c3b@linux.dev \
--to=yanjun.zhu@linux.dev \
--cc=bvanassche@acm.org \
--cc=jgg@ziepe.ca \
--cc=leon@kernel.org \
--cc=linux-rdma@vger.kernel.org \
--cc=lizhijian@fujitsu.com \
--cc=matsuda-daisuke@fujitsu.com \
--cc=yanjun.zhu@intel.com \
--cc=yi.zhang@redhat.com \
--cc=zyjzyj2000@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox