From: "lizhijian@fujitsu.com" <lizhijian@fujitsu.com>
To: Bob Pearson <rpearsonhpe@gmail.com>,
Jason Gunthorpe <jgg@nvidia.com>,
Zhu Yanjun <zyjzyj2000@gmail.com>,
"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
"Hack, Jenny (Ft. Collins)" <jhack@hpe.com>,
Frank Zago <frank.zago@hpe.com>
Subject: Re: [RFC] Alternative design for fast register physical memory
Date: Thu, 26 May 2022 06:01:34 +0000 [thread overview]
Message-ID: <85517938-db53-3044-0993-812510676718@fujitsu.com> (raw)
In-Reply-To: <78918262-6060-546b-134d-2d29bbefb349@gmail.com>
On 25/05/2022 06:28, Bob Pearson wrote:
> Jason,
>
> There has been a lot of chatter on the FMRs in rxe recently and I am also trying to help out at home with
> the folks who are trying to run Lustre on rxe. The last fix for this was to split the state in an FMR into
> two with separate rkeys and memory maps so that apps can pipeline the preparation of IO and doing IO.
>
> However, I am convinced that the current design only works by accident when it works. The thing that really
> makes a hash of it is retries. Unfortunately the documentation on all this is almost non existent. Lustre
> (actually o2iblnd) makes heavy use of FMRs and typically has several different MRs in flight in the send queue
> with a mixture of local and remote writes accessing these MRs interleaved with REG_MRs and INVALIDATE_MR local
> work requests. When a packet gets dropped from a WQE deep in the send queue the result is nothing works at all.
>
> We have a work around by fencing all the local operations which more or less works but will have bad performance.
> The maps used in FMRs have fairly short lifetimes but definitely longer than we we can support today. I am
> trying to work out the semantics of everything.
>
> IBA view of FMRs:
>
> verb: ib_alloc_mr(pd, max_num_sg) - creates empty MR object
> roughly Allocate L_Key
>
> verb: ib_dereg_mr(mr) - destroys MR object
>
> verb: ib_map_mr_sg(mr, sg, sg_nents, sg_offset) - builds a map for MR
> roughly (Re)Register Physical Memory Region
>
> verb: ib_update_fast_reg_key(mr, newkey) - update key portion of l/rkey
>
> send wr: IB_WR_REG_MR(mr, key) - moves MR from FREE to VALID and updates
> roughly Fast Register Physical Memory Region key portion of l/rkey to key
>
> send_wr: IB_WR_LOCAL_INV(invalidate_rkey) - invalidate local MR moves MR to FREE
>
> send_wr: IB_SEND_WITH_INV(invalidate_rkey) - invalidate remote MR moves MR to FREE
>
>
> To make this all recoverable in the face of errors let there be more than one map present for an
> FMR indexed by the key portion of the l/rkeys.
>
> Alternative view of FMRs:
>
> verb: ib_alloc_mr(pd, max_num_sg) - create an empty MR object with no maps
> with l/rkey = [index, key] with index
> fixed and key some initial value.
>
> verb: ib_update_fast_reg_key(mr, newkey) - update key portion of l/rkey
>
> verb: ib_map_mr_sg(mr, sg, sg_nents, sg_offset) - create a new map from allocated memory
> or by re-using an INVALID map. Maps are
> all the same size (max_num_sg). The
> key (index) of this map is the current
> key from l/rkey. The initial state of
> the map is FREE. (and thus not usable
> until a REG_MR work request is used.)
>
> verb: ib_dereg_mr(mr) - free all maps and the MR.
>
> send_wr: IB_WR_REG_MR(mr, key) - Find mr->map[key] and change its state
> to VALID. Associate QP with map since
> it will be hard to manage multiple QPs
> trying to use the same MR at the same
> time with differing states. Fail if the
> map is not FREE. A map with that key must
> have been created by ib_map_mr_sg with
> the same key previously. Check the current
> number of VALID maps and if this exceeds
> a limit pause the send queue until there
> is room to reg another MR.
>
> send_wr: IB_WR_LOCAL_INV (execute) - Lookup a map with the same index and key
> Change its state to FREE and dissociate
> from QP. Leave map contents the same.
> (complete) - When the local invalidate operation is
> completed (after all previous send queue WQEs
> have completed) change its state to INVALID
> and place map resources on a free list or
> free memory.
>
> send_wr: IB_SEND_WITH_INV - same except at remote end.
>
> retry: - if a retry occurs for a send queue. Back up
> the requester to the first incomplete PSN.
> Change the state of all maps which were
> VALID at that PSN back to VALID. This will
> require maintaining a list of valid maps
> at the boundary of completed and un-completed
> WQEs.
>
> Arrival of RDMA packet Lookup MR from index and map from key and if
> the state is VALID carry out the memory copy.
>
>
> This is an improvement over the current state. At the moment we have only two maps one for making new
> ones and one for doing IO. There is no room to back up but at the moment the retry logic assumes that
> you can which is false. This can be fixed easily by forcing all local operations to be fenced
> which is what we are doing at the moment at HPE. This can insert long delays between every new FMR instance.
> By allowing three maps and then fencing we can back up one broken IO operation without too much of a delay.
Hi Bob
I thought i have almost understood all your approach expect the *retry/back up* part in where i have not have a full imagination.
It sounds good to me.
But i think the *retry* should be a new feature to our existing bug reports about FMRs where they all were trying to fix.
https://lore.kernel.org/all/20220210073655.42281-1-guoqing.jiang@linux.dev/T/
https://lore.kernel.org/all/dfba7eb7-8467-59b5-2c2a-071ed1e4949f@gmail.com/T/
https://lore.kernel.org/lkml/94a5ea93-b8bb-3a01-9497-e2021f29598a@linux.dev/t/
I'm convinced that this approach can help on this bug, shall we focus on fixing the above known FMRs bug first, and then improve the *retry* feature.
Thanks
Zhijian
>
> Even if you have a clean network the current design of the retransmit timer which is never cleared and which
> can fire frequently can make a mess of MB sized IOs used for storage.
>
> Bob
next prev parent reply other threads:[~2022-05-26 6:01 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-05-24 22:28 [RFC] Alternative design for fast register physical memory Bob Pearson
2022-05-26 6:01 ` lizhijian [this message]
2022-05-27 12:42 ` Jason Gunthorpe
2022-06-08 11:23 ` lizhijian
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=85517938-db53-3044-0993-812510676718@fujitsu.com \
--to=lizhijian@fujitsu.com \
--cc=frank.zago@hpe.com \
--cc=jgg@nvidia.com \
--cc=jhack@hpe.com \
--cc=linux-rdma@vger.kernel.org \
--cc=rpearsonhpe@gmail.com \
--cc=zyjzyj2000@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox