Linux RDMA and InfiniBand development
 help / color / mirror / Atom feed
From: "lizhijian@fujitsu.com" <lizhijian@fujitsu.com>
To: Bob Pearson <rpearsonhpe@gmail.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	Zhu Yanjun <zyjzyj2000@gmail.com>,
	"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
	"Hack, Jenny (Ft. Collins)" <jhack@hpe.com>,
	Frank Zago <frank.zago@hpe.com>
Subject: Re: [RFC] Alternative design for fast register physical memory
Date: Thu, 26 May 2022 06:01:34 +0000	[thread overview]
Message-ID: <85517938-db53-3044-0993-812510676718@fujitsu.com> (raw)
In-Reply-To: <78918262-6060-546b-134d-2d29bbefb349@gmail.com>



On 25/05/2022 06:28, Bob Pearson wrote:
> Jason,
>
> There has been a lot of chatter on the FMRs in rxe recently and I am also trying to help out at home with
> the folks who are trying to run Lustre on rxe. The last fix for this was to split the state in an FMR into
> two with separate rkeys and memory maps so that apps can pipeline the preparation of IO and doing IO.
>
> However, I am convinced that the current design only works by accident when it works. The thing that really
> makes a hash of it is retries. Unfortunately the documentation on all this is almost non existent. Lustre
> (actually o2iblnd) makes heavy use of FMRs and typically has several different MRs in flight in the send queue
> with a mixture of local and remote writes accessing these MRs interleaved with REG_MRs and INVALIDATE_MR local
> work requests. When a packet gets dropped from a WQE deep in the send queue the result is nothing works at all.
>
> We have a work around by fencing all the local operations which more or less works but will have bad performance.
> The maps used in FMRs have fairly short lifetimes but definitely longer than we we can support today. I am
> trying to work out the semantics of everything.
>
> IBA view of FMRs:
>
> verb: ib_alloc_mr(pd, max_num_sg)			- creates empty MR object
> 	roughly Allocate L_Key
>
> verb: ib_dereg_mr(mr)					- destroys MR object
>
> verb: ib_map_mr_sg(mr, sg, sg_nents, sg_offset)		- builds a map for MR
> 	roughly (Re)Register Physical Memory Region
>
> verb: ib_update_fast_reg_key(mr, newkey)		- update key portion of l/rkey
>
> send wr: IB_WR_REG_MR(mr, key)				- moves MR from FREE to VALID and updates
> 	roughly Fast Register Physical Memory Region	  key portion of l/rkey to key
>
> send_wr: IB_WR_LOCAL_INV(invalidate_rkey)		- invalidate local MR moves MR to FREE
>
> send_wr: IB_SEND_WITH_INV(invalidate_rkey)		- invalidate remote MR moves MR to FREE
>
>
> To make this all recoverable in the face of errors let there be more than one map present for an
> FMR indexed by the key portion of the l/rkeys.
>
> Alternative view of FMRs:
>
> verb: ib_alloc_mr(pd, max_num_sg)			- create an empty MR object with no maps
> 							  with l/rkey = [index, key] with index
> 							  fixed and key some initial value.
>
> verb: ib_update_fast_reg_key(mr, newkey)		- update key portion of l/rkey
>
> verb: ib_map_mr_sg(mr, sg, sg_nents, sg_offset)		- create a new map from allocated memory
> 							  or by re-using an INVALID map. Maps are
> 							  all the same size (max_num_sg). The
> 							  key (index) of this map is the current
> 							  key from l/rkey. The initial state of
> 							  the map is FREE. (and thus not usable
> 							  until a REG_MR work request is used.)
>
> verb: ib_dereg_mr(mr)					- free all maps and the MR.
>
> send_wr: IB_WR_REG_MR(mr, key)				- Find mr->map[key] and change its state
> 							  to VALID. Associate QP with map since
> 							  it will be hard to manage multiple QPs
> 							  trying to use the same MR at the same
> 							  time with differing states. Fail if the
> 							  map is not FREE. A map with that key must
> 							  have been created by ib_map_mr_sg with
> 							  the same key previously. Check the current
> 							  number of VALID maps and if this exceeds
> 							  a limit pause the send queue until there
> 							  is room to reg another MR.
>
> send_wr: IB_WR_LOCAL_INV (execute)			- Lookup a map with the same index and key
> 							  Change its state to FREE and dissociate
> 							  from QP. Leave map contents the same.
> 			 (complete)			- When the local invalidate operation is
> 							  completed (after all previous send queue WQEs
> 							  have completed) change its state to INVALID
> 							  and place map resources on a free list or
> 							  free memory.
>
> send_wr: IB_SEND_WITH_INV				- same except at remote end.
>
> retry:							- if a retry occurs for a send queue. Back up
> 							  the requester to the first incomplete PSN.
> 							  Change the state of all maps which were
> 							  VALID at that PSN back to VALID. This will
> 							  require maintaining a list of valid maps
> 							  at the boundary of completed and un-completed
> 							  WQEs.
>
> Arrival of RDMA packet					  Lookup MR from index and map from key and if
> 							  the state is VALID carry out the memory copy.
>
>
> This is an improvement over the current state. At the moment we have only two maps one for making new
> ones and one for doing IO. There is no room to back up but at the moment the retry logic assumes that
> you can which is false. This can be fixed easily by forcing all local operations to be fenced
> which is what we are doing at the moment at HPE. This can insert long delays between every new FMR instance.
> By allowing three maps and then fencing we can back up one broken IO operation without too much of a delay.

Hi Bob

I thought i have almost understood all your approach expect the *retry/back up* part in where i have not have a full imagination.
It sounds good to me.
But i think the *retry* should be a new feature to our existing bug reports about FMRs where they all were trying to fix.
https://lore.kernel.org/all/20220210073655.42281-1-guoqing.jiang@linux.dev/T/
https://lore.kernel.org/all/dfba7eb7-8467-59b5-2c2a-071ed1e4949f@gmail.com/T/
https://lore.kernel.org/lkml/94a5ea93-b8bb-3a01-9497-e2021f29598a@linux.dev/t/

I'm convinced that this approach can help on this bug, shall we focus on fixing the above known FMRs bug first, and then improve the *retry* feature.

Thanks
Zhijian


>
> Even if you have a clean network the current design of the retransmit timer which is never cleared and which
> can fire frequently can make a mess of MB sized IOs used for storage.
>
> Bob

  reply	other threads:[~2022-05-26  6:01 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-05-24 22:28 [RFC] Alternative design for fast register physical memory Bob Pearson
2022-05-26  6:01 ` lizhijian [this message]
2022-05-27 12:42 ` Jason Gunthorpe
2022-06-08 11:23   ` lizhijian

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=85517938-db53-3044-0993-812510676718@fujitsu.com \
    --to=lizhijian@fujitsu.com \
    --cc=frank.zago@hpe.com \
    --cc=jgg@nvidia.com \
    --cc=jhack@hpe.com \
    --cc=linux-rdma@vger.kernel.org \
    --cc=rpearsonhpe@gmail.com \
    --cc=zyjzyj2000@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox