From: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
To: Hannes Reinecke <hare-l3A5Bk7waGM@public.gmane.org>
Cc: fcoe-devel-s9riP+hp16TNLxjTenLetw@public.gmane.org,
"Curtis Taylor (cjt-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org)"
<cjt-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>,
Bud Brown <bubrown-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
Linux SCSI Mailinglist
<linux-scsi-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Subject: Re: Issue with fc_exch_alloc failing initiated by fc_queuecommand on NUMA or large configurations with Intel ixgbe running FCOE
Date: Sat, 8 Oct 2016 15:44:16 -0400 (EDT)
Message-ID: <1271455655.818631.1475955856691.JavaMail.zimbra@redhat.com>
In-Reply-To: <1360350390.815966.1475949181371.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
----- Original Message -----
> From: "Laurence Oberman" <loberman@redhat.com>
> To: "Hannes Reinecke" <hare@suse.de>
> Cc: "Linux SCSI Mailinglist" <linux-scsi@vger.kernel.org>, fcoe-devel@open-fcoe.org, "Curtis Taylor (cjt@us.ibm.com)"
> <cjt@us.ibm.com>, "Bud Brown" <bubrown@redhat.com>
> Sent: Saturday, October 8, 2016 1:53:01 PM
> Subject: Re: [Open-FCoE] Issue with fc_exch_alloc failing initiated by fc_queuecommand on NUMA or large
> configurations with Intel ixgbe running FCOE
>
>
>
> ----- Original Message -----
> > From: "Hannes Reinecke" <hare@suse.de>
> > To: "Laurence Oberman" <loberman@redhat.com>, "Linux SCSI Mailinglist"
> > <linux-scsi@vger.kernel.org>,
> > fcoe-devel@open-fcoe.org
> > Cc: "Curtis Taylor (cjt@us.ibm.com)" <cjt@us.ibm.com>, "Bud Brown"
> > <bubrown@redhat.com>
> > Sent: Saturday, October 8, 2016 1:35:19 PM
> > Subject: Re: [Open-FCoE] Issue with fc_exch_alloc failing initiated by
> > fc_queuecommand on NUMA or large
> > configurations with Intel ixgbe running FCOE
> >
> > On 10/08/2016 02:57 PM, Laurence Oberman wrote:
> > > Hello
> > >
> > > This has been a tough problem to chase down, but it was finally reproduced.
> > > The issue is apparent on RHEL kernels and upstream, so it seemed justified
> > > to report it here.
> > >
> > > It's out there, and some may not even be aware it's happening, other than
> > > seeing very slow performance using ixgbe and software FCoE on large
> > > configurations.
> > >
> > > The upstream kernel used for reproducing is 4.8.0.
> > >
> > > I/O performance was noted to be severely impacted on a large NUMA test system
> > > (64 CPUs, 4 NUMA nodes) running the software FCoE stack with Intel ixgbe
> > > interfaces.
> > > After capturing blktraces we saw that for every I/O there was at least one
> > > blk_requeue_request, and sometimes hundreds or more.
> > > This resulted in IOPS rates that were marginal at best, with queuing and high
> > > wait times.
> > > After narrowing this down with systemtap and trace-cmd, we added further
> > > debug and it became apparent this was due to SCSI_MLQUEUE_HOST_BUSY being
> > > returned.
> > > So I/O passes, but very slowly, as it is constantly having to be requeued.
> > >
> > > The identical configuration in our lab with a single NUMA node and 4 CPUs
> > > does not see this issue at all.
> > > The same large system that reproduces this was booted with numa=off and
> > > still sees the issue.
> > >
> > Have you tested with my FCoE fixes?
> > I've done quite a few fixes for libfc/fcoe, and it would be nice to see
> > how the patches behave with this setup.
> >
> > > The flow is as follows:
> > >
> > > From within fc_queuecommand:
> > > fc_fcp_pkt_send() calls fc_fcp_cmd_send(), which calls
> > > tt.exch_seq_send(), which calls fc_exch_seq_send().
> > >
> > > This fails and returns NULL from fc_exch_alloc(), as the list traversal never
> > > produces a match.
> > >
> > > static struct fc_seq *fc_exch_seq_send(struct fc_lport *lport,
> > >                                        struct fc_frame *fp,
> > >                                        void (*resp)(struct fc_seq *,
> > >                                                     struct fc_frame *fp,
> > >                                                     void *arg),
> > >                                        void (*destructor)(struct fc_seq *,
> > >                                                           void *),
> > >                                        void *arg, u32 timer_msec)
> > > {
> > >         struct fc_exch *ep;
> > >         struct fc_seq *sp = NULL;
> > >         struct fc_frame_header *fh;
> > >         struct fc_fcp_pkt *fsp = NULL;
> > >         int rc = 1;
> > >
> > >         ep = fc_exch_alloc(lport, fp);          ***** Called here and fails
> > >         if (!ep) {
> > >                 fc_frame_free(fp);
> > >                 printk("RHDEBUG: In fc_exch_seq_send returned NULL because !ep with ep = %p\n", ep);
> > >                 return NULL;
> > >         }
> > >         ..
> > >         ..
> > > }
> > >
> > >
> > > /**
> > >  * fc_exch_alloc() - Allocate an exchange from an EM on a
> > >  *                   local port's list of EMs.
> > >  * @lport: The local port that will own the exchange
> > >  * @fp:    The FC frame that the exchange will be for
> > >  *
> > >  * This function walks the list of exchange manager(EM)
> > >  * anchors to select an EM for a new exchange allocation. The
> > >  * EM is selected when a NULL match function pointer is encountered
> > >  * or when a call to a match function returns true.
> > >  */
> > > static inline struct fc_exch *fc_exch_alloc(struct fc_lport *lport,
> > >                                             struct fc_frame *fp)
> > > {
> > >         struct fc_exch_mgr_anchor *ema;
> > >
> > >         list_for_each_entry(ema, &lport->ema_list, ema_list)
> > >                 if (!ema->match || ema->match(fp))
> > >                         return fc_exch_em_alloc(lport, ema->mp);
> > >         return NULL;                            ***** Never matches, so returns NULL
> > > }
> > >
> > >
> > > RHDEBUG: In fc_exch_seq_send returned NULL because !ep with ep = (null)
> > > RHDEBUG: rc -1 with !seq = (null) after calling tt.exch_seq_send within fc_fcp_cmd_send
> > > RHDEBUG: rc non zero in :unlock within fc_fcp_cmd_send = -1
> > > RHDEBUG: In fc_fcp_pkt_send, we returned from rc = lport->tt.fcp_cmd_send with rc = -1
> > >
> > > RHDEBUG: We hit SCSI_MLQUEUE_HOST_BUSY in fc_queuecommand with rval in fc_fcp_pkt_send=-1
> > >
> > > I am trying to get my head around why a large multi-node system sees this
> > > issue even with NUMA disabled.
> > > Has anybody seen this, or is anyone aware of this issue with similar
> > > configurations (using fc_queuecommand)?
> > >
> > > I am continuing to add debug to narrow this down.
> > >
> > You might actually be hitting a limitation in the exchange manager code.
> > The libfc exchange manager tries to be really clever and will assign a
> > per-CPU exchange manager (probably to increase locality). However, we
> > only have a limited number of exchanges, so on large systems we might
> > actually run into an exchange starvation problem, where we have in theory
> > enough free exchanges, but none for the submitting CPU.
> >
> > (Personally, I think the exchange manager code is in urgent need of reworking;
> > it should be replaced by the sbitmap code from Omar.)
> >
> > Do check how many free exchanges are actually present for the stalling
> > CPU; it might be that you run into a starvation issue.
> >
> > Cheers,
> >
> > Hannes
> > --
> > Dr. Hannes Reinecke zSeries & Storage
> > hare@suse.de +49 911 74053 688
> > SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
> > GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
> >
> Hi Hannes,
> Thanks for responding
>
> I am adding additional debug as I type this.
>
> I am using the latest linux-next; I assume your latest FCoE fixes are not in there yet.
> What is puzzling here is that an identical kernel with 1 NUMA node and only 8GB
> of memory does not see this.
> Surely if I were running out of exchanges that would show up on the smaller
> system as well.
> I am able to get to 1500 IOPS with the same I/O exerciser on the smaller
> system with ZERO blk_requeue_request() calls.
> Again: same kernel, same ixgbe, same FCoE switch, etc.
>
> I specifically traced those calls initially because we saw them in the blktrace.
>
> I don't understand the match handling in the list traversal here
> very well; I am still trying to understand the code flow.
> The bool match() I also cannot figure out from the code.
> It runs fc_exch_em_alloc() if either the match pointer is NULL or match(fp) returns true.
> I can't find the actual code for the match; I will have to get a vmcore to find
> it.
>
> static inline struct fc_exch *fc_exch_alloc(struct fc_lport *lport,
>                                             struct fc_frame *fp)
> {
>         struct fc_exch_mgr_anchor *ema;
>
>         list_for_each_entry(ema, &lport->ema_list, ema_list)
>                 if (!ema->match || ema->match(fp))
>                         return fc_exch_em_alloc(lport, ema->mp);
>         return NULL;                            ***** Never matches, so returns NULL
> }
>
> Will reply after some finer debug has been added.
>
> Again, this is specific to fc_queuecommand and software FCoE; it is not an issue with the FC
> queuecommand in the HBA templates, for example lpfc or qla2xxx, and it is also not
> applicable to full-offload FCoE like the Emulex cards.
> Thanks
> Laurence
Hi Hannes,
Replying to my own prior message.
I added the additional debug:
RHDEBUG: in fc_exch_em_alloc: returning NULL in err: path jumped from allocate new exch from pool because index == pool->next_index
RHDEBUG: In fc_exch_seq_send returned NULL because !ep with ep = (null)
RHDEBUG: rc -1 with !seq = (null) after calling tt.exch_seq_send within fc_fcp_cmd_send
RHDEBUG: rc non zero in :unlock within fc_fcp_cmd_send = -1
RHDEBUG: In fc_fcp_pkt_send, we returned from rc = lport->tt.fcp_cmd_send with rc = -1
RHDEBUG: We hit SCSI_MLQUEUE_HOST_BUSY in fc_queuecommand with rval in fc_fcp_pkt_send=-1
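As an aside, that last RHDEBUG line is where the blk_requeue_request storm in the
blktraces comes from: when queuecommand returns SCSI_MLQUEUE_HOST_BUSY, the midlayer
puts the request back on the queue and retries it later, so the I/O eventually
completes, just after one or more requeue cycles. A minimal userspace toy model of
that retry loop, with hypothetical names and not actual kernel code:

#include <stdio.h>
#include <stdbool.h>

#define HOST_BUSY 1     /* stands in for SCSI_MLQUEUE_HOST_BUSY */

/* hypothetical stand-in for the exchange allocator */
static bool exchange_available(int attempt)
{
        return attempt >= 3;    /* pretend a free exchange appears on retry 3 */
}

static int queuecommand(int attempt)
{
        if (!exchange_available(attempt))
                return HOST_BUSY;       /* command cannot be sent right now */
        return 0;
}

int main(void)
{
        int attempt = 0;

        while (queuecommand(attempt) == HOST_BUSY) {
                attempt++;              /* analogous to blk_requeue_request() */
                printf("requeue #%d\n", attempt);
        }
        printf("I/O dispatched after %d requeues\n", attempt);
        return 0;
}

Every requeue adds latency, which is why throughput collapses even though no I/O
actually fails.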
So we are actually failing in fc_exch_em_alloc with index == pool->next_index, not in the list traversal as I originally thought.
This then seems to match what you said: we are running out of exchanges.
During my testing, if I start multiple dd's and ramp them up, for example 100 parallel dd's with 64 CPUs, I see this.
On the 4 CPU system it does not happen.
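My working theory, following your per-CPU pool explanation, is that the XID range
gets carved into one pool per CPU (with the CPU count rounded up to a power of two,
if I am reading the setup code right), so each pool on the 64-CPU box is tiny
compared to the 4-CPU box. A quick back-of-the-envelope sketch of that arithmetic
(standalone C; the 4096-entry XID range is my assumption for the software FCoE
defaults, not taken from the code):

#include <stdio.h>

static unsigned int roundup_pow_of_two(unsigned int n)
{
        unsigned int p = 1;

        while (p < n)
                p <<= 1;
        return p;
}

int main(void)
{
        const unsigned int xid_range = 4096;    /* assumed, not verified */
        unsigned int cpus[] = { 4, 64 };
        unsigned int i;

        for (i = 0; i < 2; i++) {
                unsigned int pools = roundup_pow_of_two(cpus[i]);
                printf("%2u CPUs -> %u pools -> %u exchanges per per-CPU pool\n",
                       cpus[i], pools, xid_range / pools);
        }
        return 0;
}

With those assumptions the 64-CPU system would have only 64 exchanges per pool
versus 1024 on the 4-CPU system, so a burst of commands queued from one CPU could
exhaust its pool on the big box while never coming close on the small one.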
/**
 * fc_exch_em_alloc() - Allocate an exchange from a specified EM.
 * @lport: The local port that the exchange is for
 * @mp:    The exchange manager that will allocate the exchange
 *
 * Returns pointer to allocated fc_exch with exch lock held.
 */
static struct fc_exch *fc_exch_em_alloc(struct fc_lport *lport,
                                        struct fc_exch_mgr *mp)
{
        struct fc_exch *ep;
        unsigned int cpu;
        u16 index;
        struct fc_exch_pool *pool;

        /* allocate memory for exchange */
        ep = mempool_alloc(mp->ep_pool, GFP_ATOMIC);
        if (!ep) {
                atomic_inc(&mp->stats.no_free_exch);
                goto out;
        }
        memset(ep, 0, sizeof(*ep));

        cpu = get_cpu();
        pool = per_cpu_ptr(mp->pool, cpu);
        spin_lock_bh(&pool->lock);
        put_cpu();

        /* peek cache of free slot */
        if (pool->left != FC_XID_UNKNOWN) {
                index = pool->left;
                pool->left = FC_XID_UNKNOWN;
                goto hit;
        }
        if (pool->right != FC_XID_UNKNOWN) {
                index = pool->right;
                pool->right = FC_XID_UNKNOWN;
                goto hit;
        }

        index = pool->next_index;
        /* allocate new exch from pool */
        while (fc_exch_ptr_get(pool, index)) {
                index = index == mp->pool_max_index ? 0 : index + 1;
                if (index == pool->next_index)
                        goto err;
        }
        ...
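For my own understanding, the failing condition is the circular scan of the per-CPU
pool wrapping back to its starting point. A minimal standalone sketch of that search
pattern (simplified, with a hypothetical toy_pool type, not the libfc implementation):

struct toy_pool {                       /* hypothetical, for illustration only */
        void **slots;                   /* slots[i] != NULL means XID i is in use */
        unsigned int next_index;        /* where the next scan starts */
        unsigned int max_index;         /* last valid index in this pool */
};

/* Returns a free index, or -1 when the scan wraps back to next_index. */
int toy_pool_find_free(const struct toy_pool *pool)
{
        unsigned int index = pool->next_index;

        while (pool->slots[index]) {
                index = (index == pool->max_index) ? 0 : index + 1;
                if (index == pool->next_index)
                        return -1;      /* wrapped: every slot in this pool is busy */
        }
        return (int)index;
}

That wrap-around is the "index == pool->next_index" error path my RHDEBUG line is
reporting: the pool for the submitting CPU has no free slot left, even if other
CPUs' pools still do.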
I will apply your latest FCoE patches; can you provide a link to your tree?
Thanks
Laurence