From: Potnuri Bharat Teja <bharat@chelsio.com>
To: target-devel@vger.kernel.org
Subject: Re: Connection errors with ISER IO
Date: Tue, 10 Mar 2020 12:13:51 +0000
Message-ID: <20200310120150.GA7669@chelsio.com>
In-Reply-To: <20200226155241.GA28594@chelsio.com>

On Wednesday, March 03/04/20, 2020 at 23:56:12 +0530, Potnuri Bharat Teja wrote:
> On Friday, February 02/28/20, 2020 at 05:12:32 +0530, Sagi Grimberg wrote:
> > 
> > >>> Hi All,
> > >>> I observe connection errors almost immediately after I start iozone over iSER
> > >>> LUNs. Attached are the connection error and hung task traces on the initiator
> > >>> and target respectively.
> > >>> Interestingly, I see connection errors only if the LUN size is less than 512MB.
> > >>> In my case I could consistently reproduce the issue with 511MB and 300MB
> > >>> LUN sizes. Connection errors are not seen if I create a 512MB or greater LUN.
> > >>
> > >> Can you share the log output on the target from before the hung tasks?
> > > 
> > > Sure, attached are the target and initiator dmesg logs.
> > >>
> > >>> Further, after the connection errors, I noticed that the poll work queue is
> > >>> stuck and never processes drain CQE resulting in hung tasks on the target side.
> > >>
> > >> Is the drain CQE actually generated?
> > >>
> > > 
> > > Yes it is generated. I was able to track it with prints until queue_work() in
> > > ib_cq_completion_workqueue(). Work Function ib_cq_poll_work() is never getting
> > > scheduled. Therefore, I see drain CQE unpolled and hung task due to
> > > __ib_drain_sq() waiting forever for complete() to be called from drain CQE
> > > done() handler.
> > 
> > Hmm, that is interesting. This tells me that cq->work is probably
> > blocked by another completion invocation (which hangs), which means that
> > queuing the cq->work did not happen as workqueues are not re-entrant.
> > 
> > Looking at the code, nothing should be blocking in the isert ->done()
> > handlers, so it's not clear to me how this can happen.
> > 
> > Would it be possible to run:
> > echo t > /proc/sysrq-trigger when this happens? I'd like to see where
> > that cq->work is blocking.
> >
> Attached file t_sysrq-trigger_and_dmesg.txt is the triggered output. Please let 
> me know if that is timed correctly as I triggered it a little after login timeout.
> I'll try getting a better one meanwhile.
> > I'd also enable pr_debug on iscsi_target.c
> > 
> Attached files are with debug enabled:
> tgt_discovery_and_login_dmesg.txt -> dmesg just after login for reference
> tgt_IO_511MB_8target_1lun_each_iozone_dmesg_untill_hang.txt -> dmesg until the connection error.
> 
> Please let me know if there is anything that I could check.

Hi Sagi,
Did you get a chance to check this?
Thanks.
> > > 
> > >>> I tried changing the CQ poll workqueue to be UNBOUND but it did not fix the issue.
> > >>>
> > >>> Here is what my test does:
> > >>> Create 8 targets with 511MB lun each, login and format disks to ext3, mount the
> > >>> disks and run iozone over them.
> > >>> #iozone -a -I -+d -g 256m
> > >>
> > >> Does it happen specifically with iozone? or can dd/fio also
> > >> reproduce this issue? on which I/O pattern do you see the issue?
> > >>
> > > I see it with iozone. I am trying with fio and will update soon.
> > > I see the issue at I/O sizes around the 128k/256k block sizes of iozone. It's
> > > not consistent.
> > >>> I am not sure how LUN size could cause the connection errors. I appreciate any
> > >>> inputs on this.
> > >>
> > >> I imagine that a single LUN is enough to reproduce the issue?
> > >>
> > > 
> > > Yes, attached is the target conf.
> > >> btw, I tried reproducing the issue with rxe (couldn't setup an iser
> > >> listener with siw) in 2 VMs on my laptop using lio to a file backend but
> > >> I cannot reproduce the issue..
> > > I see the issue quickly with 40G/25G links. I have not seen the issue on a 100G
> > > link. BTW, I am trying iWARP (T6/T5).
> > > 
> > > Thanks for looking into it.
> > > 
> > 
> > From the log, it looks like the hang happens when the initiator tries to
> > log in after the failure (the trace starts in iscsi_target_do_login), and
> > it looks like the target gave up on login timeout, but what is not
> > indicated is why the initiator got a ping timeout in the
> > first place...
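
For anyone trying to gather the same data, the two debug steps requested in
this thread (pr_debug output for iscsi_target.c and a SysRq 't' task dump)
could be scripted roughly as below. This is only a sketch, not what was
actually run for the attached logs. It assumes debugfs is mounted at
/sys/kernel/debug, the kernel has CONFIG_DYNAMIC_DEBUG enabled, and the
script runs as root on the target. The run() and collect_traces() helpers,
the DRY_RUN switch, and the OUT file name are hypothetical names made up for
this example.

```shell
#!/bin/sh
# Sketch only: collect the traces discussed in this thread on the target host.
# Assumptions (not from the thread): debugfs is mounted at /sys/kernel/debug,
# CONFIG_DYNAMIC_DEBUG=y, and the caller is root.
# DRY_RUN=1 (the default here) only prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        sh -c "$*"
    fi
}

# Hypothetical helper: enable debug prints, dump task states, save the log.
collect_traces() {
    # Enable pr_debug() call sites in the iSCSI target core via dynamic debug.
    run "echo 'file iscsi_target.c +p' > /sys/kernel/debug/dynamic_debug/control"
    # Ask the kernel to dump the state of all tasks to the kernel log (SysRq 't').
    run "echo t > /proc/sysrq-trigger"
    # Save the kernel log; the output file name is a placeholder.
    run "dmesg > ${OUT:-tgt_dmesg.txt}"
}
```

Calling collect_traces with the default DRY_RUN=1 only prints the three
commands for review. Setting DRY_RUN=0 before the call executes them, ideally
while the hang is still in progress so that the task dump shows where
cq->work is blocked.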
