From: Harish Chegondi <harish.chegondi@intel.com>
To: "Dixit, Ashutosh" <ashutosh.dixit@intel.com>
Cc: <intel-xe@lists.freedesktop.org>
Subject: Re: [PATCH v8 4/7] drm/xe/eustall: Return -EIO error from read() if HW drops data
Date: Fri, 31 Jan 2025 14:59:16 -0800 [thread overview]
Message-ID: <Z51VxJa6KjavSAol@intel.com> (raw)
In-Reply-To: <85wmeaenx9.wl-ashutosh.dixit@intel.com>
On Fri, Jan 31, 2025 at 12:19:46PM -0800, Dixit, Ashutosh wrote:
> On Fri, 31 Jan 2025 11:30:03 -0800, Harish Chegondi wrote:
> >
> > On Wed, Jan 29, 2025 at 08:45:59PM -0800, Dixit, Ashutosh wrote:
> > > On Wed, 15 Jan 2025 12:02:10 -0800, Harish Chegondi wrote:
> > > >
> > > > If the user space doesn't read the EU stall data fast enough,
> > > > it is possible that the EU stall data buffer can get filled,
> > > > and if the hardware wants to write more data, it simply drops
> > > > data due to unavailable buffer space. In that case, hardware
> > > > sets a bit in a register. If the driver detects data drop,
> > > > the driver read() returns -EIO error to let the user space
> > > > know that HW has dropped data. The -EIO error is returned
> > > > even if there is EU stall data in the buffer. A subsequent
> > > > read by the user space returns the remaining EU stall data.
> > >
> > > As I mentioned earlier, entire dropped packet handling should be in this
> > > patch, so we can see the entire logic around this. So data_drop struct
> > > should be defined in this patch.
> > I worded the commit message that this commit is about read() returning
> > -EIO when data is dropped. So, I didn't put all the data drop code in
> > this patch. Sure, I can reword the commit message and move the code
> > into this patch.
>
> Yeah, the "unit of functionality" is not returning -EIO, it is "dropped
> data handling", so that whole unit should be in a separate patch. This one,
> I don't want to compromise on.
> > >
> > > >
> > > > Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
> > > > ---
> > > > drivers/gpu/drm/xe/xe_eu_stall.c | 12 ++++++++++++
> > > > drivers/gpu/drm/xe/xe_eu_stall.h | 1 +
> > > > 2 files changed, 13 insertions(+)
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_eu_stall.c b/drivers/gpu/drm/xe/xe_eu_stall.c
> > > > index c388d733b857..437782f8433c 100644
> > > > --- a/drivers/gpu/drm/xe/xe_eu_stall.c
> > > > +++ b/drivers/gpu/drm/xe/xe_eu_stall.c
> > > > @@ -472,6 +472,7 @@ xe_eu_stall_stream_read_locked(struct xe_eu_stall_data_stream *stream,
> > > > * before calling read().
> > > > *
> > > > * Returns: The number of bytes copied or a negative error code on failure.
> > > > + * -EIO if HW drops any EU stall data when the buffer is full.
> > > > */
> > > > static ssize_t xe_eu_stall_stream_read(struct file *file, char __user *buf,
> > > > size_t count, loff_t *ppos)
> > > > @@ -485,6 +486,16 @@ static ssize_t xe_eu_stall_stream_read(struct file *file, char __user *buf,
> > > > return -EINVAL;
> > > > }
> > > >
> > > > + if (bitmap_weight(stream->data_drop.mask, XE_MAX_DSS_FUSE_BITS)) {
> > >
> > > Since data_drop.mask is being touched elsewhere under xecore_buf->lock,
> > > here also it should be accessed under the same lock. So this returning -EIO
> > > should probably be moved into xe_eu_stall_stream_read_locked?
> > data_drop.mask is being accessed via set_bit(), clear_bit(), test_bit()
> > and bitmap_weight(). set_bit() and clear_bit() are atomic operations,
> > but test_bit() and bitmap_weight() are not atomic. So, not all the code
> > accessing the mask need to be under lock. The code that is under lock is
> > under the buffer lock, whereas xe_eu_stall_stream_read_locked() is under
> > gt->eu_stall->lock. So, moving this code into xe_eu_stall_stream_read_locked
> > would not make any difference. I think this code can exist outside of a
> > lock. If one read() just misses a data drop, the subsequent read would
> > report the data drop.
>
> OK, let me see what happens with my suggestion below (starting with "I had
> also outlined another...") and then we can see if the locking will be ok or
> not.
>
> > >
> > > > + if (!stream->data_drop.reported_to_user) {
> > > > + stream->data_drop.reported_to_user = true;
> > > > + xe_gt_dbg(gt, "EU stall data dropped in XeCores: %*pb\n",
> > > > + XE_MAX_DSS_FUSE_BITS, stream->data_drop.mask);
> > > > + return -EIO;
> > > > + }
> > > > + stream->data_drop.reported_to_user = false;
> > >
> > > I don't think this logic is correct. We should set this to false only after
> > > we have cleared all set bits (e.g. only after bitmap_weight) otherwise we
> > > might keep returning -EIO multiple times?
> > If the subsequent read() reads all the data from all the subslices, it
> > would clear all the bits. But if the user buffer is small and it doesn't
> > read all the data from all the subslices, some bits can continue to be
> > set and can cause multiple alternate -EIO returns. Ideally, the user
> > buffer should be big enough to accomodate all the data from the kernel
> > buffer.
>
> Afaik we are not exposing a minimum user buffer size in the uapi, but we
> could too.
Starting patch series v6, the per subslice buffer size is being exposed
to the user space via the query IOCTL. Recommended user buffer size is:
number of subslices x per subslice buffer size which is same as the
kernel buffer size.
>
> I had also outlined another simple way of doing this in my follow up to
> this email, which doesn't have such issues. What do you think of that?
I don't think driver dropping data is a good idea. The HW drops data
only when the buffer is full. Even though HW drops some data, a buffer
full of day will be useful to the user space.
>
> > >
> > > If HW continues to drop data and keep setting the line, while we are
> > > resetting the bit, it is possible bitmap_weight might never become 0. I
> > > think that is ok, we have returned -EIO at least once to indicate to
> > > userspace that it is not reading data fast enough and HW is dropping data.
> > >
> > > Or we may return -EIO multiple times as is happening here, where
> > > reported_to_user is set to 0 before all bits might have been cleared. So
> > > what is happening here might be ok too.
> > >
> > > To see this clearly and evaluate it is why I am saying move all of this
> > > data drop handling and -EIO return into this one patch. So we can decide
> > > which approach to take: return -EIO just once or return multiple times.
I will move all the data drop code into this patch.
> > >
> > > We can also maybe defer this patch and merge the other stuff first if it's
> > > a separate patch.
> > >
> > > So maybe this is ok, maybe not, anyway something to think about.
> > >
> > > > + }
> > > > +
> > > > if (!(file->f_flags & O_NONBLOCK)) {
> > > > do {
> > > > if (!stream->pollin) {
> > > > @@ -680,6 +691,7 @@ static int xe_eu_stall_stream_init(struct xe_eu_stall_data_stream *stream,
> > > > if (!stream->xecore_buf)
> > > > return -ENOMEM;
> > > >
> > > > + stream->data_drop.reported_to_user = false;
> > > > bitmap_zero(stream->data_drop.mask, XE_MAX_DSS_FUSE_BITS);
> > >
> > > Stream is kzalloc'd, why do you need to init these?
> > >
> > > >
> > > > xe_pm_runtime_get(gt_to_xe(gt));
> > > > diff --git a/drivers/gpu/drm/xe/xe_eu_stall.h b/drivers/gpu/drm/xe/xe_eu_stall.h
> > > > index f97c8bf8e852..8bc44e9e98af 100644
> > > > --- a/drivers/gpu/drm/xe/xe_eu_stall.h
> > > > +++ b/drivers/gpu/drm/xe/xe_eu_stall.h
> > > > @@ -31,6 +31,7 @@ struct xe_eu_stall_data_stream {
> > > > struct xe_bo *bo;
> > > > struct per_xecore_buf *xecore_buf;
> > > > struct {
> > > > + bool reported_to_user;
> > > > xe_dss_mask_t mask;
> > > > } data_drop;
> > > > struct hrtimer poll_check_timer;
> > > > --
> > > > 2.47.1
> > > >
next prev parent reply other threads:[~2025-01-31 22:59 UTC|newest]
Thread overview: 49+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-01-15 20:02 [PATCH v8 0/7] Add support for EU stall sampling Harish Chegondi
2025-01-15 20:02 ` [PATCH v8 1/7] drm/xe/topology: Add a function to find the index of the last enabled DSS in a mask Harish Chegondi
2025-01-17 17:25 ` Dixit, Ashutosh
2025-01-22 5:18 ` Harish Chegondi
2025-01-15 20:02 ` [PATCH v8 2/7] drm/xe/uapi: Introduce API for EU stall sampling Harish Chegondi
2025-01-17 19:02 ` Dixit, Ashutosh
2025-01-22 23:44 ` Harish Chegondi
2025-01-23 2:19 ` Dixit, Ashutosh
2025-01-15 20:02 ` [PATCH v8 3/7] drm/xe/eustall: Implement EU stall sampling APIs for Xe_HPC Harish Chegondi
2025-01-18 2:34 ` Dixit, Ashutosh
2025-01-23 18:51 ` Dixit, Ashutosh
2025-01-25 3:09 ` [PATCH v8 3 " Dixit, Ashutosh
2025-01-29 4:12 ` [PATCH v8 4 " Dixit, Ashutosh
2025-01-29 4:32 ` Dixit, Ashutosh
2025-01-30 18:46 ` Harish Chegondi
2025-01-31 3:23 ` Dixit, Ashutosh
2025-01-15 20:02 ` [PATCH v8 4/7] drm/xe/eustall: Return -EIO error from read() if HW drops data Harish Chegondi
2025-01-30 4:45 ` Dixit, Ashutosh
2025-01-30 17:05 ` Dixit, Ashutosh
2025-01-31 21:50 ` Harish Chegondi
2025-01-31 19:30 ` Harish Chegondi
2025-01-31 20:19 ` Dixit, Ashutosh
2025-01-31 22:59 ` Harish Chegondi [this message]
2025-02-01 0:13 ` Dixit, Ashutosh
2025-02-01 6:57 ` Dixit, Ashutosh
2025-01-15 20:02 ` [PATCH v8 5/7] drm/xe/eustall: Add EU stall sampling support for Xe2 Harish Chegondi
2025-01-30 4:55 ` Dixit, Ashutosh
2025-02-05 1:16 ` Olson, Matthew
2025-02-05 1:57 ` Dixit, Ashutosh
2025-02-05 19:03 ` Olson, Matthew
2025-02-05 20:02 ` Dixit, Ashutosh
2025-01-15 20:02 ` [PATCH v8 6/7] drm/xe/uapi: Add a device query to get EU stall sampling information Harish Chegondi
2025-01-16 22:34 ` Dixit, Ashutosh
2025-01-22 2:48 ` Harish Chegondi
2025-01-22 3:00 ` Dixit, Ashutosh
2025-01-30 17:36 ` Dixit, Ashutosh
2025-01-15 20:02 ` [PATCH v8 7/7] drm/xe/eustall: Add workaround 22016596838 which applies to PVC Harish Chegondi
2025-01-30 5:14 ` Dixit, Ashutosh
2025-01-15 20:46 ` ✓ CI.Patch_applied: success for Add support for EU stall sampling Patchwork
2025-01-15 20:46 ` ✗ CI.checkpatch: warning " Patchwork
2025-01-15 20:48 ` ✓ CI.KUnit: success " Patchwork
2025-01-15 21:14 ` ✓ CI.Build: " Patchwork
2025-01-15 21:16 ` ✗ CI.Hooks: failure " Patchwork
2025-01-15 21:18 ` ✓ CI.checksparse: success " Patchwork
2025-01-15 21:43 ` ✗ Xe.CI.BAT: failure " Patchwork
2025-01-16 0:37 ` ✗ Xe.CI.Full: " Patchwork
2025-01-16 0:51 ` [PATCH v8 0/7] " Degrood, Felix J
2025-01-16 21:50 ` Olson, Matthew
2025-01-18 5:19 ` Harish Chegondi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z51VxJa6KjavSAol@intel.com \
--to=harish.chegondi@intel.com \
--cc=ashutosh.dixit@intel.com \
--cc=intel-xe@lists.freedesktop.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).