Re: [PATCH v15 1/2] drm/tegra: dc: Support memory bandwidth management

From: Dmitry Osipenko <digetx@gmail.com>
To: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: Thierry Reding <thierry.reding@gmail.com>,
	Jonathan Hunter <jonathanh@nvidia.com>,
	Matt Merhar <mattmerhar@protonmail.com>,
	Peter Geis <pgwipeout@gmail.com>,
	Nicolas Chauvet <kwizart@gmail.com>,
	linux-tegra@vger.kernel.org, linux-pm@vger.kernel.org,
	linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org
Subject: Re: [PATCH v15 1/2] drm/tegra: dc: Support memory bandwidth management
Date: Mon, 15 Mar 2021 21:39:06 +0300	[thread overview]
Message-ID: <1158bbca-8880-918d-7564-e2e30552e6b3@gmail.com> (raw)
In-Reply-To: <20210314223119.GC2733@qmqm.qmqm.pl>

15.03.2021 01:31, Michał Mirosław пишет:
> On Thu, Mar 11, 2021 at 08:22:54PM +0300, Dmitry Osipenko wrote:
>> Display controller (DC) performs isochronous memory transfers, and thus,
>> has a requirement for a minimum memory bandwidth that shall be fulfilled,
>> otherwise framebuffer data can't be fetched fast enough and this results
>> in a DC's data-FIFO underflow that follows by a visual corruption.
> [...]
>> +static unsigned long
>> +tegra_plane_overlap_mask(struct drm_crtc_state *state,
>> +			 const struct drm_plane_state *plane_state)
>> +{
>> +	const struct drm_plane_state *other_state;
>> +	const struct tegra_plane *tegra;
>> +	unsigned long overlap_mask = 0;
>> +	struct drm_plane *plane;
>> +	struct drm_rect rect;
>> +
>> +	if (!plane_state->visible || !plane_state->fb)
>> +		return 0;
>> +
>> +	drm_atomic_crtc_state_for_each_plane_state(plane, other_state, state) {
> [...]
>> +	/*
>> +	 * Data-prefetch FIFO will easily help to overcome temporal memory
>> +	 * pressure if other plane overlaps with the cursor plane.
>> +	 */
>> +	if (tegra_plane_is_cursor(plane_state) && overlap_mask)
>> +		return 0;
>> +
>> +	return overlap_mask;
>> +}
> 
> Since for cursor plane this always returns 0, you could test
> tegra_plane_is_cursor() at the start of the function.

Yes, thanks.

>> +static int tegra_crtc_calculate_memory_bandwidth(struct drm_crtc *crtc,
>> +						 struct drm_atomic_state *state)
> [...]
>> +	/*
>> +	 * For overlapping planes pixel's data is fetched for each plane at
>> +	 * the same time, hence bandwidths are accumulated in this case.
>> +	 * This needs to be taken into account for calculating total bandwidth
>> +	 * consumed by all planes.
>> +	 *
>> +	 * Here we get the overlapping state of each plane, which is a
>> +	 * bitmask of plane indices telling with what planes there is an
>> +	 * overlap. Note that bitmask[plane] includes BIT(plane) in order
>> +	 * to make further code nicer and simpler.
>> +	 */
>> +	drm_atomic_crtc_state_for_each_plane_state(plane, plane_state, new_state) {
>> +		tegra_state = to_const_tegra_plane_state(plane_state);
>> +		tegra = to_tegra_plane(plane);
>> +
>> +		if (WARN_ON_ONCE(tegra->index >= TEGRA_DC_LEGACY_PLANES_NUM))
>> +			return -EINVAL;
>> +
>> +		plane_peak_bw[tegra->index] = tegra_state->peak_memory_bandwidth;
>> +		mask = tegra_plane_overlap_mask(new_state, plane_state);
>> +		overlap_mask[tegra->index] = mask;
>> +
>> +		if (hweight_long(mask) != 3)
>> +			all_planes_overlap_simultaneously = false;
>> +	}
>> +
>> +	old_state = drm_atomic_get_old_crtc_state(state, crtc);
>> +	old_dc_state = to_const_dc_state(old_state);
>> +
>> +	/*
>> +	 * Then we calculate maximum bandwidth of each plane state.
>> +	 * The bandwidth includes the plane BW + BW of the "simultaneously"
>> +	 * overlapping planes, where "simultaneously" means areas where DC
>> +	 * fetches from the planes simultaneously during of scan-out process.
>> +	 *
>> +	 * For example, if plane A overlaps with planes B and C, but B and C
>> +	 * don't overlap, then the peak bandwidth will be either in area where
>> +	 * A-and-B or A-and-C planes overlap.
>> +	 *
>> +	 * The plane_peak_bw[] contains peak memory bandwidth values of
>> +	 * each plane, this information is needed by interconnect provider
>> +	 * in order to set up latency allowness based on the peak BW, see
>> +	 * tegra_crtc_update_memory_bandwidth().
>> +	 */
>> +	for (i = 0; i < ARRAY_SIZE(plane_peak_bw); i++) {
>> +		overlap_bw = 0;
>> +
>> +		for_each_set_bit(k, &overlap_mask[i], 3) {
>> +			if (k == i)
>> +				continue;
>> +
>> +			if (all_planes_overlap_simultaneously)
>> +				overlap_bw += plane_peak_bw[k];
>> +			else
>> +				overlap_bw = max(overlap_bw, plane_peak_bw[k]);
>> +		}
>> +
>> +		new_dc_state->plane_peak_bw[i] = plane_peak_bw[i] + overlap_bw;
>> +
>> +		/*
>> +		 * If plane's peak bandwidth changed (for example plane isn't
>> +		 * overlapped anymore) and plane isn't in the atomic state,
>> +		 * then add plane to the state in order to have the bandwidth
>> +		 * updated.
>> +		 */
>> +		if (old_dc_state->plane_peak_bw[i] !=
>> +		    new_dc_state->plane_peak_bw[i]) {
>> +			plane = tegra_crtc_get_plane_by_index(crtc, i);
>> +			if (!plane)
>> +				continue;
>> +
>> +			plane_state = drm_atomic_get_plane_state(state, plane);
>> +			if (IS_ERR(plane_state))
>> +				return PTR_ERR(plane_state);
>> +		}
>> +	}
> [...]
> 
> Does it matter to which channel (plane) the peak bw is attached? Would
> it still work if the first channel specified max(peak_bw of overlaps)
> and others only zeroes?

The latency allowance will be configured incorrectly for the case of
zeroes by the memory driver, hence it does matter.

>> +	/*
>> +	 * Horizontal downscale needs a lower memory latency, which roughly
>> +	 * depends on the scaled width.  Trying to tune latency of a memory
>> +	 * client alone will likely result in a strong negative impact on
>> +	 * other memory clients, hence we will request a higher bandwidth
>> +	 * since latency depends on bandwidth.  This allows to prevent memory
>> +	 * FIFO underflows for a large plane downscales, meanwhile allowing
>> +	 * display to share bandwidth fairly with other memory clients.
>> +	 */
>> +	if (src_w > dst_w)
>> +		mul = (src_w - dst_w) * bpp / 2048 + 1;
>> +	else
>> +		mul = 1;
> [...]
> 
> One point is unexplained yet: why is the multiplier proportional to a
> *difference* between src and dst widths? Also, I would expect max (worst
> case) is pixclock * read_size when src_w/dst_w >= read_size.

IIRC, the difference gives a more adequate/practical result than the
proportion. Although, downstream driver uses proportion. I'll try to
revisit this for the next version of the patch.

> BTW, you could move this below and : if (src > dst_w) peak_bandwidth *= ...

Indeed, and should be a bit nicer to use 'mul' in both cases.