Linux virtualization list


* [PATCH v3 3/4] block: drop shared-tag fairness throttling
From: Sumit Saxena @ 2026-06-09 12:18 UTC
  To: Martin K . Petersen, Jens Axboe
  Cc: James E . J . Bottomley, linux-scsi, linux-block, Adam Radford,
	Khalid Aziz, Adaptec OEM Raid Solutions, Matthew Wilcox,
	Hannes Reinecke, Juergen E . Fischer, Russell King,
	linux-arm-kernel, Finn Thain, Michael Schmitz, Anil Gurumurthy,
	Sudarsana Kalluru, Oliver Neukum, Ali Akcaagac, Jamie Lenehan,
	Ram Vegesna, target-devel, Bradley Grove, Satish Kharat,
	Sesidhar Baddela, Karan Tilak Kumar, Yihang Li, Don Brace,
	storagedev, HighPoint Linux Team, Tyrel Datwyler,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, linuxppc-dev, Brian King, Lee Duncan,
	Chris Leech, Mike Christie, open-iscsi, Justin Tee, Paul Ely,
	Kashyap Desai, Shivasharan S, Chandrakanth Patil,
	megaraidlinux.pdl, Sathya Prakash Veerichetty, Sreekanth Reddy,
	mpi3mr-linuxdrv.pdl, Suganath Prabu Subramani, Ranjan Kumar,
	MPT-FusionLinux.pdl, Daniel Palmer, GOTO Masanori, YOKOTA Hiroshi,
	Jack Wang, Geoff Levand, Michael Reed, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Narsimhulu Musini, K . Y . Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, linux-hyperv,
	Michael S . Tsirkin, Jason Wang, Paolo Bonzini, Stefan Hajnoczi,
	Eugenio Perez, virtualization, Vishal Bhakta,
	bcm-kernel-feedback-list, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, xen-devel, Bart Van Assche, Sumit Saxena
In-Reply-To: <20260609121806.2121755-1-sumit.saxena@broadcom.com>

From: Bart Van Assche <bvanassche@acm.org>

This is a rebase of Bart Van Assche's original patch [1] onto the current
tree.  Removing the fair-sharing throttle on shared tag queues improved
IOPS by roughly 16-18% in testing.

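This patch removes the following code and structure members:
- The function hctx_may_queue().
- blk_mq_hw_ctx.nr_active and request_queue.nr_active_requests_shared_tags,
  along with all the code that modifies these two member variables.

Because the counters are gone, hctx_active_show() in debugfs now computes
its value on demand by iterating over the tags with blk_mq_all_tag_iter().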
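
For readers unfamiliar with the throttle, the limit that hctx_may_queue()
applied can be modelled in userspace roughly as follows.  This is only a
sketch: fair_share_depth() and its parameters are hypothetical names, and
the shared-tag and active-state checks that hctx_may_queue() performs
before reaching this arithmetic are left out.

```c
/*
 * Userspace model of the per-queue tag limit computed by the removed
 * hctx_may_queue().  fair_share_depth() is a hypothetical helper name,
 * not a kernel function.
 */
static unsigned int fair_share_depth(unsigned int total_depth,
				     unsigned int active_queues)
{
	unsigned int share;

	/* No recorded users, or a single-tag bitmap: no limit applies. */
	if (active_queues == 0 || total_depth == 1)
		return total_depth;

	/* Round-up division of the tag depth among the active queues. */
	share = (total_depth + active_queues - 1) / active_queues;

	/* Allow at least some tags, as the kernel code did with max(..., 4U). */
	return share < 4 ? 4 : share;
}
```

In this model a 256-tag set shared by three active queues caps each queue
at 86 tags.  The removed code refused a tag once a queue had that many
requests active, even when free tags remained in the set.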

[1]: https://lore.kernel.org/linux-block/20240529213921.3166462-1-bvanassche@acm.org/

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com>
---
 block/blk-core.c       |   2 -
 block/blk-mq-debugfs.c |  22 ++++++++-
 block/blk-mq-tag.c     |   4 --
 block/blk-mq.c         |  17 +------
 block/blk-mq.h         | 100 -----------------------------------------
 include/linux/blk-mq.h |   6 ---
 include/linux/blkdev.h |   2 -
 7 files changed, 22 insertions(+), 131 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 17450058ea6d..129acc1b27e5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -421,8 +421,6 @@ struct request_queue *blk_alloc_queue(struct queue_limits *lim, int node_id)
 
 	q->node = node_id;
 
-	atomic_set(&q->nr_active_requests_shared_tags, 0);
-
 	timer_setup(&q->timeout, blk_rq_timed_out_timer, 0);
 	INIT_WORK(&q->timeout_work, blk_timeout_work);
 	INIT_LIST_HEAD(&q->icq_list);
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 047ec887456b..8b85a7f8e987 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -468,11 +468,31 @@ static int hctx_sched_tags_bitmap_show(void *data, struct seq_file *m)
 	return 0;
 }
 
+struct count_active_params {
+	struct blk_mq_hw_ctx	*hctx;
+	int			*active;
+};
+
+static bool hctx_count_active(struct request *rq, void *data)
+{
+	const struct count_active_params *params = data;
+
+	if (rq->mq_hctx == params->hctx)
+		(*params->active)++;
+
+	return true;
+}
+
 static int hctx_active_show(void *data, struct seq_file *m)
 {
 	struct blk_mq_hw_ctx *hctx = data;
+	int active = 0;
+	struct count_active_params params = { .hctx = hctx, .active = &active };
+
+	blk_mq_all_tag_iter(hctx->sched_tags ?: hctx->tags, hctx_count_active,
+			    &params);
 
-	seq_printf(m, "%d\n", __blk_mq_active_requests(hctx));
+	seq_printf(m, "%d\n", active);
 	return 0;
 }
 
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 33946cdb5716..bfd27cc6249b 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -109,10 +109,6 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
 			    struct sbitmap_queue *bt)
 {
-	if (!data->q->elevator && !(data->flags & BLK_MQ_REQ_RESERVED) &&
-			!hctx_may_queue(data->hctx, bt))
-		return BLK_MQ_NO_TAG;
-
 	if (data->shallow_depth)
 		return sbitmap_queue_get_shallow(bt, data->shallow_depth);
 	else
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4c5c16cce4f8..bbac59a06044 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -489,8 +489,6 @@ __blk_mq_alloc_requests_batch(struct blk_mq_alloc_data *data)
 		}
 	} while (data->nr_tags > nr);
 
-	if (!(data->rq_flags & RQF_SCHED_TAGS))
-		blk_mq_add_active_requests(data->hctx, nr);
 	/* caller already holds a reference, add for remainder */
 	percpu_ref_get_many(&data->q->q_usage_counter, nr - 1);
 	data->nr_tags -= nr;
@@ -587,8 +585,6 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
 		goto retry;
 	}
 
-	if (!(data->rq_flags & RQF_SCHED_TAGS))
-		blk_mq_inc_active_requests(data->hctx);
 	rq = blk_mq_rq_ctx_init(data, blk_mq_tags_from_data(data), tag);
 	blk_mq_rq_time_init(rq, alloc_time_ns);
 	return rq;
@@ -763,8 +759,6 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	tag = blk_mq_get_tag(&data);
 	if (tag == BLK_MQ_NO_TAG)
 		goto out_queue_exit;
-	if (!(data.rq_flags & RQF_SCHED_TAGS))
-		blk_mq_inc_active_requests(data.hctx);
 	rq = blk_mq_rq_ctx_init(&data, blk_mq_tags_from_data(&data), tag);
 	blk_mq_rq_time_init(rq, alloc_time_ns);
 	rq->__data_len = 0;
@@ -807,10 +801,8 @@ static void __blk_mq_free_request(struct request *rq)
 	blk_pm_mark_last_busy(rq);
 	rq->mq_hctx = NULL;
 
-	if (rq->tag != BLK_MQ_NO_TAG) {
-		blk_mq_dec_active_requests(hctx);
+	if (rq->tag != BLK_MQ_NO_TAG)
 		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
-	}
 	if (sched_tag != BLK_MQ_NO_TAG)
 		blk_mq_put_tag(hctx->sched_tags, ctx, sched_tag);
 	blk_mq_sched_restart(hctx);
@@ -1188,8 +1180,6 @@ static inline void blk_mq_flush_tag_batch(struct blk_mq_hw_ctx *hctx,
 {
 	struct request_queue *q = hctx->queue;
 
-	blk_mq_sub_active_requests(hctx, nr_tags);
-
 	blk_mq_put_tags(hctx->tags, tag_array, nr_tags);
 	percpu_ref_put_many(&q->q_usage_counter, nr_tags);
 }
@@ -1875,9 +1865,6 @@ bool __blk_mq_alloc_driver_tag(struct request *rq)
 	if (blk_mq_tag_is_reserved(rq->mq_hctx->sched_tags, rq->internal_tag)) {
 		bt = &rq->mq_hctx->tags->breserved_tags;
 		tag_offset = 0;
-	} else {
-		if (!hctx_may_queue(rq->mq_hctx, bt))
-			return false;
 	}
 
 	tag = __sbitmap_queue_get(bt);
@@ -1885,7 +1872,6 @@ bool __blk_mq_alloc_driver_tag(struct request *rq)
 		return false;
 
 	rq->tag = tag + tag_offset;
-	blk_mq_inc_active_requests(rq->mq_hctx);
 	return true;
 }
 
@@ -4058,7 +4044,6 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
 	if (!zalloc_cpumask_var_node(&hctx->cpumask, gfp, node))
 		goto free_hctx;
 
-	atomic_set(&hctx->nr_active, 0);
 	if (node == NUMA_NO_NODE)
 		node = set->numa_node;
 	hctx->numa_node = node;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index aa15d31aaae9..8dfb67c55f5d 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -291,70 +291,9 @@ static inline int blk_mq_get_rq_budget_token(struct request *rq)
 	return -1;
 }
 
-static inline void __blk_mq_add_active_requests(struct blk_mq_hw_ctx *hctx,
-						int val)
-{
-	if (blk_mq_is_shared_tags(hctx->flags))
-		atomic_add(val, &hctx->queue->nr_active_requests_shared_tags);
-	else
-		atomic_add(val, &hctx->nr_active);
-}
-
-static inline void __blk_mq_inc_active_requests(struct blk_mq_hw_ctx *hctx)
-{
-	__blk_mq_add_active_requests(hctx, 1);
-}
-
-static inline void __blk_mq_sub_active_requests(struct blk_mq_hw_ctx *hctx,
-		int val)
-{
-	if (blk_mq_is_shared_tags(hctx->flags))
-		atomic_sub(val, &hctx->queue->nr_active_requests_shared_tags);
-	else
-		atomic_sub(val, &hctx->nr_active);
-}
-
-static inline void __blk_mq_dec_active_requests(struct blk_mq_hw_ctx *hctx)
-{
-	__blk_mq_sub_active_requests(hctx, 1);
-}
-
-static inline void blk_mq_add_active_requests(struct blk_mq_hw_ctx *hctx,
-					      int val)
-{
-	if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
-		__blk_mq_add_active_requests(hctx, val);
-}
-
-static inline void blk_mq_inc_active_requests(struct blk_mq_hw_ctx *hctx)
-{
-	if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
-		__blk_mq_inc_active_requests(hctx);
-}
-
-static inline void blk_mq_sub_active_requests(struct blk_mq_hw_ctx *hctx,
-					      int val)
-{
-	if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
-		__blk_mq_sub_active_requests(hctx, val);
-}
-
-static inline void blk_mq_dec_active_requests(struct blk_mq_hw_ctx *hctx)
-{
-	if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
-		__blk_mq_dec_active_requests(hctx);
-}
-
-static inline int __blk_mq_active_requests(struct blk_mq_hw_ctx *hctx)
-{
-	if (blk_mq_is_shared_tags(hctx->flags))
-		return atomic_read(&hctx->queue->nr_active_requests_shared_tags);
-	return atomic_read(&hctx->nr_active);
-}
 static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
 					   struct request *rq)
 {
-	blk_mq_dec_active_requests(hctx);
 	blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
 	rq->tag = BLK_MQ_NO_TAG;
 }
@@ -396,45 +335,6 @@ static inline void blk_mq_free_requests(struct list_head *list)
 	}
 }
 
-/*
- * For shared tag users, we track the number of currently active users
- * and attempt to provide a fair share of the tag depth for each of them.
- */
-static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
-				  struct sbitmap_queue *bt)
-{
-	unsigned int depth, users;
-
-	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
-		return true;
-
-	/*
-	 * Don't try dividing an ant
-	 */
-	if (bt->sb.depth == 1)
-		return true;
-
-	if (blk_mq_is_shared_tags(hctx->flags)) {
-		struct request_queue *q = hctx->queue;
-
-		if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
-			return true;
-	} else {
-		if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
-			return true;
-	}
-
-	users = READ_ONCE(hctx->tags->active_queues);
-	if (!users)
-		return true;
-
-	/*
-	 * Allow at least some tags
-	 */
-	depth = max((bt->sb.depth + users - 1) / users, 4U);
-	return __blk_mq_active_requests(hctx) < depth;
-}
-
 /* run the code block in @dispatch_ops with rcu/srcu read lock held */
 #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops)	\
 do {								\
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 18a2388ba581..ccbb07559402 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -432,12 +432,6 @@ struct blk_mq_hw_ctx {
 	/** @queue_num: Index of this hardware queue. */
 	unsigned int		queue_num;
 
-	/**
-	 * @nr_active: Number of active requests. Only used when a tag set is
-	 * shared across request queues.
-	 */
-	atomic_t		nr_active;
-
 	/** @cpuhp_online: List to store request if CPU is going to die */
 	struct hlist_node	cpuhp_online;
 	/** @cpuhp_dead: List to store request if some CPU die. */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 890128cdea1c..95525b1d7b74 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -567,8 +567,6 @@ struct request_queue {
 	struct timer_list	timeout;
 	struct work_struct	timeout_work;
 
-	atomic_t		nr_active_requests_shared_tags;
-
 	struct blk_mq_tags	*sched_shared_tags;
 
 	struct list_head	icq_list;
-- 
2.43.7
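
A note on the debugfs hunk above: with the atomic counters removed, the
active count is derived by walking the in-flight requests and filtering on
the owning hardware queue.  The pattern can be sketched in userspace as
below.  All model_* names are hypothetical stand-ins, and the array walk
only approximates what a tag iterator does.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct blk_mq_hw_ctx and struct request. */
struct model_hctx {
	int id;
};

struct model_request {
	const struct model_hctx *hctx;
};

typedef bool (*model_iter_fn)(struct model_request *rq, void *data);

/* Simplified stand-in for a tag iterator: visit each in-flight request. */
static void model_for_each_request(struct model_request *rqs, size_t n,
				   model_iter_fn fn, void *data)
{
	for (size_t i = 0; i < n; i++)
		if (!fn(&rqs[i], data))
			break;
}

struct model_count_params {
	const struct model_hctx *hctx;
	int *active;
};

/* Callback: count only requests that belong to the queue of interest. */
static bool model_count_active(struct model_request *rq, void *data)
{
	const struct model_count_params *params = data;

	if (rq->hctx == params->hctx)
		(*params->active)++;

	return true;	/* keep iterating */
}

static int model_active_requests(struct model_request *rqs, size_t n,
				 const struct model_hctx *hctx)
{
	int active = 0;
	struct model_count_params params = { .hctx = hctx, .active = &active };

	model_for_each_request(rqs, n, model_count_active, &params);
	return active;
}
```

The trade-off in this shape is that nothing is paid on the allocation and
free paths, while the reader pays a full walk each time the value is
requested.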



* [PATCH v3 2/4] scsi: host: allocate struct Scsi_Host on the NUMA node of the host adapter
From: Sumit Saxena @ 2026-06-09 12:18 UTC
  To: Martin K . Petersen, Jens Axboe
  Cc: James E . J . Bottomley, linux-scsi, linux-block, Adam Radford,
	Khalid Aziz, Adaptec OEM Raid Solutions, Matthew Wilcox,
	Hannes Reinecke, Juergen E . Fischer, Russell King,
	linux-arm-kernel, Finn Thain, Michael Schmitz, Anil Gurumurthy,
	Sudarsana Kalluru, Oliver Neukum, Ali Akcaagac, Jamie Lenehan,
	Ram Vegesna, target-devel, Bradley Grove, Satish Kharat,
	Sesidhar Baddela, Karan Tilak Kumar, Yihang Li, Don Brace,
	storagedev, HighPoint Linux Team, Tyrel Datwyler,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, linuxppc-dev, Brian King, Lee Duncan,
	Chris Leech, Mike Christie, open-iscsi, Justin Tee, Paul Ely,
	Kashyap Desai, Shivasharan S, Chandrakanth Patil,
	megaraidlinux.pdl, Sathya Prakash Veerichetty, Sreekanth Reddy,
	mpi3mr-linuxdrv.pdl, Suganath Prabu Subramani, Ranjan Kumar,
	MPT-FusionLinux.pdl, Daniel Palmer, GOTO Masanori, YOKOTA Hiroshi,
	Jack Wang, Geoff Levand, Michael Reed, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Narsimhulu Musini, K . Y . Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, linux-hyperv,
	Michael S . Tsirkin, Jason Wang, Paolo Bonzini, Stefan Hajnoczi,
	Eugenio Perez, virtualization, Vishal Bhakta,
	bcm-kernel-feedback-list, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, xen-devel, Sumit Saxena, John Garry
In-Reply-To: <20260609121806.2121755-1-sumit.saxena@broadcom.com>

scsi_host_alloc() currently uses kzalloc(), which expresses no NUMA node
preference, so struct Scsi_Host can end up on a node remote from the HBA.
Extend the function to accept a 'struct device *dev' parameter and use
kzalloc_node() with dev_to_node(dev) so that the Scsi_Host struct lands on
the same NUMA node as the HBA, mirroring the treatment already applied
to struct scsi_device, struct scsi_target, and shost_data.

When dev is NULL (legacy ISA/platform drivers without a dma_dev) the
allocation falls back to NUMA_NO_NODE, preserving existing behaviour.
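
The node selection described above might be modelled as follows.  This is
a userspace sketch written from the description alone, not a copy of the
hosts.c change; every model_* name is hypothetical, and MODEL_NUMA_NO_NODE
mirrors the kernel's NUMA_NO_NODE value of -1.

```c
#include <stddef.h>

/* Mirrors the kernel's NUMA_NO_NODE (-1); the name is a stand-in. */
#define MODEL_NUMA_NO_NODE (-1)

/* Hypothetical stand-in for struct device, reduced to its NUMA node. */
struct model_device {
	int numa_node;
};

/* Stand-in for dev_to_node(): report the node the device is attached to. */
static int model_dev_to_node(const struct model_device *dev)
{
	return dev->numa_node;
}

/*
 * Pick the node for the host allocation: the adapter's node when a
 * device is supplied, otherwise no preference.
 */
static int model_host_alloc_node(const struct model_device *dev)
{
	return dev ? model_dev_to_node(dev) : MODEL_NUMA_NO_NODE;
}
```

A caller passing NULL therefore gets the same no-preference behaviour as
before the change, which is why the legacy drivers need no further work.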

Update all in-tree callers:
  - PCI-based HBA drivers pass &pdev->dev (or the equivalent struct
    member such as &phba->pcidev->dev, &h->pdev->dev, &ha->pdev->dev)
    so their host struct is placed on the adapter's node.
  - Non-PCI drivers (ISA, Amiga, ARM PCMCIA, virtio, Hyper-V, PS3, …)
    pass NULL.
  - libfc's libfc_host_alloc() inline helper passes NULL; FC drivers
    that want NUMA awareness can open-code the call with their pdev.

Suggested-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com>
---
 drivers/scsi/3w-9xxx.c                    | 2 +-
 drivers/scsi/3w-sas.c                     | 2 +-
 drivers/scsi/3w-xxxx.c                    | 2 +-
 drivers/scsi/53c700.c                     | 2 +-
 drivers/scsi/BusLogic.c                   | 2 +-
 drivers/scsi/a100u2w.c                    | 2 +-
 drivers/scsi/a2091.c                      | 2 +-
 drivers/scsi/a3000.c                      | 2 +-
 drivers/scsi/aacraid/linit.c              | 2 +-
 drivers/scsi/advansys.c                   | 6 +++---
 drivers/scsi/aha152x.c                    | 2 +-
 drivers/scsi/aha1542.c                    | 2 +-
 drivers/scsi/aha1740.c                    | 2 +-
 drivers/scsi/aic7xxx/aic79xx_osm.c        | 2 +-
 drivers/scsi/aic7xxx/aic7xxx_osm.c        | 2 +-
 drivers/scsi/aic94xx/aic94xx_init.c       | 2 +-
 drivers/scsi/am53c974.c                   | 2 +-
 drivers/scsi/arcmsr/arcmsr_hba.c          | 3 ++-
 drivers/scsi/arm/acornscsi.c              | 2 +-
 drivers/scsi/arm/arxescsi.c               | 2 +-
 drivers/scsi/arm/cumana_1.c               | 2 +-
 drivers/scsi/arm/cumana_2.c               | 2 +-
 drivers/scsi/arm/eesox.c                  | 2 +-
 drivers/scsi/arm/oak.c                    | 2 +-
 drivers/scsi/arm/powertec.c               | 2 +-
 drivers/scsi/atari_scsi.c                 | 2 +-
 drivers/scsi/atp870u.c                    | 2 +-
 drivers/scsi/bfa/bfad_im.c                | 2 +-
 drivers/scsi/csiostor/csio_init.c         | 4 ++--
 drivers/scsi/dc395x.c                     | 2 +-
 drivers/scsi/dmx3191d.c                   | 2 +-
 drivers/scsi/elx/efct/efct_xport.c        | 4 ++--
 drivers/scsi/esas2r/esas2r_main.c         | 2 +-
 drivers/scsi/fdomain.c                    | 2 +-
 drivers/scsi/fnic/fnic_main.c             | 2 +-
 drivers/scsi/g_NCR5380.c                  | 2 +-
 drivers/scsi/gvp11.c                      | 2 +-
 drivers/scsi/hisi_sas/hisi_sas_main.c     | 2 +-
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c    | 2 +-
 drivers/scsi/hosts.c                      | 6 ++++--
 drivers/scsi/hpsa.c                       | 2 +-
 drivers/scsi/hptiop.c                     | 2 +-
 drivers/scsi/ibmvscsi/ibmvfc.c            | 2 +-
 drivers/scsi/ibmvscsi/ibmvscsi.c          | 2 +-
 drivers/scsi/imm.c                        | 2 +-
 drivers/scsi/initio.c                     | 2 +-
 drivers/scsi/ipr.c                        | 2 +-
 drivers/scsi/ips.c                        | 2 +-
 drivers/scsi/isci/init.c                  | 2 +-
 drivers/scsi/jazz_esp.c                   | 2 +-
 drivers/scsi/libiscsi.c                   | 2 +-
 drivers/scsi/lpfc/lpfc_init.c             | 2 +-
 drivers/scsi/mac53c94.c                   | 2 +-
 drivers/scsi/mac_esp.c                    | 2 +-
 drivers/scsi/mac_scsi.c                   | 2 +-
 drivers/scsi/megaraid.c                   | 2 +-
 drivers/scsi/megaraid/megaraid_mbox.c     | 2 +-
 drivers/scsi/megaraid/megaraid_sas_base.c | 2 +-
 drivers/scsi/mesh.c                       | 2 +-
 drivers/scsi/mpi3mr/mpi3mr_os.c           | 2 +-
 drivers/scsi/mpt3sas/mpt3sas_scsih.c      | 4 ++--
 drivers/scsi/mvme147.c                    | 2 +-
 drivers/scsi/mvsas/mv_init.c              | 2 +-
 drivers/scsi/mvumi.c                      | 2 +-
 drivers/scsi/myrb.c                       | 2 +-
 drivers/scsi/myrs.c                       | 2 +-
 drivers/scsi/ncr53c8xx.c                  | 2 +-
 drivers/scsi/nsp32.c                      | 2 +-
 drivers/scsi/pcmcia/nsp_cs.c              | 2 +-
 drivers/scsi/pcmcia/qlogic_stub.c         | 2 +-
 drivers/scsi/pcmcia/sym53c500_cs.c        | 2 +-
 drivers/scsi/pm8001/pm8001_init.c         | 2 +-
 drivers/scsi/pmcraid.c                    | 2 +-
 drivers/scsi/ppa.c                        | 2 +-
 drivers/scsi/ps3rom.c                     | 2 +-
 drivers/scsi/qla1280.c                    | 2 +-
 drivers/scsi/qla2xxx/qla_mid.c            | 2 +-
 drivers/scsi/qla2xxx/qla_os.c             | 2 +-
 drivers/scsi/qlogicfas.c                  | 2 +-
 drivers/scsi/qlogicpti.c                  | 2 +-
 drivers/scsi/scsi_debug.c                 | 2 +-
 drivers/scsi/sgiwd93.c                    | 2 +-
 drivers/scsi/smartpqi/smartpqi_init.c     | 2 +-
 drivers/scsi/snic/snic_main.c             | 2 +-
 drivers/scsi/stex.c                       | 2 +-
 drivers/scsi/storvsc_drv.c                | 2 +-
 drivers/scsi/sun3_scsi.c                  | 2 +-
 drivers/scsi/sun3x_esp.c                  | 2 +-
 drivers/scsi/sun_esp.c                    | 2 +-
 drivers/scsi/sym53c8xx_2/sym_glue.c       | 2 +-
 drivers/scsi/virtio_scsi.c                | 2 +-
 drivers/scsi/vmw_pvscsi.c                 | 2 +-
 drivers/scsi/wd719x.c                     | 2 +-
 drivers/scsi/xen-scsifront.c              | 2 +-
 drivers/scsi/zorro_esp.c                  | 2 +-
 include/scsi/libfc.h                      | 2 +-
 include/scsi/scsi_host.h                  | 3 ++-
 97 files changed, 107 insertions(+), 103 deletions(-)

diff --git a/drivers/scsi/3w-9xxx.c b/drivers/scsi/3w-9xxx.c
index 9b93a2440af8..444578ee8070 100644
--- a/drivers/scsi/3w-9xxx.c
+++ b/drivers/scsi/3w-9xxx.c
@@ -2021,7 +2021,7 @@ static int twa_probe(struct pci_dev *pdev, const struct pci_device_id *dev_id)
 		goto out_disable_device;
 	}
 
-	host = scsi_host_alloc(&driver_template, sizeof(TW_Device_Extension));
+	host = scsi_host_alloc(&driver_template, sizeof(TW_Device_Extension), &pdev->dev);
 	if (!host) {
 		TW_PRINTK(host, TW_DRIVER, 0x24, "Failed to allocate memory for device extension");
 		retval = -ENOMEM;
diff --git a/drivers/scsi/3w-sas.c b/drivers/scsi/3w-sas.c
index 52dc1aa639f7..d063d39faf4f 100644
--- a/drivers/scsi/3w-sas.c
+++ b/drivers/scsi/3w-sas.c
@@ -1576,7 +1576,7 @@ static int twl_probe(struct pci_dev *pdev, const struct pci_device_id *dev_id)
 		goto out_disable_device;
 	}
 
-	host = scsi_host_alloc(&driver_template, sizeof(TW_Device_Extension));
+	host = scsi_host_alloc(&driver_template, sizeof(TW_Device_Extension), &pdev->dev);
 	if (!host) {
 		TW_PRINTK(host, TW_DRIVER, 0x19, "Failed to allocate memory for device extension");
 		retval = -ENOMEM;
diff --git a/drivers/scsi/3w-xxxx.c b/drivers/scsi/3w-xxxx.c
index c68678fa72c1..0ccb5f1f8805 100644
--- a/drivers/scsi/3w-xxxx.c
+++ b/drivers/scsi/3w-xxxx.c
@@ -2268,7 +2268,7 @@ static int tw_probe(struct pci_dev *pdev, const struct pci_device_id *dev_id)
 		goto out_disable_device;
 	}
 
-	host = scsi_host_alloc(&driver_template, sizeof(TW_Device_Extension));
+	host = scsi_host_alloc(&driver_template, sizeof(TW_Device_Extension), &pdev->dev);
 	if (!host) {
 		printk(KERN_WARNING "3w-xxxx: Failed to allocate memory for device extension.");
 		retval = -ENOMEM;
diff --git a/drivers/scsi/53c700.c b/drivers/scsi/53c700.c
index c78f74b8f45c..e30d55ab5dea 100644
--- a/drivers/scsi/53c700.c
+++ b/drivers/scsi/53c700.c
@@ -341,7 +341,7 @@ NCR_700_detect(struct scsi_host_template *tpnt,
 	if(tpnt->proc_name == NULL)
 		tpnt->proc_name = "53c700";
 
-	host = scsi_host_alloc(tpnt, 4);
+	host = scsi_host_alloc(tpnt, 4, NULL);
 	if (!host)
 		return NULL;
 	memset(hostdata->slots, 0, sizeof(struct NCR_700_command_slot)
diff --git a/drivers/scsi/BusLogic.c b/drivers/scsi/BusLogic.c
index 5304d2febd63..f865fdec4136 100644
--- a/drivers/scsi/BusLogic.c
+++ b/drivers/scsi/BusLogic.c
@@ -2302,7 +2302,7 @@ static int __init blogic_init(void)
 		 */
 
 		host = scsi_host_alloc(&blogic_template,
-				sizeof(struct blogic_adapter));
+				sizeof(struct blogic_adapter), NULL);
 		if (host == NULL) {
 			release_region(myadapter->io_addr,
 					myadapter->addr_count);
diff --git a/drivers/scsi/a100u2w.c b/drivers/scsi/a100u2w.c
index 4365b896f5c4..9124c6103902 100644
--- a/drivers/scsi/a100u2w.c
+++ b/drivers/scsi/a100u2w.c
@@ -1106,7 +1106,7 @@ static int inia100_probe_one(struct pci_dev *pdev,
 	bios = inw(port + 0x50);
 
 
-	shost = scsi_host_alloc(&inia100_template, sizeof(struct orc_host));
+	shost = scsi_host_alloc(&inia100_template, sizeof(struct orc_host), &pdev->dev);
 	if (!shost)
 		goto out_release_region;
 
diff --git a/drivers/scsi/a2091.c b/drivers/scsi/a2091.c
index 204448bfd04b..51effb2edefb 100644
--- a/drivers/scsi/a2091.c
+++ b/drivers/scsi/a2091.c
@@ -214,7 +214,7 @@ static int a2091_probe(struct zorro_dev *z, const struct zorro_device_id *ent)
 		return -EBUSY;
 
 	instance = scsi_host_alloc(&a2091_scsi_template,
-				   sizeof(struct a2091_hostdata));
+				   sizeof(struct a2091_hostdata), NULL);
 	if (!instance) {
 		error = -ENOMEM;
 		goto fail_alloc;
diff --git a/drivers/scsi/a3000.c b/drivers/scsi/a3000.c
index bf054dd7682b..5b3d25b8ad37 100644
--- a/drivers/scsi/a3000.c
+++ b/drivers/scsi/a3000.c
@@ -235,7 +235,7 @@ static int __init amiga_a3000_scsi_probe(struct platform_device *pdev)
 		return -EBUSY;
 
 	instance = scsi_host_alloc(&amiga_a3000_scsi_template,
-				   sizeof(struct a3000_hostdata));
+				   sizeof(struct a3000_hostdata), NULL);
 	if (!instance) {
 		error = -ENOMEM;
 		goto fail_alloc;
diff --git a/drivers/scsi/aacraid/linit.c b/drivers/scsi/aacraid/linit.c
index 2fa8f7ddb703..d003667007f7 100644
--- a/drivers/scsi/aacraid/linit.c
+++ b/drivers/scsi/aacraid/linit.c
@@ -1636,7 +1636,7 @@ static int aac_probe_one(struct pci_dev *pdev, const struct pci_device_id *id)
 
 	pci_set_master(pdev);
 
-	shost = scsi_host_alloc(&aac_driver_template, sizeof(struct aac_dev));
+	shost = scsi_host_alloc(&aac_driver_template, sizeof(struct aac_dev), &pdev->dev);
 	if (!shost) {
 		error = -ENOMEM;
 		goto out_disable_pdev;
diff --git a/drivers/scsi/advansys.c b/drivers/scsi/advansys.c
index 5cdbf2bdb13d..e7ef433778a1 100644
--- a/drivers/scsi/advansys.c
+++ b/drivers/scsi/advansys.c
@@ -11237,7 +11237,7 @@ static int advansys_vlb_probe(struct device *dev, unsigned int id)
 		goto release_region;
 
 	err = -ENOMEM;
-	shost = scsi_host_alloc(&advansys_template, sizeof(*board));
+	shost = scsi_host_alloc(&advansys_template, sizeof(*board), NULL);
 	if (!shost)
 		goto release_region;
 
@@ -11345,7 +11345,7 @@ static int advansys_eisa_probe(struct device *dev)
 			irq = advansys_eisa_irq_no(edev);
 
 		err = -ENOMEM;
-		shost = scsi_host_alloc(&advansys_template, sizeof(*board));
+		shost = scsi_host_alloc(&advansys_template, sizeof(*board), NULL);
 		if (!shost)
 			goto release_region;
 
@@ -11462,7 +11462,7 @@ static int advansys_pci_probe(struct pci_dev *pdev,
 	ioport = pci_resource_start(pdev, 0);
 
 	err = -ENOMEM;
-	shost = scsi_host_alloc(&advansys_template, sizeof(*board));
+	shost = scsi_host_alloc(&advansys_template, sizeof(*board), &pdev->dev);
 	if (!shost)
 		goto release_region;
 
diff --git a/drivers/scsi/aha152x.c b/drivers/scsi/aha152x.c
index e3ccb6bb62c0..d82ce80de098 100644
--- a/drivers/scsi/aha152x.c
+++ b/drivers/scsi/aha152x.c
@@ -734,7 +734,7 @@ struct Scsi_Host *aha152x_probe_one(struct aha152x_setup *setup)
 {
 	struct Scsi_Host *shpnt;
 
-	shpnt = scsi_host_alloc(&aha152x_driver_template, sizeof(struct aha152x_hostdata));
+	shpnt = scsi_host_alloc(&aha152x_driver_template, sizeof(struct aha152x_hostdata), NULL);
 	if (!shpnt) {
 		printk(KERN_ERR "aha152x: scsi_host_alloc failed\n");
 		return NULL;
diff --git a/drivers/scsi/aha1542.c b/drivers/scsi/aha1542.c
index fd766282d4a4..1a109c850785 100644
--- a/drivers/scsi/aha1542.c
+++ b/drivers/scsi/aha1542.c
@@ -752,7 +752,7 @@ static struct Scsi_Host *aha1542_hw_init(const struct scsi_host_template *tpnt,
 	if (!request_region(base_io, AHA1542_REGION_SIZE, "aha1542"))
 		return NULL;
 
-	sh = scsi_host_alloc(tpnt, sizeof(struct aha1542_hostdata));
+	sh = scsi_host_alloc(tpnt, sizeof(struct aha1542_hostdata), NULL);
 	if (!sh)
 		goto release;
 	aha1542 = shost_priv(sh);
diff --git a/drivers/scsi/aha1740.c b/drivers/scsi/aha1740.c
index c435769359f2..31a52edf0748 100644
--- a/drivers/scsi/aha1740.c
+++ b/drivers/scsi/aha1740.c
@@ -583,7 +583,7 @@ static int aha1740_probe (struct device *dev)
 	printk(KERN_INFO "aha174x: Extended translation %sabled.\n",
 	       translation ? "en" : "dis");
 	shpnt = scsi_host_alloc(&aha1740_template,
-			      sizeof(struct aha1740_hostdata));
+			      sizeof(struct aha1740_hostdata), NULL);
 	if(shpnt == NULL)
 		goto err_release_region;
 
diff --git a/drivers/scsi/aic7xxx/aic79xx_osm.c b/drivers/scsi/aic7xxx/aic79xx_osm.c
index feb1707feb7e..76e30b0784b9 100644
--- a/drivers/scsi/aic7xxx/aic79xx_osm.c
+++ b/drivers/scsi/aic7xxx/aic79xx_osm.c
@@ -1214,7 +1214,7 @@ ahd_linux_register_host(struct ahd_softc *ahd, struct scsi_host_template *templa
 	int	retval;
 
 	template->name = ahd->description;
-	host = scsi_host_alloc(template, sizeof(struct ahd_softc *));
+	host = scsi_host_alloc(template, sizeof(struct ahd_softc *), NULL);
 	if (host == NULL)
 		return (ENOMEM);
 
diff --git a/drivers/scsi/aic7xxx/aic7xxx_osm.c b/drivers/scsi/aic7xxx/aic7xxx_osm.c
index d93b522695eb..0169509abd76 100644
--- a/drivers/scsi/aic7xxx/aic7xxx_osm.c
+++ b/drivers/scsi/aic7xxx/aic7xxx_osm.c
@@ -1083,7 +1083,7 @@ ahc_linux_register_host(struct ahc_softc *ahc, struct scsi_host_template *templa
 	int	retval;
 
 	template->name = ahc->description;
-	host = scsi_host_alloc(template, sizeof(struct ahc_softc *));
+	host = scsi_host_alloc(template, sizeof(struct ahc_softc *), NULL);
 	if (host == NULL)
 		return -ENOMEM;
 
diff --git a/drivers/scsi/aic94xx/aic94xx_init.c b/drivers/scsi/aic94xx/aic94xx_init.c
index 4400a3661d90..1336e5e38f8d 100644
--- a/drivers/scsi/aic94xx/aic94xx_init.c
+++ b/drivers/scsi/aic94xx/aic94xx_init.c
@@ -704,7 +704,7 @@ static int asd_pci_probe(struct pci_dev *dev, const struct pci_device_id *id)
 
 	err = -ENOMEM;
 
-	shost = scsi_host_alloc(&aic94xx_sht, sizeof(void *));
+	shost = scsi_host_alloc(&aic94xx_sht, sizeof(void *), &dev->dev);
 	if (!shost)
 		goto Err;
 
diff --git a/drivers/scsi/am53c974.c b/drivers/scsi/am53c974.c
index f972a3c90a2f..4ca73e801232 100644
--- a/drivers/scsi/am53c974.c
+++ b/drivers/scsi/am53c974.c
@@ -388,7 +388,7 @@ static int pci_esp_probe_one(struct pci_dev *pdev,
 		goto fail_disable_device;
 	}
 
-	shost = scsi_host_alloc(hostt, sizeof(struct esp));
+	shost = scsi_host_alloc(hostt, sizeof(struct esp), &pdev->dev);
 	if (!shost) {
 		dev_printk(KERN_INFO, &pdev->dev,
 			   "failed to allocate scsi host\n");
diff --git a/drivers/scsi/arcmsr/arcmsr_hba.c b/drivers/scsi/arcmsr/arcmsr_hba.c
index 8aa948f06cac..f0cc59e756dc 100644
--- a/drivers/scsi/arcmsr/arcmsr_hba.c
+++ b/drivers/scsi/arcmsr/arcmsr_hba.c
@@ -1087,7 +1087,8 @@ static int arcmsr_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if(error){
 		return -ENODEV;
 	}
-	host = scsi_host_alloc(&arcmsr_scsi_host_template, sizeof(struct AdapterControlBlock));
+	host = scsi_host_alloc(&arcmsr_scsi_host_template,
+			       sizeof(struct AdapterControlBlock), &pdev->dev);
 	if(!host){
     		goto pci_disable_dev;
 	}
diff --git a/drivers/scsi/arm/acornscsi.c b/drivers/scsi/arm/acornscsi.c
index 79d7d7336b6a..97e3db7e6a7c 100644
--- a/drivers/scsi/arm/acornscsi.c
+++ b/drivers/scsi/arm/acornscsi.c
@@ -2806,7 +2806,7 @@ static int acornscsi_probe(struct expansion_card *ec, const struct ecard_id *id)
 	if (ret)
 		goto out;
 
-	host = scsi_host_alloc(&acornscsi_template, sizeof(AS_Host));
+	host = scsi_host_alloc(&acornscsi_template, sizeof(AS_Host), NULL);
 	if (!host) {
 		ret = -ENOMEM;
 		goto out_release;
diff --git a/drivers/scsi/arm/arxescsi.c b/drivers/scsi/arm/arxescsi.c
index 925d0bd68aa5..32f0a3aefb44 100644
--- a/drivers/scsi/arm/arxescsi.c
+++ b/drivers/scsi/arm/arxescsi.c
@@ -272,7 +272,7 @@ static int arxescsi_probe(struct expansion_card *ec, const struct ecard_id *id)
 		goto out_region;
 	}
 
-	host = scsi_host_alloc(&arxescsi_template, sizeof(struct arxescsi_info));
+	host = scsi_host_alloc(&arxescsi_template, sizeof(struct arxescsi_info), NULL);
 	if (!host) {
 		ret = -ENOMEM;
 		goto out_region;
diff --git a/drivers/scsi/arm/cumana_1.c b/drivers/scsi/arm/cumana_1.c
index d1a2a22ffe8c..d47ff9353c1b 100644
--- a/drivers/scsi/arm/cumana_1.c
+++ b/drivers/scsi/arm/cumana_1.c
@@ -238,7 +238,7 @@ static int cumanascsi1_probe(struct expansion_card *ec,
 	if (ret)
 		goto out;
 
-	host = scsi_host_alloc(&cumanascsi_template, sizeof(struct NCR5380_hostdata));
+	host = scsi_host_alloc(&cumanascsi_template, sizeof(struct NCR5380_hostdata), NULL);
 	if (!host) {
 		ret = -ENOMEM;
 		goto out_release;
diff --git a/drivers/scsi/arm/cumana_2.c b/drivers/scsi/arm/cumana_2.c
index e460068f6834..e35afe3a1fe4 100644
--- a/drivers/scsi/arm/cumana_2.c
+++ b/drivers/scsi/arm/cumana_2.c
@@ -394,7 +394,7 @@ static int cumanascsi2_probe(struct expansion_card *ec,
 	}
 
 	host = scsi_host_alloc(&cumanascsi2_template,
-			       sizeof(struct cumanascsi2_info));
+			       sizeof(struct cumanascsi2_info), NULL);
 	if (!host) {
 		ret = -ENOMEM;
 		goto out_region;
diff --git a/drivers/scsi/arm/eesox.c b/drivers/scsi/arm/eesox.c
index 99be9da8757f..de4d457f8ce7 100644
--- a/drivers/scsi/arm/eesox.c
+++ b/drivers/scsi/arm/eesox.c
@@ -510,7 +510,7 @@ static int eesoxscsi_probe(struct expansion_card *ec, const struct ecard_id *id)
 	}
 
 	host = scsi_host_alloc(&eesox_template,
-			       sizeof(struct eesoxscsi_info));
+			       sizeof(struct eesoxscsi_info), NULL);
 	if (!host) {
 		ret = -ENOMEM;
 		goto out_region;
diff --git a/drivers/scsi/arm/oak.c b/drivers/scsi/arm/oak.c
index d69245007096..b2ff8616f963 100644
--- a/drivers/scsi/arm/oak.c
+++ b/drivers/scsi/arm/oak.c
@@ -126,7 +126,7 @@ static int oakscsi_probe(struct expansion_card *ec, const struct ecard_id *id)
 	if (ret)
 		goto out;
 
-	host = scsi_host_alloc(&oakscsi_template, sizeof(struct NCR5380_hostdata));
+	host = scsi_host_alloc(&oakscsi_template, sizeof(struct NCR5380_hostdata), NULL);
 	if (!host) {
 		ret = -ENOMEM;
 		goto release;
diff --git a/drivers/scsi/arm/powertec.c b/drivers/scsi/arm/powertec.c
index 823c65ff6c12..045f35e50eff 100644
--- a/drivers/scsi/arm/powertec.c
+++ b/drivers/scsi/arm/powertec.c
@@ -318,7 +318,7 @@ static int powertecscsi_probe(struct expansion_card *ec,
 	}
 
 	host = scsi_host_alloc(&powertecscsi_template,
-			       sizeof (struct powertec_info));
+			       sizeof(struct powertec_info), NULL);
 	if (!host) {
 		ret = -ENOMEM;
 		goto out_region;
diff --git a/drivers/scsi/atari_scsi.c b/drivers/scsi/atari_scsi.c
index 85055677666c..9a469cf3991f 100644
--- a/drivers/scsi/atari_scsi.c
+++ b/drivers/scsi/atari_scsi.c
@@ -785,7 +785,7 @@ static int __init atari_scsi_probe(struct platform_device *pdev)
 	}
 
 	instance = scsi_host_alloc(&atari_scsi_template,
-	                           sizeof(struct NCR5380_hostdata));
+				   sizeof(struct NCR5380_hostdata), NULL);
 	if (!instance) {
 		error = -ENOMEM;
 		goto fail_alloc;
diff --git a/drivers/scsi/atp870u.c b/drivers/scsi/atp870u.c
index 67459d81f479..57f0b4a11ba7 100644
--- a/drivers/scsi/atp870u.c
+++ b/drivers/scsi/atp870u.c
@@ -1579,7 +1579,7 @@ static int atp870u_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	pci_set_master(pdev);
 
 	err = -ENOMEM;
-	shpnt = scsi_host_alloc(&atp870u_template, sizeof(struct atp_unit));
+	shpnt = scsi_host_alloc(&atp870u_template, sizeof(struct atp_unit), &pdev->dev);
 	if (!shpnt)
 		goto release_region;
 
diff --git a/drivers/scsi/bfa/bfad_im.c b/drivers/scsi/bfa/bfad_im.c
index 97990b285e17..bd14aee64886 100644
--- a/drivers/scsi/bfa/bfad_im.c
+++ b/drivers/scsi/bfa/bfad_im.c
@@ -740,7 +740,7 @@ bfad_scsi_host_alloc(struct bfad_im_port_s *im_port, struct bfad_s *bfad)
 
 	sht->sg_tablesize = bfad->cfg_data.io_max_sge;
 
-	return scsi_host_alloc(sht, sizeof(struct bfad_im_port_pointer));
+	return scsi_host_alloc(sht, sizeof(struct bfad_im_port_pointer), NULL);
 }
 
 void
diff --git a/drivers/scsi/csiostor/csio_init.c b/drivers/scsi/csiostor/csio_init.c
index 238431524801..a4bf1ba03248 100644
--- a/drivers/scsi/csiostor/csio_init.c
+++ b/drivers/scsi/csiostor/csio_init.c
@@ -606,11 +606,11 @@ csio_shost_init(struct csio_hw *hw, struct device *dev,
 	if (dev == &hw->pdev->dev)
 		shost = scsi_host_alloc(
 				&csio_fcoe_shost_template,
-				sizeof(struct csio_lnode));
+				sizeof(struct csio_lnode), &hw->pdev->dev);
 	else
 		shost = scsi_host_alloc(
 				&csio_fcoe_shost_vport_template,
-				sizeof(struct csio_lnode));
+				sizeof(struct csio_lnode), &hw->pdev->dev);
 
 	if (!shost)
 		goto err;
diff --git a/drivers/scsi/dc395x.c b/drivers/scsi/dc395x.c
index 6183ce05d8cf..16adeac93aac 100644
--- a/drivers/scsi/dc395x.c
+++ b/drivers/scsi/dc395x.c
@@ -3984,7 +3984,7 @@ static int dc395x_init_one(struct pci_dev *dev, const struct pci_device_id *id)
 
 	/* allocate scsi host information (includes out adapter) */
 	scsi_host = scsi_host_alloc(&dc395x_driver_template,
-				    sizeof(struct AdapterCtlBlk));
+				    sizeof(struct AdapterCtlBlk), &dev->dev);
 	if (!scsi_host)
 		goto fail;
 
diff --git a/drivers/scsi/dmx3191d.c b/drivers/scsi/dmx3191d.c
index d6d091b2f3c7..8ba17e3eefe3 100644
--- a/drivers/scsi/dmx3191d.c
+++ b/drivers/scsi/dmx3191d.c
@@ -74,7 +74,7 @@ static int dmx3191d_probe_one(struct pci_dev *pdev,
 	}
 
 	shost = scsi_host_alloc(&dmx3191d_driver_template,
-			sizeof(struct NCR5380_hostdata));
+			sizeof(struct NCR5380_hostdata), &pdev->dev);
 	if (!shost)
 		goto out_release_region;       
 
diff --git a/drivers/scsi/elx/efct/efct_xport.c b/drivers/scsi/elx/efct/efct_xport.c
index 9dcaef6fc188..74ef76e00eb5 100644
--- a/drivers/scsi/elx/efct/efct_xport.c
+++ b/drivers/scsi/elx/efct/efct_xport.c
@@ -378,7 +378,7 @@ efct_scsi_new_device(struct efct *efct)
 	int error = 0;
 	struct efct_vport *vport = NULL;
 
-	shost = scsi_host_alloc(&efct_template, sizeof(*vport));
+	shost = scsi_host_alloc(&efct_template, sizeof(*vport), NULL);
 	if (!shost) {
 		efc_log_err(efct, "failed to allocate Scsi_Host struct\n");
 		return -ENOMEM;
@@ -902,7 +902,7 @@ efct_scsi_new_vport(struct efct *efct, struct device *dev)
 	int error = 0;
 	struct efct_vport *vport = NULL;
 
-	shost = scsi_host_alloc(&efct_template, sizeof(*vport));
+	shost = scsi_host_alloc(&efct_template, sizeof(*vport), NULL);
 	if (!shost) {
 		efc_log_err(efct, "failed to allocate Scsi_Host struct\n");
 		return NULL;
diff --git a/drivers/scsi/esas2r/esas2r_main.c b/drivers/scsi/esas2r/esas2r_main.c
index ada278c24c51..4aac1f6db5e9 100644
--- a/drivers/scsi/esas2r/esas2r_main.c
+++ b/drivers/scsi/esas2r/esas2r_main.c
@@ -382,7 +382,7 @@ static int esas2r_probe(struct pci_dev *pcid,
 		       "after pci_enable_device() enable_cnt: %d",
 		       pcid->enable_cnt.counter);
 
-	host = scsi_host_alloc(&driver_template, host_alloc_size);
+	host = scsi_host_alloc(&driver_template, host_alloc_size, &pcid->dev);
 	if (host == NULL) {
 		esas2r_log(ESAS2R_LOG_CRIT, "scsi_host_alloc() FAIL");
 		return -ENODEV;
diff --git a/drivers/scsi/fdomain.c b/drivers/scsi/fdomain.c
index 22fbb0222f07..66ba4551def8 100644
--- a/drivers/scsi/fdomain.c
+++ b/drivers/scsi/fdomain.c
@@ -537,7 +537,7 @@ struct Scsi_Host *fdomain_create(int base, int irq, int this_id,
 		return NULL;
 	}
 
-	sh = scsi_host_alloc(&fdomain_template, sizeof(struct fdomain));
+	sh = scsi_host_alloc(&fdomain_template, sizeof(struct fdomain), NULL);
 	if (!sh)
 		return NULL;
 
diff --git a/drivers/scsi/fnic/fnic_main.c b/drivers/scsi/fnic/fnic_main.c
index 24d62c0874ac..688d85bc3f01 100644
--- a/drivers/scsi/fnic/fnic_main.c
+++ b/drivers/scsi/fnic/fnic_main.c
@@ -847,7 +847,7 @@ static int fnic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		{
 			host =
 				scsi_host_alloc(&fnic_host_template,
-								sizeof(struct fnic *));
+								sizeof(struct fnic *), &pdev->dev);
 			if (!host) {
 				dev_err(&fnic->pdev->dev, "Unable to allocate scsi host\n");
 				err = -ENOMEM;
diff --git a/drivers/scsi/g_NCR5380.c b/drivers/scsi/g_NCR5380.c
index 270eae7ac427..8b9076d6a964 100644
--- a/drivers/scsi/g_NCR5380.c
+++ b/drivers/scsi/g_NCR5380.c
@@ -312,7 +312,7 @@ static int generic_NCR5380_init_one(const struct scsi_host_template *tpnt,
 		goto out_release;
 	}
 
-	instance = scsi_host_alloc(tpnt, sizeof(struct NCR5380_hostdata));
+	instance = scsi_host_alloc(tpnt, sizeof(struct NCR5380_hostdata), NULL);
 	if (instance == NULL) {
 		ret = -ENOMEM;
 		goto out_unmap;
diff --git a/drivers/scsi/gvp11.c b/drivers/scsi/gvp11.c
index 0420bfe9bd42..ad5052db5a2e 100644
--- a/drivers/scsi/gvp11.c
+++ b/drivers/scsi/gvp11.c
@@ -353,7 +353,7 @@ static int gvp11_probe(struct zorro_dev *z, const struct zorro_device_id *ent)
 		goto fail_check_or_alloc;
 
 	instance = scsi_host_alloc(&gvp11_scsi_template,
-				   sizeof(struct gvp11_hostdata));
+				   sizeof(struct gvp11_hostdata), NULL);
 	if (!instance) {
 		error = -ENOMEM;
 		goto fail_check_or_alloc;
diff --git a/drivers/scsi/hisi_sas/hisi_sas_main.c b/drivers/scsi/hisi_sas/hisi_sas_main.c
index 944ce19ae2fc..5696da8da6c7 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_main.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_main.c
@@ -2483,7 +2483,7 @@ static struct Scsi_Host *hisi_sas_shost_alloc(struct platform_device *pdev,
 	struct device *dev = &pdev->dev;
 	int error;
 
-	shost = scsi_host_alloc(hw->sht, sizeof(*hisi_hba));
+	shost = scsi_host_alloc(hw->sht, sizeof(*hisi_hba), NULL);
 	if (!shost) {
 		dev_err(dev, "scsi host alloc failed\n");
 		return NULL;
diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
index c7430f7c4048..44e584496ed5 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
@@ -3469,7 +3469,7 @@ hisi_sas_shost_alloc_pci(struct pci_dev *pdev)
 	struct hisi_hba *hisi_hba;
 	struct device *dev = &pdev->dev;
 
-	shost = scsi_host_alloc(&sht_v3_hw, sizeof(*hisi_hba));
+	shost = scsi_host_alloc(&sht_v3_hw, sizeof(*hisi_hba), &pdev->dev);
 	if (!shost) {
 		dev_err(dev, "shost alloc failed\n");
 		return NULL;
diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index e047747d4ecf..e1f42be79729 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -403,12 +403,14 @@ static const struct device_type scsi_host_type = {
  * Return value:
  * 	Pointer to a new Scsi_Host
  **/
-struct Scsi_Host *scsi_host_alloc(const struct scsi_host_template *sht, int privsize)
+struct Scsi_Host *scsi_host_alloc(const struct scsi_host_template *sht, int privsize,
+				  struct device *dev)
 {
 	struct Scsi_Host *shost;
 	int index;
 
-	shost = kzalloc(sizeof(struct Scsi_Host) + privsize, GFP_KERNEL);
+	shost = kzalloc_node(sizeof(struct Scsi_Host) + privsize, GFP_KERNEL,
+			     dev ? dev_to_node(dev) : NUMA_NO_NODE);
 	if (!shost)
 		return NULL;
 
diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
index a1b116cd4723..b9f9f18bd985 100644
--- a/drivers/scsi/hpsa.c
+++ b/drivers/scsi/hpsa.c
@@ -5837,7 +5837,7 @@ static int hpsa_scsi_host_alloc(struct ctlr_info *h)
 {
 	struct Scsi_Host *sh;
 
-	sh = scsi_host_alloc(&hpsa_driver_template, sizeof(struct ctlr_info *));
+	sh = scsi_host_alloc(&hpsa_driver_template, sizeof(struct ctlr_info *), &h->pdev->dev);
 	if (sh == NULL) {
 		dev_err(&h->pdev->dev, "scsi_host_alloc failed\n");
 		return -ENOMEM;
diff --git a/drivers/scsi/hptiop.c b/drivers/scsi/hptiop.c
index 7083c14c5302..7d79357be265 100644
--- a/drivers/scsi/hptiop.c
+++ b/drivers/scsi/hptiop.c
@@ -1311,7 +1311,7 @@ static int hptiop_probe(struct pci_dev *pcidev, const struct pci_device_id *id)
 		goto disable_pci_device;
 	}
 
-	host = scsi_host_alloc(&driver_template, sizeof(struct hptiop_hba));
+	host = scsi_host_alloc(&driver_template, sizeof(struct hptiop_hba), &pcidev->dev);
 	if (!host) {
 		printk(KERN_ERR "hptiop: fail to alloc scsi host\n");
 		goto free_pci_regions;
diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index 3dd2adda195e..b11d564a21d9 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -6325,7 +6325,7 @@ static int ibmvfc_probe(struct vio_dev *vdev, const struct vio_device_id *id)
 	unsigned int max_scsi_queues = min((unsigned int)IBMVFC_MAX_SCSI_QUEUES, online_cpus);
 
 	ENTER;
-	shost = scsi_host_alloc(&driver_template, sizeof(*vhost));
+	shost = scsi_host_alloc(&driver_template, sizeof(*vhost), NULL);
 	if (!shost) {
 		dev_err(dev, "Couldn't allocate host data\n");
 		goto out;
diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c b/drivers/scsi/ibmvscsi/ibmvscsi.c
index 609bda730b3a..e8342e581246 100644
--- a/drivers/scsi/ibmvscsi/ibmvscsi.c
+++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
@@ -2235,7 +2235,7 @@ static int ibmvscsi_probe(struct vio_dev *vdev, const struct vio_device_id *id)
 
 	dev_set_drvdata(&vdev->dev, NULL);
 
-	host = scsi_host_alloc(&driver_template, sizeof(*hostdata));
+	host = scsi_host_alloc(&driver_template, sizeof(*hostdata), NULL);
 	if (!host) {
 		dev_err(&vdev->dev, "couldn't allocate host data\n");
 		goto scsi_host_alloc_failed;
diff --git a/drivers/scsi/imm.c b/drivers/scsi/imm.c
index 0535252e77e3..a6131f87fcaf 100644
--- a/drivers/scsi/imm.c
+++ b/drivers/scsi/imm.c
@@ -1221,7 +1221,7 @@ static int __imm_attach(struct parport *pb)
 	INIT_DELAYED_WORK(&dev->imm_tq, imm_interrupt);
 
 	err = -ENOMEM;
-	host = scsi_host_alloc(&imm_template, sizeof(imm_struct *));
+	host = scsi_host_alloc(&imm_template, sizeof(imm_struct *), NULL);
 	if (!host)
 		goto out1;
 	host->io_port = pb->base;
diff --git a/drivers/scsi/initio.c b/drivers/scsi/initio.c
index 06fbe85dccfa..294f7f8d5dbb 100644
--- a/drivers/scsi/initio.c
+++ b/drivers/scsi/initio.c
@@ -2824,7 +2824,7 @@ static int initio_probe_one(struct pci_dev *pdev,
 		error = -ENODEV;
 		goto out_disable_device;
 	}
-	shost = scsi_host_alloc(&initio_template, sizeof(struct initio_host));
+	shost = scsi_host_alloc(&initio_template, sizeof(struct initio_host), &pdev->dev);
 	if (!shost) {
 		printk(KERN_WARNING "initio: Could not allocate host structure.\n");
 		error = -ENOMEM;
diff --git a/drivers/scsi/ipr.c b/drivers/scsi/ipr.c
index d207e5e81afe..85608804ff39 100644
--- a/drivers/scsi/ipr.c
+++ b/drivers/scsi/ipr.c
@@ -9379,7 +9379,7 @@ static int ipr_probe_ioa(struct pci_dev *pdev,
 	ENTER;
 
 	dev_info(&pdev->dev, "Found IOA with IRQ: %d\n", pdev->irq);
-	host = scsi_host_alloc(&driver_template, sizeof(*ioa_cfg));
+	host = scsi_host_alloc(&driver_template, sizeof(*ioa_cfg), &pdev->dev);
 
 	if (!host) {
 		dev_err(&pdev->dev, "call to scsi_host_alloc failed!\n");
diff --git a/drivers/scsi/ips.c b/drivers/scsi/ips.c
index 41ed73966a48..709a2a799f3e 100644
--- a/drivers/scsi/ips.c
+++ b/drivers/scsi/ips.c
@@ -6638,7 +6638,7 @@ ips_register_scsi(int index)
 {
 	struct Scsi_Host *sh;
 	ips_ha_t *ha, *oldha = ips_ha[index];
-	sh = scsi_host_alloc(&ips_driver_template, sizeof (ips_ha_t));
+	sh = scsi_host_alloc(&ips_driver_template, sizeof(ips_ha_t), &oldha->pcidev->dev);
 	if (!sh) {
 		IPS_PRINTK(KERN_WARNING, oldha->pcidev,
 			   "Unable to register controller with SCSI subsystem\n");
diff --git a/drivers/scsi/isci/init.c b/drivers/scsi/isci/init.c
index acf0c2038d20..7da06ace20ad 100644
--- a/drivers/scsi/isci/init.c
+++ b/drivers/scsi/isci/init.c
@@ -538,7 +538,7 @@ static struct isci_host *isci_host_alloc(struct pci_dev *pdev, int id)
 		INIT_LIST_HEAD(&idev->node);
 	}
 
-	shost = scsi_host_alloc(&isci_sht, sizeof(void *));
+	shost = scsi_host_alloc(&isci_sht, sizeof(void *), &pdev->dev);
 	if (!shost)
 		return NULL;
 
diff --git a/drivers/scsi/jazz_esp.c b/drivers/scsi/jazz_esp.c
index 35137f5cfb3a..1817246e4cc6 100644
--- a/drivers/scsi/jazz_esp.c
+++ b/drivers/scsi/jazz_esp.c
@@ -110,7 +110,7 @@ static int esp_jazz_probe(struct platform_device *dev)
 	struct resource *res;
 	int err;
 
-	host = scsi_host_alloc(tpnt, sizeof(struct esp));
+	host = scsi_host_alloc(tpnt, sizeof(struct esp), NULL);
 
 	err = -ENOMEM;
 	if (!host)
diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
index 160f02f2f51d..458955dfc0aa 100644
--- a/drivers/scsi/libiscsi.c
+++ b/drivers/scsi/libiscsi.c
@@ -2903,7 +2903,7 @@ struct Scsi_Host *iscsi_host_alloc(const struct scsi_host_template *sht,
 	struct Scsi_Host *shost;
 	struct iscsi_host *ihost;
 
-	shost = scsi_host_alloc(sht, sizeof(struct iscsi_host) + dd_data_size);
+	shost = scsi_host_alloc(sht, sizeof(struct iscsi_host) + dd_data_size, NULL);
 	if (!shost)
 		return NULL;
 	ihost = shost_priv(shost);
diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c
index 82af59c913e9..25264866075f 100644
--- a/drivers/scsi/lpfc/lpfc_init.c
+++ b/drivers/scsi/lpfc/lpfc_init.c
@@ -4745,7 +4745,7 @@ lpfc_create_port(struct lpfc_hba *phba, int instance, struct device *dev)
 		template->sg_tablesize = lpfc_get_sg_tablesize(phba);
 	}
 
-	shost = scsi_host_alloc(template, sizeof(struct lpfc_vport));
+	shost = scsi_host_alloc(template, sizeof(struct lpfc_vport), &phba->pcidev->dev);
 	if (!shost)
 		goto out;
 
diff --git a/drivers/scsi/mac53c94.c b/drivers/scsi/mac53c94.c
index de2bd860b9d7..737e5f2fef6f 100644
--- a/drivers/scsi/mac53c94.c
+++ b/drivers/scsi/mac53c94.c
@@ -426,7 +426,7 @@ static int mac53c94_probe(struct macio_dev *mdev, const struct of_device_id *mat
 		return -EBUSY;
 	}
 
-       	host = scsi_host_alloc(&mac53c94_template, sizeof(struct fsc_state));
+	host = scsi_host_alloc(&mac53c94_template, sizeof(struct fsc_state), NULL);
 	if (host == NULL) {
 		printk(KERN_ERR "mac53c94: couldn't register host");
 		rc = -ENOMEM;
diff --git a/drivers/scsi/mac_esp.c b/drivers/scsi/mac_esp.c
index a0ceaa2428c2..c8652bfdb3b8 100644
--- a/drivers/scsi/mac_esp.c
+++ b/drivers/scsi/mac_esp.c
@@ -301,7 +301,7 @@ static int esp_mac_probe(struct platform_device *dev)
 	if (dev->id > 1)
 		return -ENODEV;
 
-	host = scsi_host_alloc(tpnt, sizeof(struct esp));
+	host = scsi_host_alloc(tpnt, sizeof(struct esp), NULL);
 
 	err = -ENOMEM;
 	if (!host)
diff --git a/drivers/scsi/mac_scsi.c b/drivers/scsi/mac_scsi.c
index a86bd839d08e..eeb00ee30aaa 100644
--- a/drivers/scsi/mac_scsi.c
+++ b/drivers/scsi/mac_scsi.c
@@ -474,7 +474,7 @@ static int __init mac_scsi_probe(struct platform_device *pdev)
 		mac_scsi_template.sg_tablesize = 1;
 
 	instance = scsi_host_alloc(&mac_scsi_template,
-	                           sizeof(struct NCR5380_hostdata));
+				   sizeof(struct NCR5380_hostdata), NULL);
 	if (!instance)
 		return -ENOMEM;
 
diff --git a/drivers/scsi/megaraid.c b/drivers/scsi/megaraid.c
index 9476a0d2c72d..701e54843193 100644
--- a/drivers/scsi/megaraid.c
+++ b/drivers/scsi/megaraid.c
@@ -4203,7 +4203,7 @@ megaraid_probe_one(struct pci_dev *pdev, const struct pci_device_id *id)
 	}
 
 	/* Initialize SCSI Host structure */
-	host = scsi_host_alloc(&megaraid_template, sizeof(adapter_t));
+	host = scsi_host_alloc(&megaraid_template, sizeof(adapter_t), &pdev->dev);
 	if (!host)
 		goto out_iounmap;
 
diff --git a/drivers/scsi/megaraid/megaraid_mbox.c b/drivers/scsi/megaraid/megaraid_mbox.c
index ce89032a5a74..17b015b3d35f 100644
--- a/drivers/scsi/megaraid/megaraid_mbox.c
+++ b/drivers/scsi/megaraid/megaraid_mbox.c
@@ -620,7 +620,7 @@ megaraid_io_attach(adapter_t *adapter)
 	struct Scsi_Host	*host;
 
 	// Initialize SCSI Host structure
-	host = scsi_host_alloc(&megaraid_template_g, 8);
+	host = scsi_host_alloc(&megaraid_template_g, 8, &adapter->pdev->dev);
 	if (!host) {
 		con_log(CL_ANN, (KERN_WARNING
 			"megaraid mbox: scsi_host_alloc failed\n"));
diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index ecd365d78ae3..bae1070371d5 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -7512,7 +7512,7 @@ static int megasas_probe_one(struct pci_dev *pdev,
 	pci_set_master(pdev);
 
 	host = scsi_host_alloc(&megasas_template,
-			       sizeof(struct megasas_instance));
+			       sizeof(struct megasas_instance), &pdev->dev);
 
 	if (!host) {
 		dev_printk(KERN_DEBUG, &pdev->dev, "scsi_host_alloc failed\n");
diff --git a/drivers/scsi/mesh.c b/drivers/scsi/mesh.c
index dc1402b321da..a4ba6bc49d23 100644
--- a/drivers/scsi/mesh.c
+++ b/drivers/scsi/mesh.c
@@ -1877,7 +1877,7 @@ static int mesh_probe(struct macio_dev *mdev, const struct of_device_id *match)
        		printk(KERN_ERR "mesh: unable to request memory resources");
 		return -EBUSY;
 	}
-       	mesh_host = scsi_host_alloc(&mesh_template, sizeof(struct mesh_state));
+	mesh_host = scsi_host_alloc(&mesh_template, sizeof(struct mesh_state), NULL);
 	if (mesh_host == NULL) {
 		printk(KERN_ERR "mesh: couldn't register host");
 		goto out_release;
diff --git a/drivers/scsi/mpi3mr/mpi3mr_os.c b/drivers/scsi/mpi3mr/mpi3mr_os.c
index 402d1f35d214..c74e2addc77d 100644
--- a/drivers/scsi/mpi3mr/mpi3mr_os.c
+++ b/drivers/scsi/mpi3mr/mpi3mr_os.c
@@ -5468,7 +5468,7 @@ mpi3mr_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	}
 
 	shost = scsi_host_alloc(&mpi3mr_driver_template,
-	    sizeof(struct mpi3mr_ioc));
+	    sizeof(struct mpi3mr_ioc), &pdev->dev);
 	if (!shost) {
 		retval = -ENODEV;
 		goto shost_failed;
diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index 6ff788557294..06c8df6261d4 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -13367,7 +13367,7 @@ _scsih_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 			PCIE_LINK_STATE_L1 | PCIE_LINK_STATE_CLKPM);
 		/* Use mpt2sas driver host template for SAS 2.0 HBA's */
 		shost = scsi_host_alloc(&mpt2sas_driver_template,
-		  sizeof(struct MPT3SAS_ADAPTER));
+		  sizeof(struct MPT3SAS_ADAPTER), &pdev->dev);
 		if (!shost)
 			return -ENODEV;
 		ioc = shost_priv(shost);
@@ -13399,7 +13399,7 @@ _scsih_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	case MPI26_VERSION:
 		/* Use mpt3sas driver host template for SAS 3.0 HBA's */
 		shost = scsi_host_alloc(&mpt3sas_driver_template,
-		  sizeof(struct MPT3SAS_ADAPTER));
+		  sizeof(struct MPT3SAS_ADAPTER), &pdev->dev);
 		if (!shost)
 			return -ENODEV;
 		ioc = shost_priv(shost);
diff --git a/drivers/scsi/mvme147.c b/drivers/scsi/mvme147.c
index 98b99c0f5bc7..4d61e25de2cf 100644
--- a/drivers/scsi/mvme147.c
+++ b/drivers/scsi/mvme147.c
@@ -97,7 +97,7 @@ static int __init mvme147_init(void)
 		return 0;
 
 	mvme147_shost = scsi_host_alloc(&mvme147_host_template,
-			sizeof(struct WD33C93_hostdata));
+			sizeof(struct WD33C93_hostdata), NULL);
 	if (!mvme147_shost)
 		goto err_out;
 	mvme147_shost->base = 0xfffe4000;
diff --git a/drivers/scsi/mvsas/mv_init.c b/drivers/scsi/mvsas/mv_init.c
index 5abc17a2e261..fd90b5eec0b4 100644
--- a/drivers/scsi/mvsas/mv_init.c
+++ b/drivers/scsi/mvsas/mv_init.c
@@ -494,7 +494,7 @@ static int mvs_pci_init(struct pci_dev *pdev, const struct pci_device_id *ent)
 	if (rc)
 		goto err_out_regions;
 
-	shost = scsi_host_alloc(&mvs_sht, sizeof(void *));
+	shost = scsi_host_alloc(&mvs_sht, sizeof(void *), &pdev->dev);
 	if (!shost) {
 		rc = -ENOMEM;
 		goto err_out_regions;
diff --git a/drivers/scsi/mvumi.c b/drivers/scsi/mvumi.c
index e70d336b4ab3..d12b33a32a09 100644
--- a/drivers/scsi/mvumi.c
+++ b/drivers/scsi/mvumi.c
@@ -2468,7 +2468,7 @@ static int mvumi_probe_one(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (ret)
 		goto fail_set_dma_mask;
 
-	host = scsi_host_alloc(&mvumi_template, sizeof(*mhba));
+	host = scsi_host_alloc(&mvumi_template, sizeof(*mhba), &pdev->dev);
 	if (!host) {
 		dev_err(&pdev->dev, "scsi_host_alloc failed\n");
 		ret = -ENOMEM;
diff --git a/drivers/scsi/myrb.c b/drivers/scsi/myrb.c
index 3678b66310ed..f28c29b41cf6 100644
--- a/drivers/scsi/myrb.c
+++ b/drivers/scsi/myrb.c
@@ -3401,7 +3401,7 @@ static struct myrb_hba *myrb_detect(struct pci_dev *pdev,
 	struct Scsi_Host *shost;
 	struct myrb_hba *cb = NULL;
 
-	shost = scsi_host_alloc(&myrb_template, sizeof(struct myrb_hba));
+	shost = scsi_host_alloc(&myrb_template, sizeof(struct myrb_hba), &pdev->dev);
 	if (!shost) {
 		dev_err(&pdev->dev, "Unable to allocate Controller\n");
 		return NULL;
diff --git a/drivers/scsi/myrs.c b/drivers/scsi/myrs.c
index afd68225221a..a8ce488e6520 100644
--- a/drivers/scsi/myrs.c
+++ b/drivers/scsi/myrs.c
@@ -1937,7 +1937,7 @@ static struct myrs_hba *myrs_alloc_host(struct pci_dev *pdev,
 	struct Scsi_Host *shost;
 	struct myrs_hba *cs;
 
-	shost = scsi_host_alloc(&myrs_template, sizeof(struct myrs_hba));
+	shost = scsi_host_alloc(&myrs_template, sizeof(struct myrs_hba), &pdev->dev);
 	if (!shost)
 		return NULL;
 
diff --git a/drivers/scsi/ncr53c8xx.c b/drivers/scsi/ncr53c8xx.c
index 5369ca3fe4fd..009d4c55054e 100644
--- a/drivers/scsi/ncr53c8xx.c
+++ b/drivers/scsi/ncr53c8xx.c
@@ -8108,7 +8108,7 @@ struct Scsi_Host * __init ncr_attach(struct scsi_host_template *tpnt,
 	printk(KERN_INFO "ncr53c720-%d: rev 0x%x irq %d\n",
 		unit, device->chip.revision_id, device->slot.irq);
 
-	instance = scsi_host_alloc(tpnt, sizeof(*host_data));
+	instance = scsi_host_alloc(tpnt, sizeof(*host_data), NULL);
 	if (!instance)
 	        goto attach_error;
 	host_data = (struct host_data *) instance->hostdata;
diff --git a/drivers/scsi/nsp32.c b/drivers/scsi/nsp32.c
index e893d5677241..681e1d554657 100644
--- a/drivers/scsi/nsp32.c
+++ b/drivers/scsi/nsp32.c
@@ -2556,7 +2556,7 @@ static int nsp32_detect(struct pci_dev *pdev)
 	/*
 	 * register this HBA as SCSI device
 	 */
-	host = scsi_host_alloc(&nsp32_template, sizeof(nsp32_hw_data));
+	host = scsi_host_alloc(&nsp32_template, sizeof(nsp32_hw_data), &pdev->dev);
 	if (host == NULL) {
 		nsp32_msg (KERN_ERR, "failed to scsi register");
 		goto err;
diff --git a/drivers/scsi/pcmcia/nsp_cs.c b/drivers/scsi/pcmcia/nsp_cs.c
index ae70fda96ae9..32ca7872b7f8 100644
--- a/drivers/scsi/pcmcia/nsp_cs.c
+++ b/drivers/scsi/pcmcia/nsp_cs.c
@@ -1326,7 +1326,7 @@ static struct Scsi_Host *nsp_detect(struct scsi_host_template *sht)
 	nsp_hw_data *data_b = &nsp_data_base, *data;
 
 	nsp_dbg(NSP_DEBUG_INIT, "this_id=%d", sht->this_id);
-	host = scsi_host_alloc(&nsp_driver_template, sizeof(nsp_hw_data));
+	host = scsi_host_alloc(&nsp_driver_template, sizeof(nsp_hw_data), NULL);
 	if (host == NULL) {
 		nsp_dbg(NSP_DEBUG_INIT, "host failed");
 		return NULL;
diff --git a/drivers/scsi/pcmcia/qlogic_stub.c b/drivers/scsi/pcmcia/qlogic_stub.c
index 5d8a434d3f66..b417b39ab723 100644
--- a/drivers/scsi/pcmcia/qlogic_stub.c
+++ b/drivers/scsi/pcmcia/qlogic_stub.c
@@ -106,7 +106,7 @@ static struct Scsi_Host *qlogic_detect(struct scsi_host_template *host,
 	qlogicfas408_setup(qbase, qinitid, INT_TYPE);
 
 	host->name = qlogic_name;
-	shost = scsi_host_alloc(host, sizeof(struct qlogicfas408_priv));
+	shost = scsi_host_alloc(host, sizeof(struct qlogicfas408_priv), NULL);
 	if (!shost)
 		goto err;
 	shost->io_port = qbase;
diff --git a/drivers/scsi/pcmcia/sym53c500_cs.c b/drivers/scsi/pcmcia/sym53c500_cs.c
index 1530c1ad5d36..83aab6c69a62 100644
--- a/drivers/scsi/pcmcia/sym53c500_cs.c
+++ b/drivers/scsi/pcmcia/sym53c500_cs.c
@@ -752,7 +752,7 @@ SYM53C500_config(struct pcmcia_device *link)
 
 	chip_init(port_base);
 
-	host = scsi_host_alloc(tpnt, sizeof(struct sym53c500_data));
+	host = scsi_host_alloc(tpnt, sizeof(struct sym53c500_data), NULL);
 	if (!host) {
 		printk("SYM53C500: Unable to register host, giving up.\n");
 		goto err_release;
diff --git a/drivers/scsi/pm8001/pm8001_init.c b/drivers/scsi/pm8001/pm8001_init.c
index e93ea76b565e..873810c6853c 100644
--- a/drivers/scsi/pm8001/pm8001_init.c
+++ b/drivers/scsi/pm8001/pm8001_init.c
@@ -1142,7 +1142,7 @@ static int pm8001_pci_probe(struct pci_dev *pdev,
 	if (rc)
 		goto err_out_regions;
 
-	shost = scsi_host_alloc(&pm8001_sht, sizeof(void *));
+	shost = scsi_host_alloc(&pm8001_sht, sizeof(void *), &pdev->dev);
 	if (!shost) {
 		rc = -ENOMEM;
 		goto err_out_regions;
diff --git a/drivers/scsi/pmcraid.c b/drivers/scsi/pmcraid.c
index 942a99393204..a26c747806ef 100644
--- a/drivers/scsi/pmcraid.c
+++ b/drivers/scsi/pmcraid.c
@@ -5236,7 +5236,7 @@ static int pmcraid_probe(struct pci_dev *pdev,
 	}
 
 	host = scsi_host_alloc(&pmcraid_host_template,
-				sizeof(struct pmcraid_instance));
+				sizeof(struct pmcraid_instance), &pdev->dev);
 
 	if (!host) {
 		dev_err(&pdev->dev, "scsi_host_alloc failed!\n");
diff --git a/drivers/scsi/ppa.c b/drivers/scsi/ppa.c
index 8a4e910d5758..40fe9c6acc3b 100644
--- a/drivers/scsi/ppa.c
+++ b/drivers/scsi/ppa.c
@@ -1101,7 +1101,7 @@ static int __ppa_attach(struct parport *pb)
 	INIT_DELAYED_WORK(&dev->ppa_tq, ppa_interrupt);
 
 	err = -ENOMEM;
-	host = scsi_host_alloc(&ppa_template, sizeof(ppa_struct *));
+	host = scsi_host_alloc(&ppa_template, sizeof(ppa_struct *), NULL);
 	if (!host)
 		goto out1;
 	host->io_port = pb->base;
diff --git a/drivers/scsi/ps3rom.c b/drivers/scsi/ps3rom.c
index a9c727d22931..3542a35b137e 100644
--- a/drivers/scsi/ps3rom.c
+++ b/drivers/scsi/ps3rom.c
@@ -361,7 +361,7 @@ static int ps3rom_probe(struct ps3_system_bus_device *_dev)
 		goto fail_free_bounce;
 
 	host = scsi_host_alloc(&ps3rom_host_template,
-			       sizeof(struct ps3rom_private));
+			       sizeof(struct ps3rom_private), NULL);
 	if (!host) {
 		dev_err(&dev->sbd.core, "%s:%u: scsi_host_alloc failed\n",
 			__func__, __LINE__);
diff --git a/drivers/scsi/qla1280.c b/drivers/scsi/qla1280.c
index cdd6fe002c32..f88f2e659baa 100644
--- a/drivers/scsi/qla1280.c
+++ b/drivers/scsi/qla1280.c
@@ -4142,7 +4142,7 @@ qla1280_probe_one(struct pci_dev *pdev, const struct pci_device_id *id)
 	pci_set_master(pdev);
 
 	error = -ENOMEM;
-	host = scsi_host_alloc(&qla1280_driver_template, sizeof(*ha));
+	host = scsi_host_alloc(&qla1280_driver_template, sizeof(*ha), &pdev->dev);
 	if (!host) {
 		printk(KERN_WARNING
 		       "qla1280: Failed to register host, aborting.\n");
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 72b1c28e4dae..ce0d097f3317 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -5046,7 +5046,7 @@ struct scsi_qla_host *qla2x00_create_host(const struct scsi_host_template *sht,
 	struct Scsi_Host *host;
 	struct scsi_qla_host *vha = NULL;
 
-	host = scsi_host_alloc(sht, sizeof(scsi_qla_host_t));
+	host = scsi_host_alloc(sht, sizeof(scsi_qla_host_t), &ha->pdev->dev);
 	if (!host) {
 		ql_log_pci(ql_log_fatal, ha->pdev, 0x0107,
 		    "Failed to allocate host from the scsi layer, aborting.\n");
diff --git a/drivers/scsi/qlogicfas.c b/drivers/scsi/qlogicfas.c
index 8f05e3707d69..b9ead7dc371c 100644
--- a/drivers/scsi/qlogicfas.c
+++ b/drivers/scsi/qlogicfas.c
@@ -95,7 +95,7 @@ static struct Scsi_Host *__qlogicfas_detect(struct scsi_host_template *host,
 
 	qlogicfas408_setup(qbase, qinitid, INT_TYPE);
 
-	hreg = scsi_host_alloc(host, sizeof(struct qlogicfas408_priv));
+	hreg = scsi_host_alloc(host, sizeof(struct qlogicfas408_priv), NULL);
 	if (!hreg)
 		goto err_release_mem;
 	priv = get_priv_by_host(hreg);
diff --git a/drivers/scsi/qlogicpti.c b/drivers/scsi/qlogicpti.c
index ea0a2b5a0a42..f67a9b400100 100644
--- a/drivers/scsi/qlogicpti.c
+++ b/drivers/scsi/qlogicpti.c
@@ -1316,7 +1316,7 @@ static int qpti_sbus_probe(struct platform_device *op)
 	if (op->archdata.irqs[0] == 0)
 		return -ENODEV;
 
-	host = scsi_host_alloc(&qpti_template, sizeof(struct qlogicpti));
+	host = scsi_host_alloc(&qpti_template, sizeof(struct qlogicpti), NULL);
 	if (!host)
 		return -ENOMEM;
 
diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index bb6b0e7fb910..59488bf74ce0 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -9548,7 +9548,7 @@ static int sdebug_driver_probe(struct device *dev)
 
 	sdbg_host = dev_to_sdebug_host(dev);
 
-	hpnt = scsi_host_alloc(&sdebug_driver_template, 0);
+	hpnt = scsi_host_alloc(&sdebug_driver_template, 0, NULL);
 	if (NULL == hpnt) {
 		pr_err("scsi_host_alloc failed\n");
 		error = -ENODEV;
diff --git a/drivers/scsi/sgiwd93.c b/drivers/scsi/sgiwd93.c
index 6594661db5f4..07fbe6fda7c2 100644
--- a/drivers/scsi/sgiwd93.c
+++ b/drivers/scsi/sgiwd93.c
@@ -231,7 +231,7 @@ static int sgiwd93_probe(struct platform_device *pdev)
 	unsigned int irq = pd->irq;
 	int err;
 
-	host = scsi_host_alloc(&sgiwd93_template, sizeof(struct ip22_hostdata));
+	host = scsi_host_alloc(&sgiwd93_template, sizeof(struct ip22_hostdata), NULL);
 	if (!host) {
 		err = -ENOMEM;
 		goto out;
diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
index 65ff50982978..a3163c06b3f8 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -7619,7 +7619,7 @@ static int pqi_register_scsi(struct pqi_ctrl_info *ctrl_info)
 	int rc;
 	struct Scsi_Host *shost;
 
-	shost = scsi_host_alloc(&pqi_driver_template, sizeof(ctrl_info));
+	shost = scsi_host_alloc(&pqi_driver_template, sizeof(ctrl_info), &ctrl_info->pci_dev->dev);
 	if (!shost) {
 		dev_err(&ctrl_info->pci_dev->dev, "scsi_host_alloc failed\n");
 		return -ENOMEM;
diff --git a/drivers/scsi/snic/snic_main.c b/drivers/scsi/snic/snic_main.c
index 82953e6a0915..9edf6661e6f1 100644
--- a/drivers/scsi/snic/snic_main.c
+++ b/drivers/scsi/snic/snic_main.c
@@ -363,7 +363,7 @@ snic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	/*
 	 * Allocate SCSI Host and setup association between host, and snic
 	 */
-	shost = scsi_host_alloc(&snic_host_template, sizeof(struct snic));
+	shost = scsi_host_alloc(&snic_host_template, sizeof(struct snic), &pdev->dev);
 	if (!shost) {
 		SNIC_ERR("Unable to alloc scsi_host\n");
 		ret = -ENOMEM;
diff --git a/drivers/scsi/stex.c b/drivers/scsi/stex.c
index 6aeeb338633d..7d6b851fef24 100644
--- a/drivers/scsi/stex.c
+++ b/drivers/scsi/stex.c
@@ -1667,7 +1667,7 @@ static int stex_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	S6flag = 0;
 	register_reboot_notifier(&stex_notifier);
 
-	host = scsi_host_alloc(&driver_template, sizeof(struct st_hba));
+	host = scsi_host_alloc(&driver_template, sizeof(struct st_hba), &pdev->dev);
 
 	if (!host) {
 		printk(KERN_ERR DRV_NAME "(%s): scsi_host_alloc failed\n",
diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 571ea549152b..fc4c05127dc4 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1969,7 +1969,7 @@ static int storvsc_probe(struct hv_device *device,
 				(100 - ring_avail_percent_lowater) / 100;
 
 	host = scsi_host_alloc(&scsi_driver,
-			       sizeof(struct hv_host_device));
+			       sizeof(struct hv_host_device), NULL);
 	if (!host)
 		return -ENOMEM;
 
diff --git a/drivers/scsi/sun3_scsi.c b/drivers/scsi/sun3_scsi.c
index ca9cd691cc32..ed41b605328e 100644
--- a/drivers/scsi/sun3_scsi.c
+++ b/drivers/scsi/sun3_scsi.c
@@ -578,7 +578,7 @@ static int __init sun3_scsi_probe(struct platform_device *pdev)
 #endif
 
 	instance = scsi_host_alloc(&sun3_scsi_template,
-	                           sizeof(struct NCR5380_hostdata));
+				   sizeof(struct NCR5380_hostdata), NULL);
 	if (!instance) {
 		error = -ENOMEM;
 		goto fail_alloc;
diff --git a/drivers/scsi/sun3x_esp.c b/drivers/scsi/sun3x_esp.c
index 365406885b8e..f7e48f4c5444 100644
--- a/drivers/scsi/sun3x_esp.c
+++ b/drivers/scsi/sun3x_esp.c
@@ -175,7 +175,7 @@ static int esp_sun3x_probe(struct platform_device *dev)
 	struct resource *res;
 	int err = -ENOMEM;
 
-	host = scsi_host_alloc(tpnt, sizeof(struct esp));
+	host = scsi_host_alloc(tpnt, sizeof(struct esp), NULL);
 	if (!host)
 		goto fail;
 
diff --git a/drivers/scsi/sun_esp.c b/drivers/scsi/sun_esp.c
index aa430501f0c7..bc4e4030acb6 100644
--- a/drivers/scsi/sun_esp.c
+++ b/drivers/scsi/sun_esp.c
@@ -457,7 +457,7 @@ static int esp_sbus_probe_one(struct platform_device *op,
 	struct esp *esp;
 	int err;
 
-	host = scsi_host_alloc(tpnt, sizeof(struct esp));
+	host = scsi_host_alloc(tpnt, sizeof(struct esp), NULL);
 
 	err = -ENOMEM;
 	if (!host)
diff --git a/drivers/scsi/sym53c8xx_2/sym_glue.c b/drivers/scsi/sym53c8xx_2/sym_glue.c
index 27e22acaf1a7..16e821c3b59e 100644
--- a/drivers/scsi/sym53c8xx_2/sym_glue.c
+++ b/drivers/scsi/sym53c8xx_2/sym_glue.c
@@ -1300,7 +1300,7 @@ static struct Scsi_Host *sym_attach(const struct scsi_host_template *tpnt, int u
 	if (!fw)
 		goto attach_failed;
 
-	shost = scsi_host_alloc(tpnt, sizeof(*sym_data));
+	shost = scsi_host_alloc(tpnt, sizeof(*sym_data), &pdev->dev);
 	if (!shost)
 		goto attach_failed;
 	sym_data = shost_priv(shost);
diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 5fdaa71f0652..88375574cb18 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -929,7 +929,7 @@ static int virtscsi_probe(struct virtio_device *vdev)
 	num_targets = virtscsi_config_get(vdev, max_target) + 1;
 
 	shost = scsi_host_alloc(&virtscsi_host_template,
-				struct_size(vscsi, req_vqs, num_queues));
+				struct_size(vscsi, req_vqs, num_queues), NULL);
 	if (!shost)
 		return -ENOMEM;
 
diff --git a/drivers/scsi/vmw_pvscsi.c b/drivers/scsi/vmw_pvscsi.c
index 151cac9f9c2a..32c39c66c49b 100644
--- a/drivers/scsi/vmw_pvscsi.c
+++ b/drivers/scsi/vmw_pvscsi.c
@@ -1435,7 +1435,7 @@ static int pvscsi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		PVSCSI_MAX_NUM_REQ_ENTRIES_PER_PAGE;
 	pvscsi_template.cmd_per_lun =
 		min(pvscsi_template.can_queue, pvscsi_cmd_per_lun);
-	host = scsi_host_alloc(&pvscsi_template, sizeof(struct pvscsi_adapter));
+	host = scsi_host_alloc(&pvscsi_template, sizeof(struct pvscsi_adapter), &pdev->dev);
 	if (!host) {
 		printk(KERN_ERR "vmw_pvscsi: failed to allocate host\n");
 		goto out_release_resources_and_disable;
diff --git a/drivers/scsi/wd719x.c b/drivers/scsi/wd719x.c
index 830d40f57f6a..0aa6bb093431 100644
--- a/drivers/scsi/wd719x.c
+++ b/drivers/scsi/wd719x.c
@@ -921,7 +921,7 @@ static int wd719x_pci_probe(struct pci_dev *pdev, const struct pci_device_id *d)
 		goto release_region;
 
 	err = -ENOMEM;
-	sh = scsi_host_alloc(&wd719x_template, sizeof(struct wd719x));
+	sh = scsi_host_alloc(&wd719x_template, sizeof(struct wd719x), &pdev->dev);
 	if (!sh)
 		goto release_region;
 
diff --git a/drivers/scsi/xen-scsifront.c b/drivers/scsi/xen-scsifront.c
index 989bcaee42ca..d4d57f33cc15 100644
--- a/drivers/scsi/xen-scsifront.c
+++ b/drivers/scsi/xen-scsifront.c
@@ -899,7 +899,7 @@ static int scsifront_probe(struct xenbus_device *dev,
 	int err = -ENOMEM;
 	char name[TASK_COMM_LEN];
 
-	host = scsi_host_alloc(&scsifront_sht, sizeof(*info));
+	host = scsi_host_alloc(&scsifront_sht, sizeof(*info), NULL);
 	if (!host) {
 		xenbus_dev_fatal(dev, err, "fail to allocate scsi host");
 		return err;
diff --git a/drivers/scsi/zorro_esp.c b/drivers/scsi/zorro_esp.c
index 1622285c9aec..5983015877a7 100644
--- a/drivers/scsi/zorro_esp.c
+++ b/drivers/scsi/zorro_esp.c
@@ -774,7 +774,7 @@ static int zorro_esp_probe(struct zorro_dev *z,
 		goto fail_free_zep;
 	}
 
-	host = scsi_host_alloc(tpnt, sizeof(struct esp));
+	host = scsi_host_alloc(tpnt, sizeof(struct esp), NULL);
 
 	if (!host) {
 		pr_err("No host detected; board configuration problem?\n");
diff --git a/include/scsi/libfc.h b/include/scsi/libfc.h
index be0ffe1e3395..17e545fa5c7e 100644
--- a/include/scsi/libfc.h
+++ b/include/scsi/libfc.h
@@ -883,7 +883,7 @@ libfc_host_alloc(const struct scsi_host_template *sht, int priv_size)
 	struct fc_lport *lport;
 	struct Scsi_Host *shost;
 
-	shost = scsi_host_alloc(sht, sizeof(*lport) + priv_size);
+	shost = scsi_host_alloc(sht, sizeof(*lport) + priv_size, NULL);
 	if (!shost)
 		return NULL;
 	lport = shost_priv(shost);
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 7e2011830ba4..09c82a41b7a1 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -796,7 +796,8 @@ static inline int scsi_host_in_recovery(struct Scsi_Host *shost)
 extern int scsi_queue_work(struct Scsi_Host *, struct work_struct *);
 extern void scsi_flush_work(struct Scsi_Host *);
 
-extern struct Scsi_Host *scsi_host_alloc(const struct scsi_host_template *, int);
+extern struct Scsi_Host *scsi_host_alloc(const struct scsi_host_template *sht,
+					 int privsize, struct device *dev);
 extern int __must_check scsi_add_host_with_dma(struct Scsi_Host *,
 					       struct device *,
 					       struct device *);
-- 
2.43.7


^ permalink raw reply related

* [PATCH v3 1/4] scsi: scan: allocate sdev and starget on the NUMA node of the host adapter
From: Sumit Saxena @ 2026-06-09 12:18 UTC (permalink / raw)
  To: Martin K . Petersen, Jens Axboe
  Cc: James E . J . Bottomley, linux-scsi, linux-block, Adam Radford,
	Khalid Aziz, Adaptec OEM Raid Solutions, Matthew Wilcox,
	Hannes Reinecke, Juergen E . Fischer, Russell King,
	linux-arm-kernel, Finn Thain, Michael Schmitz, Anil Gurumurthy,
	Sudarsana Kalluru, Oliver Neukum, Ali Akcaagac, Jamie Lenehan,
	Ram Vegesna, target-devel, Bradley Grove, Satish Kharat,
	Sesidhar Baddela, Karan Tilak Kumar, Yihang Li, Don Brace,
	storagedev, HighPoint Linux Team, Tyrel Datwyler,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, linuxppc-dev, Brian King, Lee Duncan,
	Chris Leech, Mike Christie, open-iscsi, Justin Tee, Paul Ely,
	Kashyap Desai, Shivasharan S, Chandrakanth Patil,
	megaraidlinux.pdl, Sathya Prakash Veerichetty, Sreekanth Reddy,
	mpi3mr-linuxdrv.pdl, Suganath Prabu Subramani, Ranjan Kumar,
	MPT-FusionLinux.pdl, Daniel Palmer, GOTO Masanori, YOKOTA Hiroshi,
	Jack Wang, Geoff Levand, Michael Reed, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Narsimhulu Musini, K . Y . Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, linux-hyperv,
	Michael S . Tsirkin, Jason Wang, Paolo Bonzini, Stefan Hajnoczi,
	Eugenio Perez, virtualization, Vishal Bhakta,
	bcm-kernel-feedback-list, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, xen-devel, James Rizzo, Sumit Saxena
In-Reply-To: <20260609121806.2121755-1-sumit.saxena@broadcom.com>

From: James Rizzo <james.rizzo@broadcom.com>

When a host adapter is attached to a specific NUMA node, allocating
scsi_device and scsi_target via kzalloc() may place them on a remote
node.  All hot-path I/O accesses to these structures then cross the NUMA
interconnect, adding latency and consuming inter-node bandwidth.

Use kzalloc_node() with dev_to_node(shost->dma_dev) so allocations land
on the same node as the HBA, reducing cross-node traffic and improving
I/O performance on NUMA systems.

Signed-off-by: James Rizzo <james.rizzo@broadcom.com>
Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com>
---
 drivers/scsi/scsi_scan.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index e27da038603a..121a14d5fdb8 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -34,6 +34,7 @@
 #include <linux/kthread.h>
 #include <linux/spinlock.h>
 #include <linux/async.h>
+#include <linux/topology.h>
 #include <linux/slab.h>
 #include <linux/unaligned.h>
 
@@ -287,8 +288,8 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
 	struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
 	struct queue_limits lim;
 
-	sdev = kzalloc(sizeof(*sdev) + shost->transportt->device_size,
-		       GFP_KERNEL);
+	sdev = kzalloc_node(sizeof(*sdev) + shost->transportt->device_size,
+			    GFP_KERNEL, dev_to_node(shost->dma_dev));
 	if (!sdev)
 		goto out;
 
@@ -502,7 +503,7 @@ static struct scsi_target *scsi_alloc_target(struct device *parent,
 	struct scsi_target *found_target;
 	int error, ref_got;
 
-	starget = kzalloc(size, GFP_KERNEL);
+	starget = kzalloc_node(size, GFP_KERNEL, dev_to_node(shost->dma_dev));
 	if (!starget) {
 		printk(KERN_ERR "%s: allocation failure\n", __func__);
 		return NULL;
-- 
2.43.7


^ permalink raw reply related

* [PATCH v3 0/4] scsi/block: NUMA-local scan allocations, shared-tag path cleanup, and SCSI I/O counters
From: Sumit Saxena @ 2026-06-09 12:17 UTC (permalink / raw)
  To: Martin K . Petersen, Jens Axboe
  Cc: James E . J . Bottomley, linux-scsi, linux-block, Adam Radford,
	Khalid Aziz, Adaptec OEM Raid Solutions, Matthew Wilcox,
	Hannes Reinecke, Juergen E . Fischer, Russell King,
	linux-arm-kernel, Finn Thain, Michael Schmitz, Anil Gurumurthy,
	Sudarsana Kalluru, Oliver Neukum, Ali Akcaagac, Jamie Lenehan,
	Ram Vegesna, target-devel, Bradley Grove, Satish Kharat,
	Sesidhar Baddela, Karan Tilak Kumar, Yihang Li, Don Brace,
	storagedev, HighPoint Linux Team, Tyrel Datwyler,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, linuxppc-dev, Brian King, Lee Duncan,
	Chris Leech, Mike Christie, open-iscsi, Justin Tee, Paul Ely,
	Kashyap Desai, Shivasharan S, Chandrakanth Patil,
	megaraidlinux.pdl, Sathya Prakash Veerichetty, Sreekanth Reddy,
	mpi3mr-linuxdrv.pdl, Suganath Prabu Subramani, Ranjan Kumar,
	MPT-FusionLinux.pdl, Daniel Palmer, GOTO Masanori, YOKOTA Hiroshi,
	Jack Wang, Geoff Levand, Michael Reed, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Narsimhulu Musini, K . Y . Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, linux-hyperv,
	Michael S . Tsirkin, Jason Wang, Paolo Bonzini, Stefan Hajnoczi,
	Eugenio Perez, virtualization, Vishal Bhakta,
	bcm-kernel-feedback-list, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, xen-devel, Sumit Saxena

This series contains three performance improvements targeting the SCSI
and block layers on multi-socket NUMA and heavily loaded SMP systems.

On multi-socket NUMA systems we observed extreme I/O throughput variance
of 50-60% between runs.  This series identifies and fixes two root causes:
cross-node memory accesses due to NUMA-unaware allocations in the scan
path, and false sharing between hot atomic counters in struct request_queue
and struct scsi_device.
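
As a rough user-space illustration of the false-sharing problem named above (not code from this series), the sketch below splits one hot counter into per-slot counters, each padded to its own cache line. The names sharded_counter, NR_SLOTS and CACHE_LINE are hypothetical, and the 64-byte line size is an assumption; in the kernel the equivalent facility is percpu_counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_SLOTS   8	/* hypothetical: one slot per CPU */
#define CACHE_LINE 64	/* assumed cache-line size in bytes */

/* Each slot is aligned to its own cache line so writers never share one. */
struct counter_slot {
	_Alignas(CACHE_LINE) atomic_long value;
};

struct sharded_counter {
	struct counter_slot slot[NR_SLOTS];
};

static void sharded_counter_init(struct sharded_counter *c)
{
	assert(c != NULL);
	for (size_t i = 0; i < NR_SLOTS; i++)
		atomic_init(&c->slot[i].value, 0);
}

/* Hot path: update only the slot that belongs to the calling CPU. */
static void sharded_counter_inc(struct sharded_counter *c, unsigned int cpu)
{
	atomic_fetch_add_explicit(&c->slot[cpu % NR_SLOTS].value, 1,
				  memory_order_relaxed);
}

/* Slow path: fold all slots together, e.g. when a statistic is read. */
static long sharded_counter_sum(struct sharded_counter *c)
{
	long sum = 0;

	for (size_t i = 0; i < NR_SLOTS; i++)
		sum += atomic_load_explicit(&c->slot[i].value,
					    memory_order_relaxed);
	return sum;
}
```

The trade-off is the usual one: increments stay CPU-local and cheap, while reading the total costs one pass over all slots and is only approximately ordered with respect to concurrent updates.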

Performance notes:

Tested on a dual-socket NUMA system (2x 32-core, 256 GB/socket) with
an mpi3mr HBA, running fio (random read, 4K, QD 64, 16 jobs, 60 s,
direct I/O).  IOPS figures are in KIOPS (thousands of IOPS):

  Configuration                    Avg KIOPS   Range (KIOPS)   Spread
  Baseline                         6,255       4,200 - 6,700   ~37%
  Baseline + all patches           7,350       7,000 - 7,700   ~10%

Key findings:

Combined, these patches reduce the observed 50-60% run-to-run variance
to under 10%, significantly improving workload predictability, and
improve IOPS by 16-18%.
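
For readers who want to check the table arithmetic, here is a small sketch. It assumes that "Spread" means (max - min) / max, which is a guess that happens to reproduce the ~37% baseline figure; the helper names are hypothetical. The 50-60% variance quoted in the text is not derived here.

```c
#include <assert.h>

/* Run-to-run spread as a percentage of the best run (assumed definition). */
static double spread_pct(double min_kiops, double max_kiops)
{
	assert(max_kiops > 0.0 && min_kiops <= max_kiops);
	return (max_kiops - min_kiops) / max_kiops * 100.0;
}

/* Relative change of the average, in percent. */
static double gain_pct(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}
```

With the table values, spread_pct(4200, 6700) is about 37.3, spread_pct(7000, 7700) is about 9.1, and gain_pct(6255, 7350) is about 17.5, which falls inside the quoted 16-18% range.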

No functional regressions observed.

Changes in v3
-------------
- Addressed feedback from Bart Van Assche and John Garry.
- Added a patch for NUMA-local allocation of struct Scsi_Host (shost).
- Converted the ioerr_cnt and iotmo_cnt atomic counters into per-CPU counters.

Changes in v2
-------------

  Patch 1 — Same functional goal as v1 patch 1: NUMA-local scsi_device /
  scsi_target allocations in the scan path so steady-state I/O does not
  habitually touch remote memory when the host has a fixed DMA/NUMA
  affinity.

  Patch 2 — Replaces v1’s ____cacheline_aligned_in_smp on
  nr_active_requests_shared_tags with removal of the shared-tag fairness
  throttling machinery (including hctx_may_queue(), blk_mq_hw_ctx.nr_active,
  and request_queue.nr_active_requests_shared_tags and their updates).
  This follows the earlier standalone proposal by Bart Van Assche [1],
  rebased for the current tree; it removes the high-frequency atomic
  accounting that motivated the v1 false-sharing workaround and, in our
  testing, improves IOPS by roughly 16–18% for the shared-tag workload
  exercised.

  Patch 3 — Replaces v1’s cache-line padding of iodone_cnt with
  percpu_counter for both iorequest_cnt and iodone_cnt, so submission and
  completion paths mostly update CPU-local state instead of bouncing a
  single cache line, without inflating struct scsi_device for SMP
  alignment.
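
To make the throttling that Patch 2 removes more concrete, here is a simplified user-space model of the fair-share check. It is written from memory of the mainline hctx_may_queue() logic (equal share of the tag depth per active queue, rounded up, with a floor of 4 tags), so the exact rounding and floor should be checked against the tree being patched. The function names and parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified model of the per-queue limit: each active queue gets an equal
 * share of the tag depth, rounded up, with a floor of 4 tags (assumed).
 */
static unsigned int fair_share_depth(unsigned int total_tags,
				     unsigned int active_queues)
{
	unsigned int share;

	assert(active_queues > 0);
	share = (total_tags + active_queues - 1) / active_queues;
	return share > 4 ? share : 4;
}

/* A queue may take another tag only while it is below its share. */
static bool may_queue(unsigned int total_tags, unsigned int active_queues,
		      unsigned int nr_active)
{
	if (active_queues == 0)
		return true;	/* no sharers, nothing to throttle */
	return nr_active < fair_share_depth(total_tags, active_queues);
}
```

In this model a queue sharing 1024 tags with 15 others is capped at 64 in-flight requests. Keeping nr_active accurate is what requires an atomic update on every tag allocation and release, which is the accounting cost the paragraph above refers to.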

Merge / review hints
--------------------

Patch 3 touches the block layer and should have block maintainer review;
the remaining patches are SCSI-oriented.  Please route or ack as your
subsystem workflow requires.

Bart Van Assche (1):
  block: drop shared-tag fairness throttling

James Rizzo (1):
  scsi: scan: allocate sdev and starget on the NUMA node of the host
    adapter

Sumit Saxena (2):
  scsi: host: allocate struct Scsi_Host on the NUMA node of the host
    adapter
  scsi: use percpu counters for iostat counters in struct scsi_device

 block/blk-core.c                          |   2 -
 block/blk-mq-debugfs.c                    |  22 ++++-
 block/blk-mq-tag.c                        |   4 -
 block/blk-mq.c                            |  17 +---
 block/blk-mq.h                            | 100 ----------------------
 drivers/scsi/3w-9xxx.c                    |   2 +-
 drivers/scsi/3w-sas.c                     |   2 +-
 drivers/scsi/3w-xxxx.c                    |   2 +-
 drivers/scsi/53c700.c                     |   2 +-
 drivers/scsi/BusLogic.c                   |   2 +-
 drivers/scsi/a100u2w.c                    |   2 +-
 drivers/scsi/a2091.c                      |   2 +-
 drivers/scsi/a3000.c                      |   2 +-
 drivers/scsi/aacraid/linit.c              |   2 +-
 drivers/scsi/advansys.c                   |   6 +-
 drivers/scsi/aha152x.c                    |   2 +-
 drivers/scsi/aha1542.c                    |   2 +-
 drivers/scsi/aha1740.c                    |   2 +-
 drivers/scsi/aic7xxx/aic79xx_osm.c        |   2 +-
 drivers/scsi/aic7xxx/aic7xxx_osm.c        |   2 +-
 drivers/scsi/aic94xx/aic94xx_init.c       |   2 +-
 drivers/scsi/am53c974.c                   |   2 +-
 drivers/scsi/arcmsr/arcmsr_hba.c          |   3 +-
 drivers/scsi/arm/acornscsi.c              |   2 +-
 drivers/scsi/arm/arxescsi.c               |   2 +-
 drivers/scsi/arm/cumana_1.c               |   2 +-
 drivers/scsi/arm/cumana_2.c               |   2 +-
 drivers/scsi/arm/eesox.c                  |   2 +-
 drivers/scsi/arm/oak.c                    |   2 +-
 drivers/scsi/arm/powertec.c               |   2 +-
 drivers/scsi/atari_scsi.c                 |   2 +-
 drivers/scsi/atp870u.c                    |   2 +-
 drivers/scsi/bfa/bfad_im.c                |   2 +-
 drivers/scsi/csiostor/csio_init.c         |   4 +-
 drivers/scsi/dc395x.c                     |   2 +-
 drivers/scsi/dmx3191d.c                   |   2 +-
 drivers/scsi/elx/efct/efct_xport.c        |   4 +-
 drivers/scsi/esas2r/esas2r_main.c         |   2 +-
 drivers/scsi/fdomain.c                    |   2 +-
 drivers/scsi/fnic/fnic_main.c             |   2 +-
 drivers/scsi/g_NCR5380.c                  |   2 +-
 drivers/scsi/gvp11.c                      |   2 +-
 drivers/scsi/hisi_sas/hisi_sas_main.c     |   2 +-
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c    |   2 +-
 drivers/scsi/hosts.c                      |   6 +-
 drivers/scsi/hpsa.c                       |   2 +-
 drivers/scsi/hptiop.c                     |   2 +-
 drivers/scsi/ibmvscsi/ibmvfc.c            |   2 +-
 drivers/scsi/ibmvscsi/ibmvscsi.c          |   2 +-
 drivers/scsi/imm.c                        |   2 +-
 drivers/scsi/initio.c                     |   2 +-
 drivers/scsi/ipr.c                        |   2 +-
 drivers/scsi/ips.c                        |   2 +-
 drivers/scsi/isci/init.c                  |   2 +-
 drivers/scsi/jazz_esp.c                   |   2 +-
 drivers/scsi/libiscsi.c                   |   2 +-
 drivers/scsi/lpfc/lpfc_init.c             |   2 +-
 drivers/scsi/mac53c94.c                   |   2 +-
 drivers/scsi/mac_esp.c                    |   2 +-
 drivers/scsi/mac_scsi.c                   |   2 +-
 drivers/scsi/megaraid.c                   |   2 +-
 drivers/scsi/megaraid/megaraid_mbox.c     |   2 +-
 drivers/scsi/megaraid/megaraid_sas_base.c |   2 +-
 drivers/scsi/mesh.c                       |   2 +-
 drivers/scsi/mpi3mr/mpi3mr_os.c           |   2 +-
 drivers/scsi/mpt3sas/mpt3sas_scsih.c      |   4 +-
 drivers/scsi/mvme147.c                    |   2 +-
 drivers/scsi/mvsas/mv_init.c              |   2 +-
 drivers/scsi/mvumi.c                      |   2 +-
 drivers/scsi/myrb.c                       |   2 +-
 drivers/scsi/myrs.c                       |   2 +-
 drivers/scsi/ncr53c8xx.c                  |   2 +-
 drivers/scsi/nsp32.c                      |   2 +-
 drivers/scsi/pcmcia/nsp_cs.c              |   2 +-
 drivers/scsi/pcmcia/qlogic_stub.c         |   2 +-
 drivers/scsi/pcmcia/sym53c500_cs.c        |   2 +-
 drivers/scsi/pm8001/pm8001_init.c         |   2 +-
 drivers/scsi/pmcraid.c                    |   2 +-
 drivers/scsi/ppa.c                        |   2 +-
 drivers/scsi/ps3rom.c                     |   2 +-
 drivers/scsi/qla1280.c                    |   2 +-
 drivers/scsi/qla2xxx/qla_mid.c            |   2 +-
 drivers/scsi/qla2xxx/qla_os.c             |   2 +-
 drivers/scsi/qlogicfas.c                  |   2 +-
 drivers/scsi/qlogicpti.c                  |   2 +-
 drivers/scsi/scsi_debug.c                 |   2 +-
 drivers/scsi/scsi_error.c                 |   4 +-
 drivers/scsi/scsi_lib.c                   |  10 +--
 drivers/scsi/scsi_scan.c                  |  15 +++-
 drivers/scsi/scsi_sysfs.c                 |  23 +++--
 drivers/scsi/sd.c                         |   2 +-
 drivers/scsi/sgiwd93.c                    |   2 +-
 drivers/scsi/smartpqi/smartpqi_init.c     |   2 +-
 drivers/scsi/snic/snic_main.c             |   2 +-
 drivers/scsi/stex.c                       |   2 +-
 drivers/scsi/storvsc_drv.c                |   2 +-
 drivers/scsi/sun3_scsi.c                  |   2 +-
 drivers/scsi/sun3x_esp.c                  |   2 +-
 drivers/scsi/sun_esp.c                    |   2 +-
 drivers/scsi/sym53c8xx_2/sym_glue.c       |   2 +-
 drivers/scsi/virtio_scsi.c                |   2 +-
 drivers/scsi/vmw_pvscsi.c                 |   2 +-
 drivers/scsi/wd719x.c                     |   2 +-
 drivers/scsi/xen-scsifront.c              |   2 +-
 drivers/scsi/zorro_esp.c                  |   2 +-
 include/linux/blk-mq.h                    |   6 --
 include/linux/blkdev.h                    |   2 -
 include/scsi/libfc.h                      |   2 +-
 include/scsi/scsi_device.h                |   9 +-
 include/scsi/scsi_host.h                  |   3 +-
 110 files changed, 168 insertions(+), 258 deletions(-)

-- 
2.43.7


^ permalink raw reply

* [PATCH splitout] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Michael S. Tsirkin @ 2026-06-09 10:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Miaohe Lin, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Naoya Horiguchi

TestSetPageHWPoison() is called without zone->lock, so its atomic
update to page->flags can race with non-atomic flag operations
that run under zone->lock in the buddy allocator.

In particular, __free_pages_prepare() does:

    page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;

This non-atomic read-modify-write correctly excludes __PG_HWPOISON
from the mask, but it can still lose a concurrent
TestSetPageHWPoison() if the read happens before the poison bit
is set and the write happens after.  This will only get worse if
and when more non-atomic flag operations are added.

Fix by acquiring zone->lock around TestSetPageHWPoison and
around ClearPageHWPoison in the retry path.  This
serializes with all buddy flag manipulation.  The cost is
negligible: one lock/unlock in an extremely rare path
(hardware memory errors).
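
The interleaving described above can be replayed deterministically in user space. The sketch below is only an illustration: the two flag constants are stand-ins for the real page flag bits, and both functions run the two sides' steps in a single thread rather than using real concurrency or the real zone->lock.

```c
#include <assert.h>
#include <stdbool.h>

#define FLAG_PREP_CHECKED (1UL << 0)	/* stand-in for PAGE_FLAGS_CHECK_AT_PREP */
#define FLAG_HWPOISON     (1UL << 1)	/* stand-in for the HWPoison bit */

/*
 * Replay the racy interleaving:
 *   freer:    tmp = flags                       (read)
 *   poisoner: flags |= FLAG_HWPOISON            (atomic set lands here)
 *   freer:    flags = tmp & ~FLAG_PREP_CHECKED  (stale value written back)
 * Returns true if the poison bit survived.
 */
static bool replay_racy(unsigned long flags)
{
	unsigned long tmp = flags;

	flags |= FLAG_HWPOISON;
	flags = tmp & ~FLAG_PREP_CHECKED;
	return (flags & FLAG_HWPOISON) != 0;
}

/* With both sides serialized, the read-modify-write cannot straddle the set. */
static bool replay_serialized(unsigned long flags)
{
	flags &= ~FLAG_PREP_CHECKED;	/* freer's critical section */
	flags |= FLAG_HWPOISON;		/* poisoner's critical section */
	return (flags & FLAG_HWPOISON) != 0;
}
```

The mask never names the poison bit, yet the stale write-back in replay_racy() still erases it; once the two steps cannot interleave, the bit survives regardless of which side runs first.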

Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
in this file operate on pages already removed from the buddy
allocator or on non-buddy pages (DAX, hugetlb), so they do not
need zone->lock protection.

Fixes: 6a46079cf57a ("HWPOISON: The high level memory error handler in the VM v7")
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Assisted-by: Claude:claude-opus-4-6
---

Sending separately as suggested by multiple people. I also added
a Fixes tag.


 mm/memory-failure.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ee42d4361309..3880486028a1 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2348,6 +2348,8 @@ int memory_failure(unsigned long pfn, int flags)
 	unsigned long page_flags;
 	bool retry = true;
 	int hugetlb = 0;
+	struct zone *zone;
+	unsigned long mf_flags;
 
 	if (!sysctl_memory_failure_recovery)
 		panic("Memory failure on page %lx", pfn);
@@ -2390,7 +2392,11 @@ int memory_failure(unsigned long pfn, int flags)
 	if (hugetlb)
 		goto unlock_mutex;
 
+	/* Serialize with non-atomic buddy flag operations */
+	zone = page_zone(p);
+	spin_lock_irqsave(&zone->lock, mf_flags);
 	if (TestSetPageHWPoison(p)) {
+		spin_unlock_irqrestore(&zone->lock, mf_flags);
 		res = -EHWPOISON;
 		if (flags & MF_ACTION_REQUIRED)
 			res = kill_accessing_process(current, pfn, flags);
@@ -2399,6 +2405,7 @@ int memory_failure(unsigned long pfn, int flags)
 		action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
 		goto unlock_mutex;
 	}
+	spin_unlock_irqrestore(&zone->lock, mf_flags);
 
 	/*
 	 * We need/can do nothing about count=0 pages.
@@ -2420,7 +2427,10 @@ int memory_failure(unsigned long pfn, int flags)
 			} else {
 				/* We lost the race, try again */
 				if (retry) {
+					/* Serialize with non-atomic buddy flag operations */
+					spin_lock_irqsave(&zone->lock, mf_flags);
 					ClearPageHWPoison(p);
+					spin_unlock_irqrestore(&zone->lock, mf_flags);
 					retry = false;
 					goto try_again;
 				}
-- 
MST


^ permalink raw reply related

* Re: New design
From: Lorenzo Stoakes @ 2026-06-09 10:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Matthew Wilcox, linux-kernel, David Hildenbrand (Arm), Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Andrew Morton, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <20260609033042-mutt-send-email-mst@kernel.org>

On Tue, Jun 09, 2026 at 04:06:08AM -0400, Michael S. Tsirkin wrote:
> One other question: would people like to see it as a single patchset
> or multiple ones 1-4? Multiple ones would be easier to review but
> of course this means no actual perf gain until part 5 is merged. Is that
> acceptable?

Personally I'd like to see multiple patch sets, not all sent at once.

Rather - send the first, wait for review, and once people have given tags,
then send the next - rinse and repeat.

That makes life easier for review, allows forward progress, and avoids
noise on-list.

>
> --
> MST
>

Thanks, Lorenzo

^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Michael S. Tsirkin @ 2026-06-09  9:54 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Matthew Wilcox, Lorenzo Stoakes, linux-kernel, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Andrew Morton, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <68ac0163-865a-452e-b643-351e094cb2ba@kernel.org>

On Mon, Jun 08, 2026 at 04:55:19PM +0200, David Hildenbrand (Arm) wrote:
> On 6/8/26 16:44, Matthew Wilcox wrote:
> > On Mon, Jun 08, 2026 at 04:37:03PM +0200, David Hildenbrand (Arm) wrote:
> >> On 6/8/26 16:31, Matthew Wilcox wrote:
> >>>
> >>> What I don't understand is how the kernel page allocator needs to know
> >>> the user address in order to effectively zero it, but the hypervisor is
> >>> able to zero the page without knowing the user address.  It feels like
> >>> somebody has x86-centric thinking where cache colouring doesn't matter.
> >>
> >> (not commenting on the icache/dcache mess we have to drag along)
> > 
> > Well, that was kind of the point of this email ... I did ask the
> > question you're answering in a different email so let me respond
> > to that too.
> 
> Now I'm confused :)
> 
> > 
> >> The thing is that with free-page-reporting the memory is already zeroed by the
> >> hypervisor as part of discarding that memory previously (e.g., MADV_DONTNEED)
> >> and allocating fresh pages on re-access.
> >>
> >> So it's not a question of "why is the hypervisor zeroing less efficiently", as
> >> zeroing is just a side-product of reclaiming that memory in the first place.
> > 
> > We definitely have users who don't want the guest to trust the
> > hypervisor.  So how do they disable this optimisation?
> 
> Right, I don't think we currently have a toggle to disable free page reporting.
> So IIUC, this optimization would similarly automatically get enabled if the
> hypervisor advertises it.
> 
> -- 
> Cheers,
> 
> David

Not as the patchset stands:

[PATCH v10 35/37] virtio_balloon: disable reporting zeroed optimization for confidential guests

disables it.

-- 
MST


^ permalink raw reply

* Re: [PATCH v1] vsock/virtio: rework MSG_ZEROCOPY flag handling
From: Arseniy Krasnov @ 2026-06-09  9:48 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Paolo Abeni, Michael S. Tsirkin, Jason Wang,
	Bobby Eshleman, Xuan Zhuo, Eugenio Pérez, Simon Horman, kvm,
	virtualization, netdev, linux-kernel, oxffffaa, rulkc
In-Reply-To: <20260608190820.71785304@kernel.org>

Hi, I'll rebase it on the latest net-next.

Thanks

On 09/06/2026 05:08, Jakub Kicinski wrote:
> On Fri,  5 Jun 2026 14:53:14 +0300 Arseniy Krasnov wrote:
>> Logically it was based on TCP implementation, so to make further
>> support easier, rewrite it in the TCP way.
> Does not apply:
>
> $ git pw series apply 1106582
> Failed to apply patch:
> Applying: vsock/virtio: rework MSG_ZEROCOPY flag handling
> error: sha1 information is lacking or useless (net/vmw_vsock/virtio_transport_common.c).
> error: could not build fake ancestor
> hint: Use 'git am --show-current-patch=diff' to see the failed patch
> hint: When you have resolved this problem, run "git am --continue".
> hint: If you prefer to skip this patch, run "git am --skip" instead.
> hint: To restore the original branch and stop patching, run "git am --abort".
> hint: Disable this message with "git config set advice.mergeConflict false"
> Patch failed at 0001 vsock/virtio: rework MSG_ZEROCOPY flag handling

^ permalink raw reply

* Re: [PATCH v1] drm/virtio: Fix driver removal with disabled KMS
From: Dmitry Osipenko @ 2026-06-09  9:33 UTC (permalink / raw)
  To: Ryosuke Yasuoka, David Airlie, Gerd Hoffmann, Gurchetan Singh,
	Chia-I Wu
  Cc: dri-devel, virtualization, linux-kernel
In-Reply-To: <18b75e0e21e52581.f1a5ca06374b8df6.21049e9ba3a0559d@ryasuoka-thinkpadx1carbongen9.tokyo.csb>

On 6/9/26 11:59, Ryosuke Yasuoka wrote:
> Hi Dmitry
> 
> On 08/06/2026 21:40, Dmitry Osipenko wrote:
>> Hi,
>>
>> On 6/7/26 07:31, Ryosuke Yasuoka wrote:
>>> Hi Dmitry
>>>
>>> On 04/06/2026 15:27, Dmitry Osipenko wrote:
>>>> DRM atomic and modesetting aren't initialized if virtio-gpu driver built
>>>> with disabled KMS, leading to access of uninitialized data on driver
>>>> removal/unbinding and crashing kernel. Fix it by skipping shutting down
>>>> atomic core with unavailable KMS.
>>>>
>>>> Fixes: 72122c69d717 ("drm/virtio: Add option to disable KMS support")
>>>> Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
>>>> ---
>>>>  drivers/gpu/drm/virtio/virtgpu_drv.c | 5 ++++-
>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/virtio/virtgpu_drv.c b/drivers/gpu/drm/virtio/virtgpu_drv.c
>>>> index f0fb784c0f6f..2aaa7cb08085 100644
>>>> --- a/drivers/gpu/drm/virtio/virtgpu_drv.c
>>>> +++ b/drivers/gpu/drm/virtio/virtgpu_drv.c
>>>> @@ -138,7 +138,10 @@ static void virtio_gpu_remove(struct virtio_device *vdev)
>>>>  
>>>>  	virtio_gpu_release_vqs(dev);
>>>>  	drm_dev_unplug(dev);
>>>> -	drm_atomic_helper_shutdown(dev);
>>>> +
>>>> +	if (drm_core_check_feature(dev, DRIVER_ATOMIC))
>>>> +		drm_atomic_helper_shutdown(dev);
>>>> +
>>>>  	virtio_gpu_deinit(dev);
>>>>  	drm_dev_put(dev);
>>>>  }
>>>
>>> The patch looks good to me at a glance. I haven't done a full, deep code
>>> review yet, but I've tested it on my lab and everything works as
>>> expected.
>>>
>>> Tested-by: Ryosuke Yasuoka <ryasuoka@redhat.com>
>>
>> Thanks a lot for the testing. The review from you will be very welcomed
>> too.
> 
> I reviewed your patch and this change looks good to me.
> 
> Reviewed-by: Ryosuke Yasuoka <ryasuoka@redhat.com>

Thanks for the review, applied to misc-fixes

-- 
Best regards,
Dmitry

^ permalink raw reply

* Re: [PATCH v1] drm/virtio: Fix driver removal with disabled KMS
From: Ryosuke Yasuoka @ 2026-06-09  8:59 UTC (permalink / raw)
  To: Dmitry Osipenko, David Airlie, Gerd Hoffmann, Gurchetan Singh,
	Chia-I Wu
  Cc: dri-devel, virtualization, linux-kernel
In-Reply-To: <f3848234-4258-483f-bc27-cdd31f65e7aa@collabora.com>

Hi Dmitry

On 08/06/2026 21:40, Dmitry Osipenko wrote:
> Hi,
> 
> On 6/7/26 07:31, Ryosuke Yasuoka wrote:
>> Hi Dmitry
>> 
>> On 04/06/2026 15:27, Dmitry Osipenko wrote:
>>> DRM atomic and modesetting aren't initialized if virtio-gpu driver built
>>> with disabled KMS, leading to access of uninitialized data on driver
>>> removal/unbinding and crashing kernel. Fix it by skipping shutting down
>>> atomic core with unavailable KMS.
>>>
>>> Fixes: 72122c69d717 ("drm/virtio: Add option to disable KMS support")
>>> Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
>>> ---
>>>  drivers/gpu/drm/virtio/virtgpu_drv.c | 5 ++++-
>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/virtio/virtgpu_drv.c b/drivers/gpu/drm/virtio/virtgpu_drv.c
>>> index f0fb784c0f6f..2aaa7cb08085 100644
>>> --- a/drivers/gpu/drm/virtio/virtgpu_drv.c
>>> +++ b/drivers/gpu/drm/virtio/virtgpu_drv.c
>>> @@ -138,7 +138,10 @@ static void virtio_gpu_remove(struct virtio_device *vdev)
>>>  
>>>  	virtio_gpu_release_vqs(dev);
>>>  	drm_dev_unplug(dev);
>>> -	drm_atomic_helper_shutdown(dev);
>>> +
>>> +	if (drm_core_check_feature(dev, DRIVER_ATOMIC))
>>> +		drm_atomic_helper_shutdown(dev);
>>> +
>>>  	virtio_gpu_deinit(dev);
>>>  	drm_dev_put(dev);
>>>  }
>> 
>> The patch looks good to me at a glance. I haven't done a full, deep code
>> review yet, but I've tested it on my lab and everything works as
>> expected.
>> 
>> Tested-by: Ryosuke Yasuoka <ryasuoka@redhat.com>
> 
> Thanks a lot for the testing. The review from you will be very welcomed
> too.

I reviewed your patch and this change looks good to me.

Reviewed-by: Ryosuke Yasuoka <ryasuoka@redhat.com>

Thank you!
Ryosuke


* Re: [PATCH net] vsock/virtio: restore msg_iter on transmission failure
From: Stefano Garzarella @ 2026-06-09  8:48 UTC (permalink / raw)
  To: Octavian Purdila
  Cc: netdev, syzbot+28e5f3d207b14bae122a, Stefan Hajnoczi,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Arseniy Krasnov, kvm, virtualization, linux-kernel
In-Reply-To: <20260609004809.1285028-1-tavip@google.com>

On Tue, Jun 09, 2026 at 12:48:05AM +0000, Octavian Purdila wrote:
>When transmission fails in virtio_transport_send_pkt_info, the msg_iter
>might have been partially advanced. If we don't restore it, the next
>attempt to send data will use an incorrect iterator state, leading to
>desync and warnings like "send_pkt() returns 0, but X expected".

Thanks for the fix! I have some comments.
>
>Specifically, this can happen in the following scenario, triggered by
>the syzkaller repro:
>
>1. A write-only VMA (PROT_WRITE only) is partially populated by a
>   prior TUN write that failed with -EIO but still faulted in some
>   pages.
>2. A vsock sendmmsg call with MSG_ZEROCOPY requests transmission of a
>   buffer from this VMA.
>3. The first packet (64KB) is sent successfully because the pages are
>   populated.
>4. The second packet allocation fails because GUP fast pins the first page
>   but GUP slow fails on the next unpopulated page due to PROT_WRITE-only
>   permissions.
>5. The iterator is advanced by the partially successful GUP (68KB total
>   advanced: 64KB from first packet + 4KB from second), but the send loop
>   breaks and only reports 64KB sent. This creates a 4KB desync.
>6. The next retry starts with a non-zero iov_offset, disabling zerocopy
>   and falling back to copy mode.
>7. In copy mode, the transmission succeeds for the next packets but
>   exhausts the iterator early because of the desync.
>8. The final retry sees an empty iterator but zerocopy is re-enabled
>   (offset resets). It attempts to send the remaining bytes with zerocopy
>   but pins 0 pages, creating an empty packet.
>9. The transport sends the empty packet, triggering the warning because
>   the returned bytes (header only) do not match the expected payload size.
>10. The loop continues to spin, allocating ubuf_info each time, eventually
>    exhausting sysctl_optmem_max and returning -ENOMEM to userspace.
>
>Restore msg_iter to its original state before the packet allocation
>and transmission attempt if they fail.
>
>Fixes: e0718bd82e27 ("vsock: enable setting SO_ZEROCOPY")
>Reported-by: syzbot+28e5f3d207b14bae122a@syzkaller.appspotmail.com
>Closes: https://syzkaller.appspot.com/bug?extid=28e5f3d207b14bae122a
>Assisted-by: gemini:gemini-3.1-pro
>Signed-off-by: Octavian Purdila <tavip@google.com>
>---
> net/vmw_vsock/virtio_transport_common.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
>diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>index b10666937c490..588623a3e2bbc 100644
>--- a/net/vmw_vsock/virtio_transport_common.c
>+++ b/net/vmw_vsock/virtio_transport_common.c
>@@ -367,6 +367,10 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> 	do {
> 		struct sk_buff *skb;
> 		size_t skb_len;
>+		struct iov_iter saved_iter;

trivial: reverse xmas tree: 
https://docs.kernel.org/process/maintainer-netdev.html#local-variable-ordering-reverse-xmas-tree-rcs

>+
>+		if (info->msg)
>+			saved_iter = info->msg->msg_iter;

What about using iov_iter_save_state()/iov_iter_restore() ?

IIUC we may need to export iov_iter_restore(), so not a strong opinion, 
but it looks better to use those API IMHO.

>
> 		skb_len = min(max_skb_len, rest_len);
>
>@@ -375,6 +379,8 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> 						 src_cid, src_port,
> 						 dst_cid, dst_port);

What about adding a comment on top of virtio_transport_alloc_skb() call 
(or when we save the state) to explain that in specific cases it can 
advance the msg_iter ?

> 		if (!skb) {
>+			if (info->msg)
>+				info->msg->msg_iter = saved_iter;
> 			ret = -ENOMEM;
> 			break;
> 		}
>@@ -382,8 +388,11 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> 		virtio_transport_inc_tx_pkt(vvs, skb);
>
> 		ret = t_ops->send_pkt(skb, info->net);
>-		if (ret < 0)
>+		if (ret < 0) {
>+			if (info->msg)
>+				info->msg->msg_iter = saved_iter;

Also, what about having a single restore point after the loop?

I mean something like this (untested):

diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index b10666937c49..2f3c6c82c155 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -295,6 +295,7 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
         u32 max_skb_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
         u32 src_cid, src_port, dst_cid, dst_port;
         const struct virtio_transport *t_ops;
+       struct iov_iter_state msg_iter_state;
         struct virtio_vsock_sock *vvs;
         struct ubuf_info *uarg = NULL;
         u32 pkt_len = info->pkt_len;
@@ -368,6 +369,9 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
                 struct sk_buff *skb;
                 size_t skb_len;
  
+               if (info->msg)
+                       iov_iter_save_state(&info->msg->msg_iter, &msg_iter_state);
+
                 skb_len = min(max_skb_len, rest_len);
  
                 skb = virtio_transport_alloc_skb(info, skb_len, can_zcopy,
@@ -399,6 +403,9 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
                         break;
         } while (rest_len);
  
+       if (info->msg && ret < 0)
+               iov_iter_restore(&info->msg->msg_iter, &msg_iter_state);
+
         virtio_transport_put_credit(vvs, rest_len);
  
         /* msg_zerocopy_realloc() initializes the ubuf_info refcnt to 1.


Thanks,
Stefano

> 			break;
>+		}
>
> 		/* Both virtio and vhost 'send_pkt()' returns 'skb_len',
> 		 * but for reliability use 'ret' instead of 'skb_len'.
>-- 
>2.54.0.1064.gd145956f57-goog
>


* Re: New design
From: David Hildenbrand (Arm) @ 2026-06-09  8:12 UTC (permalink / raw)
  To: Matthew Wilcox, Michael S. Tsirkin
  Cc: linux-kernel, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Muchun Song, Oscar Salvador, Andrew Morton, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aiePVsiV-w-IZ8A4@casper.infradead.org>

On 6/9/26 05:58, Matthew Wilcox wrote:
> OK, here's how I'd structure this:
> 
> 1. Introduce PG_zeroed for buddy pages
> 2. Set it if init_on_free is set
> 3. Set it from balloon driver
> 
> https://lore.kernel.org/lkml/c7094de807c0e963526686e1d245bc76193b1a92.1776689093.git.mst@redhat.com/ 
> 
> but add FPI_ZEROED instead of an extra bool parameter.
> 
> 4. Introduce page_is_zeroed like this:
> 
> static inline bool page_is_zeroed(const struct page *page)
> {
>         /*
>          * lru.next has bit 2 set if the page is already zeroed.
>          * Callers may simply overwrite it once they no longer
> 	 * need to preserve that information.
>          */
>         return (unsigned long)page->lru.next & BIT(2);
> }
> 
> (you'll notice this is similar to page_is_pfmemalloc() but it doesn't
> need to be in mm.h)
> 
> This step is going to be a bit fiddly.  We weren't expecting to return
> multiple flags in page->lru.next, so clear_page_pfmemalloc() just sets
> page->lru.next to NULL.  So somewhere we need to make sure that
> page->lru.next is definitely NULL, and then allow both the zeroed and
> pfmemalloc flags to be set in it.
> 
> The important part of this is that it allows the zeroed flag to be
> returned from the page allocator without introducing pghint_t like you
> did in v2.

I previously raised (in v2? not sure) that we could use a page flag that is
only used for folios, and then simply clear that flag on the folio allocation
path so that we don't get false positives with the bit set.

> 
> 5. Now you can start skipping various zeroing steps higher in the call
> chain.
> 
> I understand David's disgust with vma_alloc_zeroed_movable_folio()
> but that is surely a separate cleanup and nothing to do with this
> patchset.

Well, in my reality, we're just finding interesting ways to work around the fact
that GFP_ZERO sometimes does what we want, sometimes doesn't.

So we leak information out of the buddy to really only handle one scenario:
fixing up GFP_ZERO currently sometimes not doing what we want.

I'm afraid we couldn't use the above trick to punch zeroed pages back into the
buddy: some random user doing alloc+use+free would be unaware that there is a
bit to clear.

So I assume really only folio allocation would make use of this, to work around
our problematic GFP_ZERO implementation.

-- 
Cheers,

David

* Re: New design
From: Michael S. Tsirkin @ 2026-06-09  8:06 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aiePVsiV-w-IZ8A4@casper.infradead.org>

On Tue, Jun 09, 2026 at 04:58:14AM +0100, Matthew Wilcox wrote:
> OK, here's how I'd structure this:

Thanks a lot for looking into this and writing this Matthew! Looks
workable, let's see if there's rough consensus around this.
Two questions to make sure I understand.

> 
> 1. Introduce PG_zeroed for buddy pages
> 2. Set it if init_on_free is set

Not 100% sure why we want this bit. And I am not sure this actually
works, because init_on_free zeroes via kernel_init_pages(), which
does not flush the cache on arm32.
You will notice that user_alloc_needs_zeroing ignores init_on_free.
Right?
How about we skip step 2, make the patchset a bit smaller?


> 3. Set it from balloon driver
> 
> https://lore.kernel.org/lkml/c7094de807c0e963526686e1d245bc76193b1a92.1776689093.git.mst@redhat.com/ 
> 
> but add FPI_ZEROED instead of an extra bool parameter.
> 
> 4. Introduce page_is_zeroed like this:
> 
> static inline bool page_is_zeroed(const struct page *page)
> {
>         /*
>          * lru.next has bit 2 set if the page is already zeroed.
>          * Callers may simply overwrite it once they no longer
> 	 * need to preserve that information.
>          */
>         return (unsigned long)page->lru.next & BIT(2);
> }
> 
> (you'll notice this is similar to page_is_pfmemalloc() but it doesn't
> need to be in mm.h)
> 
> This step is going to be a bit fiddly.  We weren't expecting to return
> multiple flags in page->lru.next, so clear_page_pfmemalloc() just sets
> page->lru.next to NULL.  So somewhere we need to make sure that
> page->lru.next is definitely NULL, and then allow both the zeroed and
> pfmemalloc flags to be set in it.
> 
> The important part of this is that it allows the zeroed flag to be
> returned from the page allocator without introducing pghint_t like you
> did in v2.
> 
> 5. Now you can start skipping various zeroing steps higher in the call
> chain.
> I understand David's disgust with vma_alloc_zeroed_movable_folio()
> but that is surely a separate cleanup and nothing to do with this
> patchset.



One other question: would people like to see it as a single patchset
or multiple ones 1-4? Multiple ones would be easier to review but
of course this means no actual perf gain until part 5 is merged. Is that
acceptable?

-- 
MST


* Re: [PATCH v4 10/47] x86/tsc: Consolidate forcing of X86_FEATURE_TSC_KNOWN_FREQ for PV code
From: Thomas Gleixner @ 2026-06-09  7:48 UTC (permalink / raw)
  To: Sean Christopherson, David Woodhouse
  Cc: Paolo Bonzini, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Kiryl Shutsemau, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov, Jan Kiszka,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	John Stultz, H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, Tom Lendacky, Nikunj A Dadhania,
	Michael Kelley
In-Reply-To: <aidEfvTMjLa2zt43@google.com>

On Mon, Jun 08 2026 at 15:38, Sean Christopherson wrote:
> On Sat, Jun 06, 2026, David Woodhouse wrote:
>> > Along with:
>> > 
>> >    if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
>> >       if (tsc_khz_early)
>> >          pr_warn("Ignoring non-sensical tsc_early_khz command line argument\n");
>> > 
>> > or something daft like that.
>
> Ya, I ended up in the same place once Sashiko pointed out that skipping the SNP/TDX
> setup was hazardous[*], and also once I realized that tsc_khz_early *complemented*
> the refinement instead of replacing it.
>
> This is what I have locally:
>
>         if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
>                 known_tsc_khz = snp_secure_tsc_init();
>         else if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
>                 known_tsc_khz = tdx_tsc_init();
>
>         /*
>          * If the TSC frequency wasn't provided by trusted firmware, try to get
>          * it from the hypervisor (which is untrusted when running as a CoCo guest).
>          */
>         if (!known_tsc_khz && x86_init.hyper.get_tsc_khz)
>                 known_tsc_khz = x86_init.hyper.get_tsc_khz();
>
>         /*
>          * Mark the TSC frequency as known if it was obtained from a hypervisor
>          * or trusted firmware.  Don't mark the frequency as known if the user
>          * specified the frequency, as the user-provided frequency is intended
>          * as a "starting point", not a known, guaranteed frequency.
>          */
>         if (known_tsc_khz && !tsc_early_khz)
>                 setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);

If the frequency is known via the above then you want to set the
KNOWN_FREQ feature bit unconditionally. SNP/TDX/hypervisor override the
command line argument as you print below.

>         /*
>          * Ignore the user-provided TSC frequency if the exact frequency was
>          * obtained from trusted firmware or the hypervisor, as the user-
>          * provided frequency is intended as a "starting point", not a known,
>          * guaranteed frequency.
>          */
>         if (!known_tsc_khz)
>                 known_tsc_khz = tsc_early_khz;
>         else if (tsc_early_khz)
>                 pr_err("Ignoring 'tsc_early_khz' in favor of firmware/hypervisor.\n");
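
To make the precedence concrete, here is a minimal userspace sketch of
the selection logic with KNOWN_FREQ set unconditionally whenever the
frequency came from firmware or the hypervisor. It is only a model:
struct tsc_choice and pick_tsc_khz() are hypothetical names, and the
three inputs stand in for the SNP/TDX value, the
x86_init.hyper.get_tsc_khz() value and tsc_early_khz.

```c
#include <assert.h>
#include <stdbool.h>

struct tsc_choice {
	unsigned int khz;	/* frequency to use, 0 if none available */
	bool known_freq;	/* would X86_FEATURE_TSC_KNOWN_FREQ be forced? */
	bool early_ignored;	/* was a user-provided value overridden? */
};

static struct tsc_choice pick_tsc_khz(unsigned int firmware_khz,
				      unsigned int hypervisor_khz,
				      unsigned int early_khz)
{
	struct tsc_choice c = { 0, false, false };

	/* Trusted firmware first, then the hypervisor. */
	c.khz = firmware_khz ? firmware_khz : hypervisor_khz;

	if (c.khz) {
		/* Known source: mark it known, whatever the command line says. */
		c.known_freq = true;
		c.early_ignored = early_khz != 0;
	} else {
		/* Only a starting point from the user, never "known". */
		c.khz = early_khz;
	}

	return c;
}
```

In this model a command line value never clears known_freq. It is either
overridden (and reported as ignored) or used as an unverified starting
point when nothing better exists.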

>> All the nonsense about updating it every time we enter a CPU could just
>> go away completely.
>
> But to Thomas' point, why bother?  For actual old hardware, kvmclock is what it
> is.  For modern hardware, it's completely antiquated.

I agree, but we are not forced to make it a first class citizen to the
detriment of sane systems.

Thanks,

        tglx

* Re: [PATCH v2] i2c: virtio: retain xfer with kref to fix UAF on interrupted wait
From: Viresh Kumar @ 2026-06-09  7:35 UTC (permalink / raw)
  To: Gavin Li; +Cc: linux-i2c, Chen, Jian Jun, andi.shyti, virtualization
In-Reply-To: <20260608174435.62359-1-gavin.li@samsara.com>

On 08-06-26, 13:44, Gavin Li wrote:
> commit a663b3c47ab1 ("i2c: virtio: Avoid hang by using interruptible
> completion wait") switched virtio_i2c_complete_reqs() to
> wait_for_completion_interruptible() so a stuck device cannot hang a
> task forever. That left a use-after-free: if the wait returns early on
> a signal, virtio_i2c_xfer() frees reqs and DMA bounce buffers while the
> device may still hold virtqueue tokens pointing at &reqs[i] and DMA
> into read buffers. When those requests complete later,
> virtio_i2c_msg_done() calls complete() on freed memory.
> 
> Waiting uninterruptibly for every completion before freeing avoids the
> UAF but can hang the caller indefinitely if the virtio side never
> completes the request. The virtio spec provides no way to cancel an
> in-flight transfer, so that is not an acceptable tradeoff.
> 
> This commit makes two changes:
> 
> - Manage the freeing of the xfer allocations via kref, and ensure that
>   each in-flight request holds a reference. This fixes the
>   use-after-free by ensuring that the virtio device has a valid location
>   to write to until the request completes. This will cause a memory
>   leak in cases where the device hangs, but that is much preferable to
>   memory corruption.
> 
> - Use wait_for_completion_killable() instead of _interruptible(). Even
>   partial I2C transactions can have side effects, so the only time it
>   makes sense to interrupt a transaction is when a process needs to be
>   killed. Most existing I2C drivers don't support interruption at all,
>   so this should not break userspace applications. This also addresses
>   issues with Go programs accessing devices via the I2C userspace API,
>   since the Go runtime stochastically signals SIGURG to running threads;
>   leaving this as _interruptible() may cause partial side effects from
>   which it is impossible to cleanly restart.
> 
> Signed-off-by: Gavin Li <gavin.li@samsara.com>
> ---
>  drivers/i2c/busses/i2c-virtio.c | 89 ++++++++++++++++++++++++---------
>  1 file changed, 64 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/i2c/busses/i2c-virtio.c b/drivers/i2c/busses/i2c-virtio.c
> index 726c162cabd86..f7320a67a3409 100644
> --- a/drivers/i2c/busses/i2c-virtio.c
> +++ b/drivers/i2c/busses/i2c-virtio.c
> @@ -13,6 +13,7 @@
>  #include <linux/err.h>
>  #include <linux/i2c.h>
>  #include <linux/kernel.h>
> +#include <linux/kref.h>
>  #include <linux/module.h>
>  #include <linux/virtio.h>
>  #include <linux/virtio_ids.h>
> @@ -31,39 +32,77 @@ struct virtio_i2c {
>  	struct virtqueue *vq;
>  };
>  
> +struct virtio_i2c_xfer;
> +
>  /**
>   * struct virtio_i2c_req - the virtio I2C request structure
> + * @xfer: owning transfer
> + * @msg: copy of the I2C message for virtio_i2c_xfer_release
>   * @completion: completion of virtio I2C message
>   * @out_hdr: the OUT header of the virtio I2C message
>   * @buf: the buffer into which data is read, or from which it's written
>   * @in_hdr: the IN header of the virtio I2C message
>   */
>  struct virtio_i2c_req {
> +	struct virtio_i2c_xfer *xfer;
> +	struct i2c_msg msg;
>  	struct completion completion;
>  	struct virtio_i2c_out_hdr out_hdr	____cacheline_aligned;
>  	uint8_t *buf				____cacheline_aligned;
>  	struct virtio_i2c_in_hdr in_hdr		____cacheline_aligned;
>  };
>  
> +/**
> + * struct virtio_i2c_xfer - a queued I2C transfer
> + * @ref: one ref for the caller, plus one per in-flight virtqueue request
> + * @num: number of messages
> + * @reqs: the virtio I2C requests
> + */
> +struct virtio_i2c_xfer {
> +	struct kref ref;
> +	int num;
> +	struct virtio_i2c_req reqs[];
> +};
> +
> +static void virtio_i2c_xfer_release(struct kref *ref)
> +{
> +	struct virtio_i2c_xfer *xfer = container_of(ref, struct virtio_i2c_xfer, ref);
> +	int i;
> +
> +	for (i = 0; i < xfer->num; i++) {
> +		struct virtio_i2c_req *req = &xfer->reqs[i];
> +		i2c_put_dma_safe_msg_buf(req->buf, &req->msg, false);
> +	}
> +
> +	kfree(xfer);
> +}
> +
>  static void virtio_i2c_msg_done(struct virtqueue *vq)
>  {
>  	struct virtio_i2c_req *req;
>  	unsigned int len;
>  
> -	while ((req = virtqueue_get_buf(vq, &len)))
> +	while ((req = virtqueue_get_buf(vq, &len))) {
>  		complete(&req->completion);
> +		kref_put(&req->xfer->ref, virtio_i2c_xfer_release);
> +	}
>  }
>  
>  static int virtio_i2c_prepare_reqs(struct virtqueue *vq,
> -				   struct virtio_i2c_req *reqs,
> +				   struct virtio_i2c_xfer *xfer,
>  				   struct i2c_msg *msgs, int num)
>  {
>  	struct scatterlist *sgs[3], out_hdr, msg_buf, in_hdr;
> +	struct virtio_i2c_req *reqs = xfer->reqs;
>  	int i;
>  
> +	kref_init(&xfer->ref);
> +
>  	for (i = 0; i < num; i++) {
>  		int outcnt = 0, incnt = 0;
>  
> +		reqs[i].xfer = xfer;
> +		reqs[i].msg = msgs[i];
>  		init_completion(&reqs[i].completion);
>  
>  		/*
> @@ -99,36 +138,36 @@ static int virtio_i2c_prepare_reqs(struct virtqueue *vq,
>  
>  		if (virtqueue_add_sgs(vq, sgs, outcnt, incnt, &reqs[i], GFP_KERNEL)) {
>  			i2c_put_dma_safe_msg_buf(reqs[i].buf, &msgs[i], false);
> +			reqs[i].buf = NULL; /* prevent free by virtio_i2c_xfer_release */
>  			break;
>  		}
> +
> +		kref_get(&xfer->ref); /* released in virtio_i2c_msg_done() */

Maybe move the comment above the code ? Can be dropped too.

Also, maybe there is a small race here, not sure. What if the other
side (polls and) processes the message as soon as it is added to the
queue with virtqueue_add_sgs() ? In that case virtio_i2c_msg_done()
will call complete (which won't harm) and kref_put(). If this happens
for the first req of the xfer, it may end up freeing the xfer while
being used here ?
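
To show what I mean, here is a small userspace model of the ordering.
It is a sketch under assumptions, not the driver: struct toy_xfer,
toy_queue_add() and toy_submit() are made-up names, the refcount is a
plain int, and the "device" completes synchronously inside the add call
to mimic the worst case of an immediately polling backend.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_xfer {
	int refs;
	bool released;	/* stands in for kfree() having run */
};

static void toy_get(struct toy_xfer *x)
{
	x->refs++;
}

static void toy_put(struct toy_xfer *x)
{
	if (--x->refs == 0)
		x->released = true;
}

/*
 * Toy queue: on success the request completes at once, so the
 * completion handler's put runs before the add call returns.
 */
static int toy_queue_add(struct toy_xfer *x, bool fail)
{
	if (fail)
		return -1;
	toy_put(x);	/* what the completion callback would do */
	return 0;
}

/* Returns true if the xfer was released while still being set up. */
static bool toy_submit(struct toy_xfer *x, bool get_first, bool add_fails)
{
	bool freed_early;

	x->refs = 1;	/* caller's reference, as kref_init() would set */
	x->released = false;

	if (get_first)
		toy_get(x);	/* reference owned by the in-flight request */

	if (toy_queue_add(x, add_fails) < 0) {
		if (get_first)
			toy_put(x);	/* request never went in flight */
		return x->released;
	}

	freed_early = x->released;
	if (!get_first)
		toy_get(x);	/* too late if the completion already ran */

	return freed_early;
}
```

Taking the reference before queueing, and dropping it again if the add
fails, keeps the count from reaching zero in this model. Whether the
real callback can actually run that early is exactly the part I am not
sure about.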

>  	}
>  
> +	xfer->num = i;
>  	return i;
>  }
>  
> -static int virtio_i2c_complete_reqs(struct virtqueue *vq,
> -				    struct virtio_i2c_req *reqs,
> -				    struct i2c_msg *msgs, int num)
> +static int virtio_i2c_complete_reqs(struct virtio_i2c_xfer *xfer)

Maybe rename to complete_xfer now ?

>  {
> -	bool failed = false;
> -	int i, j = 0;
> +	struct virtio_i2c_req *reqs = xfer->reqs;
> +	int i, fail_index = -1;
>  
> -	for (i = 0; i < num; i++) {
> +	for (i = 0; i < xfer->num; i++) {
>  		struct virtio_i2c_req *req = &reqs[i];
> -
> -		if (!failed) {
> -			if (wait_for_completion_interruptible(&req->completion))
> -				failed = true;
> -			else if (req->in_hdr.status != VIRTIO_I2C_MSG_OK)
> -				failed = true;
> -			else
> -				j++;
> +		if (wait_for_completion_killable(&req->completion)) {

Maybe do this in a separate patch ?

> +			return -EINTR;
> +		} else if (req->in_hdr.status != VIRTIO_I2C_MSG_OK) {
> +			/* Don't break yet. Try to wait until all requests complete. */
> +			if (fail_index < 0)
> +				fail_index = i;
>  		}
> -
> -		i2c_put_dma_safe_msg_buf(reqs[i].buf, &msgs[i], !failed);
> +		i2c_put_dma_safe_msg_buf(req->buf, &req->msg, fail_index < 0);
> +		req->buf = NULL; /* prevent free by virtio_i2c_xfer_release */
>  	}
>  
> -	return j;
> +	return fail_index >= 0 ? fail_index : xfer->num; /* number of successful transactions */

If this comment is required, maybe add it above the line instead.

>  }
>  
>  static int virtio_i2c_xfer(struct i2c_adapter *adap, struct i2c_msg *msgs,
> @@ -136,14 +175,14 @@ static int virtio_i2c_xfer(struct i2c_adapter *adap, struct i2c_msg *msgs,
>  {
>  	struct virtio_i2c *vi = i2c_get_adapdata(adap);
>  	struct virtqueue *vq = vi->vq;
> -	struct virtio_i2c_req *reqs;
> +	struct virtio_i2c_xfer *xfer;
>  	int count;
>  
> -	reqs = kcalloc(num, sizeof(*reqs), GFP_KERNEL);
> -	if (!reqs)
> +	xfer = kzalloc(struct_size(xfer, reqs, num), GFP_KERNEL);
> +	if (!xfer)
>  		return -ENOMEM;
>  
> -	count = virtio_i2c_prepare_reqs(vq, reqs, msgs, num);
> +	count = virtio_i2c_prepare_reqs(vq, xfer, msgs, num);
>  	if (!count)
>  		goto err_free;
>  
> @@ -157,10 +196,10 @@ static int virtio_i2c_xfer(struct i2c_adapter *adap, struct i2c_msg *msgs,
>  	 */
>  	virtqueue_kick(vq);
>  
> -	count = virtio_i2c_complete_reqs(vq, reqs, msgs, count);
> +	count = virtio_i2c_complete_reqs(xfer);
>  
>  err_free:
> -	kfree(reqs);
> +	kref_put(&xfer->ref, virtio_i2c_xfer_release);
>  	return count;
>  }

Nice work Gavin.

-- 
viresh

* Re: New design
From: Gregory Price @ 2026-06-09  6:18 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michael S. Tsirkin, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aiePVsiV-w-IZ8A4@casper.infradead.org>

On Tue, Jun 09, 2026 at 04:58:14AM +0100, Matthew Wilcox wrote:
> OK, here's how I'd structure this:
> 
> 1. Introduce PG_zeroed for buddy pages
> 2. Set it if init_on_free is set
> 3. Set it from balloon driver
> 
> https://lore.kernel.org/lkml/c7094de807c0e963526686e1d245bc76193b1a92.1776689093.git.mst@redhat.com/ 
> 
> but add FPI_ZEROED instead of an extra bool parameter.
> 
> 4. Introduce page_is_zeroed like this:
> 
> static inline bool page_is_zeroed(const struct page *page)
> {
>         /*
>          * lru.next has bit 2 set if the page is already zeroed.
>          * Callers may simply overwrite it once they no longer
> 	 * need to preserve that information.
>          */
>         return (unsigned long)page->lru.next & BIT(2);
> }
> 
> (you'll notice this is similar to page_is_pfmemalloc() but it doesn't
> need to be in mm.h)
> 
> This step is going to be a bit fiddly.  We weren't expecting to return
> multiple flags in page->lru.next, so clear_page_pfmemalloc() just sets
> page->lru.next to NULL.  So somewhere we need to make sure that
> page->lru.next is definitely NULL, and then allow both the zeroed and
> pfmemalloc flags to be set in it.
> 
> The important part of this is that it allows the zeroed flag to be
> returned from the page allocator without introducing pghint_t like you
> did in v2.
>

Are you suggesting leaking the flags out entirely, or just to the
boundaries of page_alloc.c (__alloc_frozen_pages_noprof() etc.)?

I assume the latter, but worth clarifying.

Otherwise this seems reasonable.

If we're just going to pile more stuff in lru.next you might as well
add the alias to mm.h but keep the bits defined in page_alloc.c
to prevent them from escaping (even if they end up set, nothing outside
page_alloc.c knows what any of them mean).

Unless my read on this is mistaken, let me know if I've misunderstood
anything.

> 5. Now you can start skipping various zeroing steps higher in the call
> chain.
> 
> I understand David's disgust with vma_alloc_zeroed_movable_folio()
> but that is surely a separate cleanup and nothing to do with this
> patchset.

* New design
From: Matthew Wilcox @ 2026-06-09  3:58 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <cover.1780906288.git.mst@redhat.com>

OK, here's how I'd structure this:

1. Introduce PG_zeroed for buddy pages
2. Set it if init_on_free is set
3. Set it from balloon driver

https://lore.kernel.org/lkml/c7094de807c0e963526686e1d245bc76193b1a92.1776689093.git.mst@redhat.com/ 

but add FPI_ZEROED instead of an extra bool parameter.

4. Introduce page_is_zeroed like this:

static inline bool page_is_zeroed(const struct page *page)
{
        /*
         * lru.next has bit 2 set if the page is already zeroed.
         * Callers may simply overwrite it once they no longer
	 * need to preserve that information.
         */
        return (unsigned long)page->lru.next & BIT(2);
}

(you'll notice this is similar to page_is_pfmemalloc() but it doesn't
need to be in mm.h)

This step is going to be a bit fiddly.  We weren't expecting to return
multiple flags in page->lru.next, so clear_page_pfmemalloc() just sets
page->lru.next to NULL.  So somewhere we need to make sure that
page->lru.next is definitely NULL, and then allow both the zeroed and
pfmemalloc flags to be set in it.

The important part of this is that it allows the zeroed flag to be
returned from the page allocator without introducing pghint_t like you
did in v2.

5. Now you can start skipping various zeroing steps higher in the call
chain.

I understand David's disgust with vma_alloc_zeroed_movable_folio()
but that is surely a separate cleanup and nothing to do with this
patchset.
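
The fiddly part of step 4 (two hint bits sharing page->lru.next, starting
from a definitely-NULL value) can be sketched in userspace roughly as
follows. This is only an illustration: struct toy_page and the toy_*
helpers are hypothetical, bit 2 is the zeroed bit proposed above, and
bit 1 is assumed to be the one page_is_pfmemalloc() tests today.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PFMEMALLOC	((uintptr_t)1 << 1)	/* assumed existing bit */
#define TOY_ZEROED	((uintptr_t)1 << 2)	/* bit 2, as proposed above */

/* Toy stand-in for struct page; lru_next models page->lru.next. */
struct toy_page {
	void *lru_next;
};

/* Make sure no stale pointer bits can be mistaken for hints. */
static void toy_clear_hints(struct toy_page *p)
{
	p->lru_next = NULL;
}

/* OR the bit in, so setting one hint does not wipe the other. */
static void toy_set_hint(struct toy_page *p, uintptr_t bit)
{
	p->lru_next = (void *)((uintptr_t)p->lru_next | bit);
}

static bool toy_test_hint(const struct toy_page *p, uintptr_t bit)
{
	return ((uintptr_t)p->lru_next & bit) != 0;
}
```

The point of clearing first is that lru_next may hold a real list
pointer from the free list, and any of its bits could alias a hint.
After the clear, each hint can be set and tested independently, and a
caller that does not care simply overwrites the field.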

* Re: [PATCH v1] vsock/virtio: rework MSG_ZEROCOPY flag handling
From: Jakub Kicinski @ 2026-06-09  2:08 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Paolo Abeni, Michael S. Tsirkin, Jason Wang,
	Bobby Eshleman, Xuan Zhuo, Eugenio Pérez, Simon Horman, kvm,
	virtualization, netdev, linux-kernel, oxffffaa, rulkc
In-Reply-To: <20260605115314.552321-1-avkrasnov@rulkc.org>

On Fri,  5 Jun 2026 14:53:14 +0300 Arseniy Krasnov wrote:
> Logically it was based on TCP implementation, so to make further
> support easier, rewrite it in the TCP way.

Does not apply:

$ git pw series apply 1106582
Failed to apply patch:
Applying: vsock/virtio: rework MSG_ZEROCOPY flag handling
error: sha1 information is lacking or useless (net/vmw_vsock/virtio_transport_common.c).
error: could not build fake ancestor
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Patch failed at 0001 vsock/virtio: rework MSG_ZEROCOPY flag handling
-- 
pw-bot: cr

^ permalink raw reply

* [PATCH net] vsock/virtio: restore msg_iter on transmission failure
From: Octavian Purdila @ 2026-06-09  0:48 UTC (permalink / raw)
  To: netdev
  Cc: Octavian Purdila, syzbot+28e5f3d207b14bae122a, Stefan Hajnoczi,
	Stefano Garzarella, Michael S. Tsirkin, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Arseniy Krasnov, kvm, virtualization,
	linux-kernel

When transmission fails in virtio_transport_send_pkt_info, the msg_iter
might have been partially advanced. If we don't restore it, the next
attempt to send data will use an incorrect iterator state, leading to
desync and warnings like "send_pkt() returns 0, but X expected".

Specifically, this can happen in the following scenario, triggered by
the syzkaller repro:

1. A write-only VMA (PROT_WRITE only) is partially populated by a
   prior TUN write that failed with -EIO but still faulted in some
   pages.
2. A vsock sendmmsg call with MSG_ZEROCOPY requests transmission of a
   buffer from this VMA.
3. The first packet (64KB) is sent successfully because the pages are
   populated.
4. The second packet allocation fails because GUP fast pins the first page
   but GUP slow fails on the next unpopulated page due to PROT_WRITE-only
   permissions.
5. The iterator is advanced by the partially successful GUP (68KB total
   advanced: 64KB from first packet + 4KB from second), but the send loop
   breaks and only reports 64KB sent. This creates a 4KB desync.
6. The next retry starts with a non-zero iov_offset, disabling zerocopy
   and falling back to copy mode.
7. In copy mode, the transmission succeeds for the next packets but
   exhausts the iterator early because of the desync.
8. The final retry sees an empty iterator but zerocopy is re-enabled
   (offset resets). It attempts to send the remaining bytes with zerocopy
   but pins 0 pages, creating an empty packet.
9. The transport sends the empty packet, triggering the warning because
   the returned bytes (header only) do not match the expected payload size.
10. The loop continues to spin, allocating ubuf_info each time, eventually
    exhausting sysctl_optmem_max and returning -ENOMEM to userspace.

Restore msg_iter to its original state before the packet allocation
and transmission attempt if they fail.
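
The 4KB desync from step 5 and the effect of the restore can be modeled in
userspace with a toy iterator. This is only a sketch: struct model_iter,
model_build_packet() and model_send() are hypothetical names, and the
"fault_at" parameter merely imitates a GUP failure partway through a packet.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified iterator: tracks only how much was consumed. */
struct model_iter {
	size_t offset;	/* bytes consumed so far */
	size_t count;	/* bytes remaining */
};

/*
 * Hypothetical packet builder. It consumes 'len' bytes, but if that would
 * cross 'fault_at' it advances only up to 'fault_at' and then fails,
 * imitating a partially successful page pin. Returns 0 or -1.
 */
static int model_build_packet(struct model_iter *it, size_t len,
			      size_t fault_at)
{
	if (it->offset + len > fault_at) {
		size_t partial = fault_at > it->offset ?
				 fault_at - it->offset : 0;

		it->offset += partial;
		it->count -= partial;
		return -1;
	}
	it->offset += len;
	it->count -= len;
	return 0;
}

/* Send loop; 'restore' selects the behavior with the snapshot applied. */
static size_t model_send(struct model_iter *it, size_t max_pkt,
			 size_t fault_at, int restore)
{
	size_t sent = 0;

	while (it->count > 0) {
		struct model_iter saved = *it;	/* snapshot per attempt */
		size_t len = it->count < max_pkt ? it->count : max_pkt;

		if (model_build_packet(it, len, fault_at) < 0) {
			if (restore)
				*it = saved;	/* undo the partial advance */
			break;
		}
		sent += len;
	}
	return sent;
}
```

With a 128KB buffer, 64KB packets and a fault at 68KB, the loop reports 64KB
sent in both modes. Without the restore the iterator is left at 68KB; with it,
the iterator is left at 64KB, matching what was reported.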

Fixes: e0718bd82e27 ("vsock: enable setting SO_ZEROCOPY")
Reported-by: syzbot+28e5f3d207b14bae122a@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=28e5f3d207b14bae122a
Assisted-by: gemini:gemini-3.1-pro
Signed-off-by: Octavian Purdila <tavip@google.com>
---
 net/vmw_vsock/virtio_transport_common.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index b10666937c490..588623a3e2bbc 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -367,6 +367,10 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
 	do {
 		struct sk_buff *skb;
 		size_t skb_len;
+		struct iov_iter saved_iter;
+
+		if (info->msg)
+			saved_iter = info->msg->msg_iter;
 
 		skb_len = min(max_skb_len, rest_len);
 
@@ -375,6 +379,8 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
 						 src_cid, src_port,
 						 dst_cid, dst_port);
 		if (!skb) {
+			if (info->msg)
+				info->msg->msg_iter = saved_iter;
 			ret = -ENOMEM;
 			break;
 		}
@@ -382,8 +388,11 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
 		virtio_transport_inc_tx_pkt(vvs, skb);
 
 		ret = t_ops->send_pkt(skb, info->net);
-		if (ret < 0)
+		if (ret < 0) {
+			if (info->msg)
+				info->msg->msg_iter = saved_iter;
 			break;
+		}
 
 		/* Both virtio and vhost 'send_pkt()' returns 'skb_len',
 		 * but for reliability use 'ret' instead of 'skb_len'.
-- 
2.54.0.1064.gd145956f57-goog


^ permalink raw reply related

* Re: [PATCH v4 10/47] x86/tsc: Consolidate forcing of X86_FEATURE_TSC_KNOWN_FREQ for PV code
From: Sean Christopherson @ 2026-06-08 22:38 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Thomas Gleixner, Paolo Bonzini, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
	Alexey Makhalov, Jan Kiszka, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, John Stultz, H. Peter Anvin,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, Tom Lendacky, Nikunj A Dadhania,
	Michael Kelley
In-Reply-To: <eef867eae15e30d08482ba16a1a32159745b64a7.camel@infradead.org>

On Sat, Jun 06, 2026, David Woodhouse wrote:
> On Sat, 2026-06-06 at 12:34 +0200, Thomas Gleixner wrote:
> > On Fri, May 29 2026 at 07:43, Sean Christopherson wrote:
> > 
> > > Now that all paravirt code that explicitly specifies the TSC frequency
> > > also sets X86_FEATURE_TSC_KNOWN_FREQ, replace all of the one-off code
> > > and simply set X86_FEATURE_TSC_KNOWN_FREQ if the TSC frequency is known.
> > > 
> > > Do NOT force set TSC_KNOWN_FREQ if the "known" TSC frequency was provided
> > > by the user.  Per commit bd35c77e32e4 ("x86/tsc: Add tsc_early_khz command
> > > line parameter"), one of the goals of the param is to allow the refined
> > > calibration work "to do meaningful error checking".
> > > 
> > > Note, preferring the user-provided TSC frequency over the frequency from
> > > the hypervisor or trusted firmware, while simultaneously not treating the
> > > user-provided frequency as gospel, is obviously incongruous.  Sweep the
> > > problem under the rug for now to avoid opening a big can of worms that
> > > likely doesn't have a great answer.
> > 
> > There is a good answer I think.
> > 
> > early_tsc_khz exists to cater for the overclocking crowd. On their
> > modded systems the firmware supplied TSC frequency (CPUID/MSR) is not
> > matching reality anymore. So they work around that by supplying a close
> > enough tsc_early_khz and then they let the refined calibration work
> > figure it out.
> > 
> > Arguably that's only relevant for bare metal systems and what's worse is
> > that in virtual environments the refined calibration work can fail,
> > which renders the TSC unstable.
> > 
> > So I'd rather say we change this logic to:
> > 
> >    if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
> >       tsc_khz = x86_init.....();
> >       force(X86_FEATURE_TSC_KNOWN_FREQ);
> >    } else if (tsc_khz_early) {
> >       ....
> >    } else {
> >       ...
> >    }
> > 
> > Along with:
> > 
> >    if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
> >       if (tsc_khz_early)
> >          pr_warn("Ignoring non-sensical tsc_early_khz command line argument\n");
> > 
> > or something daft like that.

Ya, I ended up in the same place once Sashiko pointed out that skipping the SNP/TDX
setup was hazardous[*], and also once I realized that tsc_khz_early *complemented*
the refinement instead of replacing it.

This is what I have locally:

        if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
                known_tsc_khz = snp_secure_tsc_init();
        else if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
                known_tsc_khz = tdx_tsc_init();

        /*
         * If the TSC frequency wasn't provided by trusted firmware, try to get
         * it from the hypervisor (which is untrusted when running as a CoCo guest).
         */
        if (!known_tsc_khz && x86_init.hyper.get_tsc_khz)
                known_tsc_khz = x86_init.hyper.get_tsc_khz();

        /*
         * Mark the TSC frequency as known if it was obtained from a hypervisor
         * or trusted firmware.  Don't mark the frequency as known if the user
         * specified the frequency, as the user-provided frequency is intended
         * as a "starting point", not a known, guaranteed frequency.
         */
        if (known_tsc_khz && !tsc_early_khz)
                setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);

        /*
         * Ignore the user-provided TSC frequency if the exact frequency was
         * obtained from trusted firmware or the hypervisor, as the user-
         * provided frequency is intended as a "starting point", not a known,
         * guaranteed frequency.
         */
        if (!known_tsc_khz)
                known_tsc_khz = tsc_early_khz;
        else if (tsc_early_khz)
                pr_err("Ignoring 'tsc_early_khz' in favor of firmware/hypervisor.\n");

[*] https://lore.kernel.org/all/ahnF-FehodVd474X@google.com
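
For readers who want to poke at the ordering, here is a userspace model of the
selection logic in the snippet above. It is a sketch only: struct
model_tsc_choice and model_pick_tsc_khz() are hypothetical names, and the three
inputs stand in for the trusted firmware value, the hypervisor value and
tsc_early_khz.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result type for the model. */
struct model_tsc_choice {
	unsigned long khz;		/* frequency that would be used */
	bool known_freq;		/* would TSC_KNOWN_FREQ be forced? */
	bool user_value_ignored;	/* would the "Ignoring" message fire? */
};

static struct model_tsc_choice model_pick_tsc_khz(unsigned long firmware_khz,
						  unsigned long hyper_khz,
						  unsigned long user_khz)
{
	struct model_tsc_choice c = { 0, false, false };
	unsigned long known = firmware_khz;

	/* Fall back to the hypervisor only if firmware gave nothing. */
	if (!known)
		known = hyper_khz;

	/* Mirrors the snippet: forced only when no user value was given. */
	if (known && !user_khz)
		c.known_freq = true;

	/* The user value is a last resort; otherwise it is ignored. */
	if (!known)
		known = user_khz;
	else if (user_khz)
		c.user_value_ignored = true;

	c.khz = known;
	return c;
}
```

One property the model makes visible: when both a firmware/hypervisor value
and a user value are present, the user value is ignored, yet the
known-frequency flag is still not forced.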

> > The kernel has for various reasons always tried to cater for the needs
> > of users who are plagued by bonkers firmware, but we have to stop to
> > prioritize or treating equal ancient and modded out of spec hardware.
> > 
> > TBH, I consider that whole KVM clock nonsense to fall into the modded
> > out of spec hardware realm. Do a reality check:
> > 
> >    How many production systems are out there still which run VMs on CPUs
> >    with a broken TSC and the lack of VM TSC scaling?
> > 
> > I'm not saying that we should not support the few remaining systems
> > anymore, but our tendency to pretend that we can keep all of this
> > nonsense working and at the same time making progress is just a fallacy.

FWIW, I have the exact same sentiments about kvmclock, but I'm also trying my
best not to break folks that are happily running on what is effectively flawed,
ancient "hardware".

> I don't know that we can take the KVM (and Xen) clock away from guests,
> but all of the *horrid* part about it is the way it attempts to cope
> with the possibility that the *host* timekeeping might flip away from
> TSC-based mode at any point in time. By the end of my outstanding
> cleanup series, that is the *only* thing the gtod_notifier remains for.
> 
> If we can trust the hardware *and* the host kernel, then KVM could
> theoretically hardwire the kvmclock into 'master clock mode' where it
> basically just advertises the TSC→kvmclock relationship *once* to all
> CPUs and it never changes.
> 
> All the nonsense about updating it every time we enter a CPU could just
> go away completely.

But to Thomas' point, why bother?  For actual old hardware, kvmclock is what it
is.  For modern hardware, it's completely antiquated.

^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Gregory Price @ 2026-06-08 22:28 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Zi Yan, Michael S. Tsirkin, Matthew Wilcox, Lorenzo Stoakes,
	linux-kernel, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Muchun Song, Oscar Salvador, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <afb14c99-162c-4c36-b1ef-fe971678f6b9@kernel.org>

On Mon, Jun 08, 2026 at 11:51:47PM +0200, David Hildenbrand (Arm) wrote:
> On 6/8/26 23:16, Zi Yan wrote:
> 
> There was Willy's comment in RFC v3 [1], which had 19 patches. Unfortunately, he
> no longer followed up to my initial push back and Michael's question later [2].
> 
> That would have probably been the right time to wait for more discussion.
> 
> RFC v4 had 22 patches with little replies.
> v5 had 28 patches with little replies.
> v6 had 30 patches with no replies.
> v7 had 31 patches with little replies.
> v8 had 37 patches with no replies.
> 
> [1] https://lore.kernel.org/lkml/aeu5P1bZW3yEH54t@casper.infradead.org/
> [2] https://lore.kernel.org/lkml/20260426165330-mutt-send-email-mst@kernel.org/
>

Hm, rewinding on this back to v3 here:
https://lore.kernel.org/lkml/016cc5e5-044c-46c6-a668-200f90a64d85@kernel.org/

You said:

  ```
  Exactly, that's why I am saying that vma_alloc_folio() is the only
  external interface people should be using with a user address.
  ```

Going through the list of folio_zero_user references:

Called unconditionally if a folio is acquired:
   fs/hugetlbfs/inode.c:   folio_zero_user(folio, addr);
   mm/hugetlb.c:           folio_zero_user(folio, vmf->real_address);
   mm/memfd.c:             folio_zero_user(folio, 0);

Called when user_alloc_needs_zeroing() and charging passes:
   mm/huge_memory.c:       folio_zero_user(folio, addr);
   mm/memory.c:            folio_zero_user(folio, vmf->address);

No one outside mm/ should know about this interface at all.
Arguably none of these should know about this interface either.

The appropriate place for this logic appears to be:
    vma_alloc_folio
    alloc_hugetlb_folio
    alloc_hugetlb_folio_reserve

The reason to sink it into the post_alloc_hook is to let the buddy
decide whether the page actually needs to be zeroed (like the virtio
situation) based on PG_zeroed or whatever.

It seems like at a minimum moving the logic all the way into
post_alloc_hook lets us actually delete folio_zero_user() as a published
interface and move it entirely within page_alloc.c.

The catch is user_alloc_needs_zeroing() coming along with it.
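
To make that concrete, a toy userspace model of "let the allocator decide" is
sketched below. It is not the kernel's post_alloc_hook: struct
model_buddy_page, its "zeroed" field and model_post_alloc_hook() are all
hypothetical, and user_alloc_needs_zeroing() is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Hypothetical page: payload plus a hint that it is already zero-filled. */
struct model_buddy_page {
	unsigned char data[MODEL_PAGE_SIZE];
	bool zeroed;	/* stands in for PG_zeroed or a similar hint */
};

/*
 * Hypothetical post-allocation hook. The caller only says whether it wants
 * a zeroed page; the hook decides whether any work is needed.
 * Returns true if it actually had to zero the page.
 */
static bool model_post_alloc_hook(struct model_buddy_page *page,
				  bool want_zero)
{
	bool did_zero = false;

	if (want_zero && !page->zeroed) {
		memset(page->data, 0, sizeof(page->data));
		did_zero = true;
	}

	/* The hint is consumed here: the new owner may dirty the page. */
	page->zeroed = false;
	return did_zero;
}
```

In this model the caller never zeroes anything itself, which is the property
that would let the zeroing helper stop being a published interface.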

~Gregory

^ permalink raw reply

* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
From: Gregory Price @ 2026-06-08 21:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <20260608174430-mutt-send-email-mst@kernel.org>

On Mon, Jun 08, 2026 at 05:46:27PM -0400, Michael S. Tsirkin wrote:
> On Mon, Jun 08, 2026 at 05:33:50PM -0400, Gregory Price wrote:
> > 
> > You'd save yourself some revisions by taking the attention you have
> > right now and starting the discussion thread (and consider submitting
> > the topic to LPC if that's something interests you!).
> 
> Well it's in october, is it not? I don't think I have the patience to
> keep fiddling with that for half a year.
> 

You might be able to find a way forward that doesn't take that long, but
that starts with trying to build consensus on what to build before you
build it.

You're proposing a non-trivial change to the page allocator API, I would
not expect this to move at the speed of claude.

~Gregory

^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: David Hildenbrand (Arm) @ 2026-06-08 21:51 UTC (permalink / raw)
  To: Zi Yan, Michael S. Tsirkin
  Cc: Gregory Price, Matthew Wilcox, Lorenzo Stoakes, linux-kernel,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple,
	Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, virtualization, linux-mm,
	Andrea Arcangeli
In-Reply-To: <DD2E08E8-1FD7-43E3-A2C8-B4435D3773D1@nvidia.com>

On 6/8/26 23:16, Zi Yan wrote:
> On 8 Jun 2026, at 17:04, Michael S. Tsirkin wrote:
> 
>> On Mon, Jun 08, 2026 at 04:40:15PM -0400, Zi Yan wrote:
>>>
>>>
>>> Change user_alloc_needs_zeroing() to only check address aliasing even
>>> if that can cause double zeroing for virtio.
>>>
>>> Best Regards,
>>> Yan, Zi
>>
>> Ah. I started with exactly that in v1/v2. It's a simple approach.

Simple, and hacky -> unmergeable. I tried to push it into a different (no GFP
flags -> IMHO better) direction, but the patch set grew in complexity.

I kept saying to keep it simple (e.g., no folio_put optimization, no hugetlb
optimization, simple wrapper functions), and ideally we would have gotten a
better discussion with other folks here much earlier.

And I still do not consider providing a user address to selected interfaces
while centralizing zeroing a bad idea. The real question is how that could be
done in a cleaner way.

Or, as Willy said, we could move zeroing further out to callers, where they
can special-case it. But KASAN and friends interact with zeroing in their own
way, which makes that less straightforward than people might think.

>>
>> But mm maintainers said no, user_alloc_needs_zeroing is a hack and
>> I must not add to it.

I mean, I would hope that we can agree that our existing page/folio zeroing is a
mess and should not be extended by slapping more special casing on top?

Sure, we can try cleaning it up, but conceptually, zeroing happening at two
places in the callchain, with random optimizations to avoid double-zeroing is
just bad.

The fact that a vma_alloc_zeroed_movable_folio() that can be overridden by
architectures even exists makes me angry. user_alloc_needs_zeroing() is just the
tip of the ugly iceberg.

> 
> Got it. It sounds that you now get conflicting ideas. Maybe you should
> start a [DISCUSSION] thread that presents the high level idea of what
> you want to achieve and all the ideas you got from the reviews, so that
> people in this thread can have the big picture and come up a consensus
> before you send another version.
> 
> Thank you for patiently replying my comments, since those points
> apparently have been discussed in prior submissions.
> 

There was Willy's comment in RFC v3 [1], which had 19 patches. Unfortunately, he
did not follow up on my initial pushback or on Michael's later question [2].

That would have probably been the right time to wait for more discussion.

RFC v4 had 22 patches with few replies.
v5 had 28 patches with few replies.
v6 had 30 patches with no replies.
v7 had 31 patches with few replies.
v8 had 37 patches with no replies.

[1] https://lore.kernel.org/lkml/aeu5P1bZW3yEH54t@casper.infradead.org/
[2] https://lore.kernel.org/lkml/20260426165330-mutt-send-email-mst@kernel.org/

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
From: Michael S. Tsirkin @ 2026-06-08 21:46 UTC (permalink / raw)
  To: Gregory Price
  Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aic1PoxSiHzZ40Jr@gourry-fedora-PF4VCD3F>

On Mon, Jun 08, 2026 at 05:33:50PM -0400, Gregory Price wrote:
> On Mon, Jun 08, 2026 at 05:16:53PM -0400, Michael S. Tsirkin wrote:
> > On Mon, Jun 08, 2026 at 04:53:14PM -0400, Gregory Price wrote:
> > > 
> > > As a start:
> > > 
> > >   1) the user_addr and zeroing piece seems like a discrete
> > >      improvement worthy of its own set - aside from end goal.
> > > 
> > >      This is needed by your patch set, but was requested to
> > >      try to push us towards a more reasonable pattern for
> > >      folio_zero_user().
> > 
> > What I worry about is people can't agree what api they want.
> >
> 
> Oh that's just our base state of existence.  We mostly agree that
> all APIs are bad in some way and we don't want any of them :P
> 
> What you're looking for is to get people to agree to the
> least-offensive, least-worst option :]
> 
> I don't think we're far off from that.  I suggest doing as Zi said and
> start a [DISCUSSION] thread on specifically this and lay out the needs
> and wants and design issues that you've learned from the past set of
> versions and continue the discussion there.
> 
> It helps to take some snippets from your set to lay out what you've
> learned and explain why you need the folio_user_zero() stuff to get from
> A->Z, and then let maintainers hash out whether that should live in
> post_alloc_hook or new interfaces (or outside page_alloc.c altogether).
> 
> > I don't mind trying all kind of approaches, but it seems to
> > be past the point where people feel it's costing too much of
> > their time with all of these revisions.
> > 
> 
> People are still commenting, so I don't think you've gotten there yet.
> I think the rate of revision is what's costing too much attention.
> 
> You'd save yourself some revisions by taking the attention you have
> right now and starting the discussion thread (and consider submitting
> the topic to LPC if that's something interests you!).

Well, it's in October, is it not? I don't think I have the patience to
keep fiddling with that for half a year.

> All this is to say you're doing fine, just keep on keepin' on. Maybe
> pivot your approach from iterations to discussion for a bit until the
> opinions settle.
> 
> ~Gregory


^ permalink raw reply

* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
From: Gregory Price @ 2026-06-08 21:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <20260608170646-mutt-send-email-mst@kernel.org>

On Mon, Jun 08, 2026 at 05:16:53PM -0400, Michael S. Tsirkin wrote:
> On Mon, Jun 08, 2026 at 04:53:14PM -0400, Gregory Price wrote:
> > 
> > As a start:
> > 
> >   1) the user_addr and zeroing piece seems like a discrete
> >      improvement worthy of its own set - aside from end goal.
> > 
> >      This is needed by your patch set, but was requested to
> >      try to push us towards a more reasonable pattern for
> >      folio_zero_user().
> 
> What I worry about is people can't agree what api they want.
>

Oh that's just our base state of existence.  We mostly agree that
all APIs are bad in some way and we don't want any of them :P

What you're looking for is to get people to agree to the
least-offensive, least-worst option :]

I don't think we're far off from that.  I suggest doing as Zi said and
start a [DISCUSSION] thread on specifically this and lay out the needs
and wants and design issues that you've learned from the past set of
versions and continue the discussion there.

It helps to take some snippets from your set to lay out what you've
learned and explain why you need the folio_zero_user() stuff to get from
A->Z, and then let maintainers hash out whether that should live in
post_alloc_hook or new interfaces (or outside page_alloc.c altogether).

> I don't mind trying all kind of approaches, but it seems to
> be past the point where people feel it's costing too much of
> their time with all of these revisions.
> 

People are still commenting, so I don't think you've gotten there yet.
I think the rate of revision is what's costing too much attention.

You'd save yourself some revisions by taking the attention you have
right now and starting the discussion thread (and consider submitting
the topic to LPC if that's something that interests you!).

All this is to say you're doing fine, just keep on keepin' on. Maybe
pivot your approach from iterations to discussion for a bit until the
opinions settle.

~Gregory

^ permalink raw reply
