Commit graph

1369 commits

Author SHA1 Message Date
Linus Torvalds
32a92f8c89 Convert more 'alloc_obj' cases to default GFP_KERNEL arguments
This converts some of the visually simpler cases that have been split
over multiple lines.  I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.

Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script.  I probably had made it a bit _too_ trivial.

So after fighting that far a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.

The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 20:03:00 -08:00
Linus Torvalds
bf4afc53b7 Convert 'alloc_obj' family to use the new default GFP_KERNEL argument
This was done entirely with mindless brute force, using

    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Linus Torvalds
cebcffe666 VFIO updates for v7.0-rc1
- Update outdated mdev comment referencing the renamed
    mdev_type_add() function. (Julia Lawall)
 
  - Introduce selftest support for IOMMU mapping of PCI MMIO BARs.
    (Alex Mastro)
 
  - Relax selftest assertion relative to differences in huge page
    handling between legacy (v1) TYPE1 IOMMU mapping behavior and
    the compatibility mode supported by IOMMUFD. (David Matlack)
 
  - Reintroduce memory poison handling support for non-struct-page-
    backed memory in the nvgrace-gpu variant driver. (Ankit Agrawal)
 
  - Replace dma_buf_phys_vec with phys_vec to avoid duplicate
    structure and semantics. (Leon Romanovsky)
 
  - Add missing upstream bridge locking across PCI function reset,
    resolving an assertion failure when secondary bus reset is used
    to provide that reset. (Anthony Pighin)
 
  - Fixes to hisi_acc vfio-pci variant driver to resolve corner case
    issues related to resets, repeated migration, and error injection
    scenarios. (Longfang Liu, Weili Qian)
 
  - Restrict vfio selftest builds to arm64 and x86_64, resolving
    compiler warnings on 32-bit archs. (Ted Logan)
 
  - Un-deprecate the fsl-mc vfio bus driver as a new maintainer has
    stepped up. (Ioana Ciornei)
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmmNCcMRHGFsZXhAc2hh
 emJvdC5vcmcACgkQI5ubbjuwiyLlvw/9FLOcpjKCcxyWFPGUMHV9L0N8dWMR5t75
 Pu6cBuYdpqGgrUaa1NWHYEzFbMSkEJMb5jLj26lokn2l4VZ9BKwdehaE/7t978z2
 J0FgnGUg3B4lYm5qoBStaJ26123XafTMnsBn+wKdXt/lN6ng6GXVBxnmGP+Fuuwd
 HA3MSFB6HUFw4et8qDG3ziyboN/pSWyXaupy60zvVy9x39i4/ZzMm3PSrYPdUX4x
 aPM+lWKRi5yFMwiksZyYb67XA717Js8xhmgNMeJ8Yz3ZUF0n3Z7ZpOzbU+hl8LNn
 sAea6+lXXsvNjEXfet1mjg7A+RYmuQdcjk58J//ijRXn7zRijRM671Bzc40T2JcP
 bfrajHhprMsE+u7VwiBuERACTtbemuaKSbi5iNLHAIqTFwPpb400PvbptkyQhkxh
 IRXIxqgKb5G6/sd73m9dKR9HU7d5SL3mNCARrymgqT6kRxz8fqtaVsXbbsa1Tgah
 iV8in7wjKJ/80rYQd7gNyj/RRpYTAJJemfnJtKGQ9LxGnej8AV6kUZ3np7hpspz7
 TVtmn9RxlwbA5lWYXJ4VUzt9u2Riwd2W6jg6ZnUknSZN6B5j2Jd2bDtF/FKLauKG
 DW/bN8UU7nzgC40ro92qJEFF2PC7GkfZUVRlgW0oq54QZjyCoAIpfYOXjLTSteYP
 umnjcrWkgag=
 =F+FV
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v7.0-rc1' of https://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:
 "A small cycle with the bulk in selftests and reintroducing poison
  handling in the nvgrace-gpu driver. The rest are fixes, cleanups, and
  some dmabuf structure consolidation.

   - Update outdated mdev comment referencing the renamed
     mdev_type_add() function (Julia Lawall)

   - Introduce selftest support for IOMMU mapping of PCI MMIO BARs (Alex
     Mastro)

   - Relax selftest assertion relative to differences in huge page
     handling between legacy (v1) TYPE1 IOMMU mapping behavior and the
     compatibility mode supported by IOMMUFD (David Matlack)

   - Reintroduce memory poison handling support for non-struct-page-
     backed memory in the nvgrace-gpu variant driver (Ankit Agrawal)

   - Replace dma_buf_phys_vec with phys_vec to avoid duplicate structure
     and semantics (Leon Romanovsky)

   - Add missing upstream bridge locking across PCI function reset,
     resolving an assertion failure when secondary bus reset is used to
     provide that reset (Anthony Pighin)

   - Fixes to hisi_acc vfio-pci variant driver to resolve corner case
     issues related to resets, repeated migration, and error injection
     scenarios (Longfang Liu, Weili Qian)

   - Restrict vfio selftest builds to arm64 and x86_64, resolving
     compiler warnings on 32-bit archs (Ted Logan)

   - Un-deprecate the fsl-mc vfio bus driver as a new maintainer has
     stepped up (Ioana Ciornei)"

* tag 'vfio-v7.0-rc1' of https://github.com/awilliam/linux-vfio:
  vfio/fsl-mc: add myself as maintainer
  vfio: selftests: only build tests on arm64 and x86_64
  hisi_acc_vfio_pci: fix the queue parameter anomaly issue
  hisi_acc_vfio_pci: resolve duplicate migration states
  hisi_acc_vfio_pci: update status after RAS error
  hisi_acc_vfio_pci: fix VF reset timeout issue
  vfio/pci: Lock upstream bridge for vfio_pci_core_disable()
  types: reuse common phys_vec type instead of DMABUF open‑coded variant
  vfio/nvgrace-gpu: register device memory for poison handling
  mm: add stubs for PFNMAP memory failure registration functions
  vfio: selftests: Drop IOMMU mapping size assertions for VFIO_TYPE1_IOMMU
  vfio: selftests: Add vfio_dma_mapping_mmio_test
  vfio: selftests: Align BAR mmaps for efficient IOMMU mapping
  vfio: selftests: Centralize IOMMU mode name definitions
  vfio/mdev: update outdated comment
2026-02-12 15:52:39 -08:00
Ioana Ciornei
96ca4caf90 vfio/fsl-mc: add myself as maintainer
Add myself as maintainer of the vfio/fsl-mc driver. The driver is still
highly in use on Layerscape DPAA2 SoCs.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Link: https://lore.kernel.org/r/20260204100913.3197966-1-ioana.ciornei@nxp.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2026-02-06 15:08:06 -07:00
Longfang Liu
c3cbc276c2 hisi_acc_vfio_pci: fix the queue parameter anomaly issue
When the number of QPs initialized by the device, as read via vft, is zero,
it indicates either an abnormal device configuration or an abnormal read
result.
Returning 0 directly in this case would allow the live migration operation
to complete successfully, leading to incorrect parameter configuration after
migration and preventing the service from recovering normal functionality.
Therefore, in such situations, an error should be returned to roll back the
live migration operation.

Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Link: https://lore.kernel.org/r/20260122020205.2884497-5-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2026-01-29 14:11:00 -07:00
Longfang Liu
8c6ac1730a hisi_acc_vfio_pci: resolve duplicate migration states
In special scenarios involving duplicate migrations, after the
first migration is completed, if the original VF device is used
again and then migrated to another destination, the state indicating
data migration completion for the VF device is not reset.
This results in the second migration to the destination being skipped
without performing data migration.
After the modification, it ensures that a complete data migration
is performed after the subsequent migration.

Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Link: https://lore.kernel.org/r/20260122020205.2884497-4-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2026-01-29 14:11:00 -07:00
Longfang Liu
8be14dd48d hisi_acc_vfio_pci: update status after RAS error
After a RAS error occurs on the accelerator device, the accelerator
device will be reset. The live migration state will be abnormal
after reset, and the original state needs to be restored during
the reset process.
Therefore, reset processing needs to be performed in a live
migration scenario.

Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Link: https://lore.kernel.org/r/20260122020205.2884497-3-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2026-01-29 14:11:00 -07:00
Weili Qian
a22099ed79 hisi_acc_vfio_pci: fix VF reset timeout issue
If device error occurs during live migration, qemu will
reset the VF. At this time, VF reset and device reset are performed
simultaneously. The VF reset will timeout. Therefore, the QM_RESETTING
flag is used to ensure that VF reset and device reset are performed
serially.

Fixes: b0eed08590 ("hisi_acc_vfio_pci: Add support for VFIO live migration")
Signed-off-by: Weili Qian <qianweili@huawei.com>
Link: https://lore.kernel.org/r/20260122020205.2884497-2-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2026-01-29 14:11:00 -07:00
Leon Romanovsky
61ceaf2361 vfio: Prevent from pinned DMABUF importers to attach to VFIO DMABUF
Some pinned importers, such as non-ODP RDMA ones, cannot invalidate their
mappings and therefore must be prevented from attaching to this exporter.

Fixes: 5d74781ebc ("vfio/pci: Add dma-buf export support for MMIO regions")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20260121-vfio-add-pin-v1-1-4e04916b17f1@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2026-01-23 08:47:48 -07:00
Anthony Pighin (Nokia)
962ae6892d vfio/pci: Lock upstream bridge for vfio_pci_core_disable()
The commit 7e89efc6e9 ("Lock upstream bridge for pci_reset_function()")
added locking of the upstream bridge to the reset function. To catch
paths that are not properly locked, the commit 920f646892 ("Warn on
missing cfg_access_lock during secondary bus reset") added a warning
if the PCI configuration space was not locked during a secondary bus reset
request.

When a VFIO PCI device is released from userspace ownership, an attempt
to reset the PCI device function may be made. If so, and the upstream bridge
is not locked, the release request results in a warning:

   pcieport 0000:00:00.0: unlocked secondary bus reset via:
   pci_reset_bus_function+0x188/0x1b8

Add missing upstream bridge locking to vfio_pci_core_disable().

Fixes: 7e89efc6e9 ("PCI: Lock upstream bridge for pci_reset_function()")
Signed-off-by: Anthony Pighin <anthony.pighin@nokia.com>
Link: https://lore.kernel.org/r/BN0PR08MB695171D3AB759C65B6438B5D838DA@BN0PR08MB6951.namprd08.prod.outlook.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2026-01-19 10:52:09 -07:00
Alex Williamson
fab06e956f * Reuse common phys_vec, phase out dma_buf_phys_vec
Signed-off-by: Alex Williamson <alex@shazbot.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmluaLsRHGFsZXhAc2hh
 emJvdC5vcmcACgkQI5ubbjuwiyIZow/+N/BbZYkRub0lFbufNQ/+M/N80kRzfjjQ
 dE2/AN+YDfLdFxNvmq/P1fYI2B/Oc8msuGwT+C+eJMSVMzZIdCDvnhxH+aPH9roo
 5zmE7fxLjWp/k1lOwb8TupOIR3Jl334AiC/t9PvqraIUMGvt3Uv+/7KoCx/xDxb7
 ID6NbDKiXCZpvlQN1dukf8TCzDVFJnWOOMKLiDnez14okfd5rIqiFpOWMjtAhwlY
 VrwIy/eVqX6YktniwunqU3f4/BeurlHdCS29LdDnHdZzL6HER5onvMwIO8qMeuFZ
 yS8Bf8KJTYtqEedMtFUi5a/ipYu4vuK0KEqIE6USKZLuhQCqLVjFDKs5zUGRQ48X
 qLs59BBmP1WgOnM63OGXzBAAvelLNoh/D5KVzzXmQyNkBn6mFy1MWeR38Ozgl0FA
 +GjK+iwV/GRo+CgDa6Vz+eVwvCV2RhcYlT4cK6BodIQbwd9SWAMEcRxI/IEvOxfC
 YY/1U2JRhOSaQb9j65xgwylEbwoi8BMVbFWE3DydYMr+9PVaOyTKLcJLKrYmhwLn
 cuPetgLaK3UtxdcfhnZyrwzpmtvA56SAReQYg9s+TXFGFurQjNlGVlcKk4dB45nX
 JcOtWHm/6+3D8qoN6FY8Vj5QPePn48urSw1R1/D0LP7951gxknILiQpI7aqEPHyU
 rjAZH6nH6bI=
 =c1Gn
 -----END PGP SIGNATURE-----

Merge tag 'common_phys_vec_via_vfio' into v6.20/vfio/next

 * Reuse common phys_vec, phase out dma_buf_phys_vec

Signed-off-by: Alex Williamson <alex@shazbot.org>
2026-01-19 10:25:24 -07:00
Leon Romanovsky
b703b31ea8 types: reuse common phys_vec type instead of DMABUF open‑coded variant
After commit fcf463b92a ("types: move phys_vec definition to common header"),
we can use the shared phys_vec type instead of the DMABUF‑specific
dma_buf_phys_vec, which duplicated the same structure and semantics.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20260107-convert-to-pvec-v1-1-6e3ab8079708@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2026-01-19 10:13:29 -07:00
Ankit Agrawal
e5f19b619f vfio/nvgrace-gpu: register device memory for poison handling
The nvgrace-gpu module [1] maps the device memory to the user VA (Qemu)
without adding the memory to the kernel. The device memory pages are PFNMAP
and not backed by struct page. The module can thus utilize the MM's PFNMAP
memory_failure mechanism that handles ECC/poison on regions with no struct
pages.

The kernel MM code exposes register/unregister APIs allowing modules to
register the device memory for memory_failure handling. Make nvgrace-gpu
register the GPU memory with the MM on open.

The module registers its memory region, the address_space with the
kernel MM for ECC handling and implements a callback function to convert
the PFN to the file page offset. The callback functions checks if the
PFN belongs to the device memory region and is also contained in the
VMA range, an error is returned otherwise.

Link: https://lore.kernel.org/all/20240220115055.23546-1-ankita@nvidia.com/ [1]

Suggested-by: Alex Williamson <alex@shazbot.org>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Reviewed-by: Jiaqi Yan <jiaqiyan@google.com>
Link: https://lore.kernel.org/r/20260115202849.2921-3-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2026-01-19 10:06:31 -07:00
Julia Lawall
ffc987b3bc vfio/mdev: update outdated comment
The function add_mdev_supported_type() was renamed mdev_type_add() in
commit da44c340c4 ("vfio/mdev: simplify mdev_type handling").
Update the comment accordingly.

Note that just as mdev_type_release() now states that its put pairs
with the get in mdev_type_add(), mdev_type_add() already stated that
its get pairs with the put in mdev_type_release().

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20251230164113.102604-1-Julia.Lawall@inria.fr
Signed-off-by: Alex Williamson <alex@shazbot.org>
2026-01-19 10:06:29 -07:00
Alper Ak
acf44a2361 vfio/xe: Fix use-after-free in xe_vfio_pci_alloc_file()
migf->filp is accessed after migf has been freed. Save the error
value before calling kfree() to prevent use-after-free.

Fixes: 1f5556ec8b ("vfio/xe: Add device specific vfio_pci driver variant for Intel graphics")
Signed-off-by: Alper Ak <alperyasinak1@gmail.com>
Link: https://lore.kernel.org/r/20251225151349.360870-1-alperyasinak1@gmail.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-12-28 12:42:46 -07:00
Zilin Guan
665077d78d vfio/pds: Fix memory leak in pds_vfio_dirty_enable()
pds_vfio_dirty_enable() allocates memory for region_info. If
interval_tree_iter_first() returns NULL, the function returns -EINVAL
immediately without freeing the allocated memory, causing a memory leak.

Fix this by jumping to the out_free_region_info label to ensure
region_info is freed.

Fixes: 2e7c6feb4e ("vfio/pds: Add multi-region support")
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Link: https://lore.kernel.org/r/20251225143150.1117366-1-zilin@seu.edu.cn
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-12-28 12:33:40 -07:00
Michal Wajdeczko
1e91505038 vfio/xe: Add default handler for .get_region_info_caps
New requirement for the vfio drivers was added by the commit
f978595038 ("vfio: Require drivers to implement get_region_info")
followed by commit 1b0ecb5baf ("vfio/pci: Convert all PCI drivers
to get_region_info_caps") that was missed by the new vfio/xe driver.

Add handler for .get_region_info_caps to avoid -EINVAL errors.

Fixes: 2e38c50ae4 ("vfio/xe: Add device specific vfio_pci driver variant for Intel graphics")
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Marcin Bernatowicz <marcin.bernatowicz@linux.intel.com>
Tested-by: Marcin Bernatowicz <marcin.bernatowicz@linux.intel.com>
Link: https://lore.kernel.org/r/20251218205106.4578-1-michal.wajdeczko@intel.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-12-23 14:13:36 -07:00
Kevin Tian
8bb808cea3 vfio/pci: Disable qword access to the VGA region
Seems no reason to allow qword access to the old VGA resource. Better
restrict it to dword access as before.

Suggested-by: Alex Williamson <alex@shazbot.org>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20251218081650.555015-3-kevin.tian@intel.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-12-23 14:07:08 -07:00
Kevin Tian
dc85a46928 vfio/pci: Disable qword access to the PCI ROM bar
Commit 2b938e3db3 ("vfio/pci: Enable iowrite64 and ioread64 for vfio
pci") enables qword access to the PCI bar resources. However certain
devices (e.g. Intel X710) are observed with problem upon qword accesses
to the rom bar, e.g. triggering PCI aer errors.

This is triggered by Qemu which caches the rom content by simply does a
pread() of the remaining size until it gets the full contents. The other
bars would only perform operations at the same access width as their
guest drivers.

Instead of trying to identify all broken devices, universally disable
qword access to the rom bar i.e. going back to the old way which worked
reliably for years.

Reported-by: Farrah Chen <farrah.chen@intel.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220740
Fixes: 2b938e3db3 ("vfio/pci: Enable iowrite64 and ioread64 for vfio pci")
Cc: stable@vger.kernel.org
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Link: https://lore.kernel.org/r/20251218081650.555015-2-kevin.tian@intel.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-12-23 14:07:08 -07:00
Linus Torvalds
deb879faa9 drm next part 2 for 6.19-rc1
vfio:
 - add a vfio_pci variant driver for Intel
 
 xe/i915 display:
 - add plane color management support
 
 xe:
 - Add scope-based cleanup helper for runtime PM
 - vfio xe driver prerequisites and exports
 - fix vfio link error
 - Fix a memory leak
 - Fix a 64-bit division
 - vf migration fix
 - LRC pause fix
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmkyMUgACgkQDHTzWXnE
 hr69pg/9EWjh7qVGk9ZIpYc9AW42UzWwOVBX/HWkuQvmfxUUBqtA3IuP0dGGmPUn
 QbtbetbRvlCaXwEoZpPh1nzrXA2AGFxgHErYMO5BfwquyBcfpwTWZ9T15ptceL/3
 aw2l63aH1R2/yxCRfHFIdwAmq1bThqdh5IkjjbE3im0V0lHT2Uo/jhmf/EWCNWol
 LlPgYxHpfBIzhtFYUcniaXxs9vOSk49AY+ObpPpuvks8OWoaaTcKYWlUCHr/X1ip
 OnWB4NGraTzx4l44vqdRvRL5/KPY7N2IcAxU7rXFTacWp6UoESph5DCYLsPREONb
 OsK1pVbAsKATobeoAC9J+utILhfDmKM8Z7eSAlNE+X+nk/BKu4h9Pp1TnKfo7bCz
 0tER/OrsqnYMfxj1PawT3xpf/KUWkL0aqnRJpmA2cvJqTz8Qnb4h6kRQp1iAKp80
 XaBL1v0uzVE/J4ffuA5bzkT71w3hjN5ytLyEe7h1Y43E/jxyQgyTIHM8cX/UrreJ
 RboaakyoTv1u1xrd9Mzx4WCzwKryH+JFY2nekAC3YnSCcGYnSScSNM/ARTrYC2pf
 wNbWBvkq7ZFy9eybaZQ/zaSYyVO7yQDjdCAqO+SA+xfRuwF41uiADJptyC+FgMPw
 nIBaeid314tJQ9uGNPJH0f2BzLzSvH569trUp/7hbOYWC69XeQI=
 =jyth
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2025-12-05' of https://gitlab.freedesktop.org/drm/kernel

Pull more drm updates from Dave Airlie:
 "There was some additional intel code for color operations we wanted to
  land. However I discovered I missed a pull for the xe vfio driver
  which I had sorted into 6.20 in my brain, until Thomas mentioned it.

  This contains the xe vfio code, a bunch of xe fixes that were waiting
  and the i915 color management support. I'd like to include it as part
  of keeping the two main vendors on the same page and giving a good
  cross-driver experience for userspace when it starts using it.

  vfio:
   - add a vfio_pci variant driver for Intel

  xe/i915 display:
   - add plane color management support

  xe:
   - Add scope-based cleanup helper for runtime PM
   - vfio xe driver prerequisites and exports
   - fix vfio link error
   - Fix a memory leak
   - Fix a 64-bit division
   - vf migration fix
   - LRC pause fix"

* tag 'drm-next-2025-12-05' of https://gitlab.freedesktop.org/drm/kernel: (25 commits)
  drm/i915/color: Enable Plane Color Pipelines
  drm/i915/color: Add 3D LUT to color pipeline
  drm/i915/color: Add registers for 3D LUT
  drm/i915/color: Program Plane Post CSC Registers
  drm/i915/color: Program Pre-CSC registers
  drm/i915/color: Add framework to program PRE/POST CSC LUT
  drm/i915: Add register definitions for Plane Post CSC
  drm/i915: Add register definitions for Plane Degamma
  drm/i915/color: Add plane CTM callback for D12 and beyond
  drm/i915/color: Preserve sign bit when int_bits is Zero
  drm/i915/color: Add framework to program CSC
  drm/i915/color: Create a transfer function color pipeline
  drm/i915/color: Add helper to create intel colorop
  drm/i915: Add intel_color_op
  drm/i915/display: Add identifiers for driver specific blocks
  drm/xe/pf: fix VFIO link error
  drm/xe: Protect against unset LRC when pausing submissions
  drm/xe/vf: Start re-emission from first unsignaled job during VF migration
  drm/xe/pf: Use div_u64 when calculating GGTT profile
  drm/xe: Fix memory leak when handling pagefault vma
  ...
2025-12-04 19:42:53 -08:00
Linus Torvalds
056daec292 iommufd 6.19 pull request
- Expand IOMMU_IOAS_MAP_FILE to accept a DMABUF exported from VFIO. This
   is the first step to broader DMABUF support in iommufd, right now it
   only works with VFIO. This closes the last functional gap with classic
   VFIO type 1 to safely support PCI peer to peer DMA by mapping the VFIO
   device's MMIO into the IOMMU.
 
 - Relax SMMUv3 restrictions on nesting domains to better support qemu's
   sequence to have an identity mapping before the vSID is established.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCaS8DrgAKCRCFwuHvBreF
 YY/kAP0Q1s7LkVb83uQb8kIW3xKzEnFNTlhrSSGV5UBuYLbaDgD+J+y+4VrSkJem
 85LMipmzaoZdHqtxMhQWrlYbZMr9TAM=
 =BacK
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd

Pull iommufd updates from Jason Gunthorpe:
 "This is a pretty consequential cycle for iommufd, though this pull is
  not too big. It is based on a shared branch with VFIO that introduces
  VFIO_DEVICE_FEATURE_DMA_BUF a DMABUF exporter for VFIO device's MMIO
  PCI BARs. This was a large multiple series journey over the last year
  and a half.

  Based on that work IOMMUFD gains support for VFIO DMABUF's in its
  existing IOMMU_IOAS_MAP_FILE, which closes the last major gap to
  support PCI peer to peer transfers within VMs.

  In Joerg's iommu tree we have the "generic page table" work which aims
  to consolidate all the duplicated page table code in every iommu
  driver into a single algorithm. This will be used by iommufd to
  implement unique page table operations to start adding new features
  and improve performance.

  In here:

   - Expand IOMMU_IOAS_MAP_FILE to accept a DMABUF exported from VFIO.
     This is the first step to broader DMABUF support in iommufd, right
     now it only works with VFIO. This closes the last functional gap
     with classic VFIO type 1 to safely support PCI peer to peer DMA by
     mapping the VFIO device's MMIO into the IOMMU.

   - Relax SMMUv3 restrictions on nesting domains to better support
     qemu's sequence to have an identity mapping before the vSID is
     established"

* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd:
  iommu/arm-smmu-v3-iommufd: Allow attaching nested domain for GBPA cases
  iommufd/selftest: Add some tests for the dmabuf flow
  iommufd: Accept a DMABUF through IOMMU_IOAS_MAP_FILE
  iommufd: Have iopt_map_file_pages convert the fd to a file
  iommufd: Have pfn_reader process DMABUF iopt_pages
  iommufd: Allow MMIO pages in a batch
  iommufd: Allow a DMABUF to be revoked
  iommufd: Do not map/unmap revoked DMABUFs
  iommufd: Add DMABUF to iopt_pages
  vfio/pci: Add vfio_pci_dma_buf_iommufd_map()
2025-12-04 18:50:11 -08:00
Linus Torvalds
a3ebb59eee VFIO updates for v6.19-rc1
- Move libvfio selftest artifacts in preparation of more tightly
    coupled integration with KVM selftests. (David Matlack)
 
  - Fix comment typo in mtty driver. (Chu Guangqing)
 
  - Support for new hardware revision in the hisi_acc vfio-pci variant
    driver where the migration registers can now be accessed via the PF.
    When enabled for this support, the full BAR can be exposed to the
    user. (Longfang Liu)
 
  - Fix vfio cdev support for VF token passing, using the correct size
    for the kernel structure, thereby actually allowing userspace to
    provide a non-zero UUID token.  Also set the match token callback for
    the hisi_acc, fixing VF token support for this this vfio-pci variant
    driver. (Raghavendra Rao Ananta)
 
  - Introduce internal callbacks on vfio devices to simplify and
    consolidate duplicate code for generating VFIO_DEVICE_GET_REGION_INFO
    data, removing various ioctl intercepts with a more structured
    solution. (Jason Gunthorpe)
 
  - Introduce dma-buf support for vfio-pci devices, allowing MMIO regions
    to be exposed through dma-buf objects with lifecycle managed through
    move operations.  This enables low-level interactions such as a
    vfio-pci based SPDK drivers interacting directly with dma-buf capable
    RDMA devices to enable peer-to-peer operations.  IOMMUFD is also now
    able to build upon this support to fill a long standing feature gap
    versus the legacy vfio type1 IOMMU backend with an implementation of
    P2P support for VM use cases that better manages the lifecycle of the
    P2P mapping. (Leon Romanovsky, Jason Gunthorpe, Vivek Kasireddy)
 
  - Convert eventfd triggering for error and request signals to use RCU
    mechanisms in order to avoid a 3-way lockdep reported deadlock issue.
    (Alex Williamson)
 
  - Fix a 32-bit overflow introduced via dma-buf support manifesting with
    large DMA buffers. (Alex Mastro)
 
  - Convert nvgrace-gpu vfio-pci variant driver to insert mappings on
    fault rather than at mmap time.  This conversion serves both to make
    use of huge PFNMAPs but also to both avoid corrected RAS events
    during reset by now being subject to vfio-pci-core's use of
    unmap_mapping_range(), and to enable a device readiness test after
    reset. (Ankit Agrawal)
 
  - Refactoring of vfio selftests to support multi-device tests and split
    code to provide better separation between IOMMU and device objects.
    This work also enables a new test suite addition to measure parallel
    device initialization latency. (David Matlack)
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmkvV3IRHGFsZXhAc2hh
 emJvdC5vcmcACgkQI5ubbjuwiyIpIQ/9GwpjLH5Vdv0v2d9mkHmZIWFpG/tr3zJa
 +spQqOjO0etASc67PtIJArT9pWib+s6O8OaG7iFrdNR65HCSsXSZbIGbMThPODfy
 DdDj1ipAqMVwcaCZT8un2N8Sktu9YpFQMvc5IoXWWYhw88vili7bBx+OTrEFV2T0
 6qQijSBdhw1TXVFHG6BGSmqmisyMepIebA6GmPWdfYu6BfoWBYMdcMjDwd1J61Q5
 DDwFRzn/Dz2Tvb1jbXiiRMRuFIuegFQii+wtd30S/cRPFZhZLWzc+drimC6oOFiQ
 qL19vQQsBPnLtGvch40HsET/AbY5w0pLCkYX5qacxP3sq27+N+KuotzCvbnVMN+H
 e2BqOCujyoce8z1Br6BzV71Lr2yzPDcc5pXTuEuuBT+J/ptOY8hfEikOj85s5Wzj
 aKsTrdDRGMrn/o11NkGSzYwFcMs9MxCX9mo98U6OkWDr0+cmPLf4LGZgpJudWg4E
 POUlzPpnzJrTlX5d+OqCdKJG0a1hTlTa2udzRa5hCDANHaZWLoAssfgSEKfV9xt1
 PzOMf0UIJmPJmFcw/OpMO72/5xp8O4WslJS0ulSm6vrAJDtutLApHZ7bJ44KniNd
 4vte+gOjyZY8ibTDKRULhXVlCDxkEnZjRBbApgI9HJD61IElOzjqohRuRx77J09B
 7c8OSLI8d1U=
 =tpee
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v6.19-rc1' of https://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:

 - Move libvfio selftest artifacts in preparation of more tightly
   coupled integration with KVM selftests (David Matlack)

 - Fix comment typo in mtty driver (Chu Guangqing)

 - Support for new hardware revision in the hisi_acc vfio-pci variant
   driver where the migration registers can now be accessed via the PF.
   When enabled for this support, the full BAR can be exposed to the
   user (Longfang Liu)

 - Fix vfio cdev support for VF token passing, using the correct size
   for the kernel structure, thereby actually allowing userspace to
   provide a non-zero UUID token. Also set the match token callback for
   the hisi_acc, fixing VF token support for this this vfio-pci variant
   driver (Raghavendra Rao Ananta)

 - Introduce internal callbacks on vfio devices to simplify and
   consolidate duplicate code for generating VFIO_DEVICE_GET_REGION_INFO
   data, removing various ioctl intercepts with a more structured
   solution (Jason Gunthorpe)

 - Introduce dma-buf support for vfio-pci devices, allowing MMIO regions
   to be exposed through dma-buf objects with lifecycle managed through
   move operations. This enables low-level interactions such as a
   vfio-pci based SPDK drivers interacting directly with dma-buf capable
   RDMA devices to enable peer-to-peer operations. IOMMUFD is also now
   able to build upon this support to fill a long standing feature gap
   versus the legacy vfio type1 IOMMU backend with an implementation of
   P2P support for VM use cases that better manages the lifecycle of the
   P2P mapping (Leon Romanovsky, Jason Gunthorpe, Vivek Kasireddy)

 - Convert eventfd triggering for error and request signals to use RCU
   mechanisms in order to avoid a 3-way lockdep reported deadlock issue
   (Alex Williamson)

 - Fix a 32-bit overflow introduced via dma-buf support manifesting with
   large DMA buffers (Alex Mastro)

 - Convert nvgrace-gpu vfio-pci variant driver to insert mappings on
   fault rather than at mmap time. This conversion serves both to make
   use of huge PFNMAPs but also to both avoid corrected RAS events
   during reset by now being subject to vfio-pci-core's use of
   unmap_mapping_range(), and to enable a device readiness test after
   reset (Ankit Agrawal)

 - Refactoring of vfio selftests to support multi-device tests and split
   code to provide better separation between IOMMU and device objects.
   This work also enables a new test suite addition to measure parallel
   device initialization latency (David Matlack)

* tag 'vfio-v6.19-rc1' of https://github.com/awilliam/linux-vfio: (65 commits)
  vfio: selftests: Add vfio_pci_device_init_perf_test
  vfio: selftests: Eliminate INVALID_IOVA
  vfio: selftests: Split libvfio.h into separate header files
  vfio: selftests: Move vfio_selftests_*() helpers into libvfio.c
  vfio: selftests: Rename vfio_util.h to libvfio.h
  vfio: selftests: Stop passing device for IOMMU operations
  vfio: selftests: Move IOVA allocator into iova_allocator.c
  vfio: selftests: Move IOMMU library code into iommu.c
  vfio: selftests: Rename struct vfio_dma_region to dma_region
  vfio: selftests: Upgrade driver logging to dev_err()
  vfio: selftests: Prefix logs with device BDF where relevant
  vfio: selftests: Eliminate overly chatty logging
  vfio: selftests: Support multiple devices in the same container/iommufd
  vfio: selftests: Introduce struct iommu
  vfio: selftests: Rename struct vfio_iommu_mode to iommu_mode
  vfio: selftests: Allow passing multiple BDFs on the command line
  vfio: selftests: Split run.sh into separate scripts
  vfio: selftests: Move run.sh into scripts directory
  vfio/nvgrace-gpu: wait for the GPU mem to be ready
  vfio/nvgrace-gpu: Inform devmem unmapped after reset
  ...
2025-12-04 18:42:48 -08:00
Linus Torvalds
1b5dd29869 vfs-6.19-rc1.fd_prepare.fs
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZwAKCRCRxhvAZXjc
 op0AAP4oNVJkFyvgKoPos5K2EGFB8M7merGhpYtsOoeg8UK6OwD/UySQErHsXQDR
 sUDDa5uFOhfrkcfM8REtAN4wF8p5qAc=
 =QgFD
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull fd prepare updates from Christian Brauner:
 "This adds the FD_ADD() and FD_PREPARE() primitive. They simplify the
  common pattern of get_unused_fd_flags() + create file + fd_install()
  that is used extensively throughout the kernel and currently requires
  cumbersome cleanup paths.

  FD_ADD() - For simple cases where a file is installed immediately:

      fd = FD_ADD(O_CLOEXEC, vfio_device_open_file(device));
      if (fd < 0)
          vfio_device_put_registration(device);
      return fd;

  FD_PREPARE() - For cases requiring access to the fd or file, or
  additional work before publishing:

      FD_PREPARE(fdf, O_CLOEXEC, sync_file->file);
      if (fdf.err) {
          fput(sync_file->file);
          return fdf.err;
      }

      data.fence = fd_prepare_fd(fdf);
      if (copy_to_user((void __user *)arg, &data, sizeof(data)))
          return -EFAULT;

      return fd_publish(fdf);

  The primitives are centered around struct fd_prepare. FD_PREPARE()
  encapsulates all allocation and cleanup logic and must be followed by
  a call to fd_publish() which associates the fd with the file and
  installs it into the caller's fdtable. If fd_publish() isn't called,
  both are deallocated automatically. FD_ADD() is a shorthand that does
  fd_publish() immediately and never exposes the struct to the caller.

  I've implemented this in a way that it's compatible with the cleanup
  infrastructure while also being usable separately. IOW, it's centered
  around struct fd_prepare which is aliased to class_fd_prepare_t and so
  we can make use of all the basica guard infrastructure"

* tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits)
  io_uring: convert io_create_mock_file() to FD_PREPARE()
  file: convert replace_fd() to FD_PREPARE()
  vfio: convert vfio_group_ioctl_get_device_fd() to FD_ADD()
  tty: convert ptm_open_peer() to FD_ADD()
  ntsync: convert ntsync_obj_get_fd() to FD_PREPARE()
  media: convert media_request_alloc() to FD_PREPARE()
  hv: convert mshv_ioctl_create_partition() to FD_ADD()
  gpio: convert linehandle_create() to FD_PREPARE()
  pseries: port papr_rtas_setup_file_interface() to FD_ADD()
  pseries: convert papr_platform_dump_create_handle() to FD_ADD()
  spufs: convert spufs_gang_open() to FD_PREPARE()
  papr-hvpipe: convert papr_hvpipe_dev_create_handle() to FD_PREPARE()
  spufs: convert spufs_context_open() to FD_PREPARE()
  net/socket: convert __sys_accept4_file() to FD_ADD()
  net/socket: convert sock_map_fd() to FD_ADD()
  net/kcm: convert kcm_ioctl() to FD_PREPARE()
  net/handshake: convert handshake_nl_accept_doit() to FD_PREPARE()
  secretmem: convert memfd_secret() to FD_ADD()
  memfd: convert memfd_create() to FD_ADD()
  bpf: convert bpf_token_create() to FD_PREPARE()
  ...
2025-12-01 17:32:07 -08:00
Michał Winiarski
1f5556ec8b vfio/xe: Add device specific vfio_pci driver variant for Intel graphics
In addition to generic VFIO PCI functionality, the driver implements
VFIO migration uAPI, allowing userspace to enable migration for Intel
Graphics SR-IOV Virtual Functions.
The driver binds to VF device and uses API exposed by Xe driver to
transfer the VF migration data under the control of PF device.

Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Alex Williamson <alex@shazbot.org>
Link: https://patch.msgid.link/20251127093934.1462188-5-michal.winiarski@intel.com
Link: https://lore.kernel.org/all/20251128125322.34edbeaf.alex@shazbot.org/
Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
(cherry picked from commit 2e38c50ae4)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-12-01 09:45:48 +01:00
Ankit Agrawal
a23b10608d vfio/nvgrace-gpu: wait for the GPU mem to be ready
Speculative prefetches from CPU to GPU memory until the GPU is
ready after reset can cause harmless corrected RAS events to
be logged on Grace systems. It is thus preferred that the
mapping not be re-established until the GPU is ready post reset.

The GPU readiness can be checked through BAR0 registers similar
to the checking at the time of device probe.

It can take several seconds for the GPU to be ready. So it is
desirable that the time overlaps as much of the VM startup as
possible to reduce impact on the VM bootup time. The GPU
readiness state is thus checked on the first fault/huge_fault
request or read/write access which amortizes the GPU readiness
time.

The first fault and read/write checks the GPU state when the
reset_done flag - which denotes whether the GPU has just been
reset. The memory_lock is taken across map/access to avoid
races with GPU reset.

Also check if the memory is enabled, before waiting for GPU
to be ready. Otherwise the readiness check would block for 30s.

Lastly added PM handling wrapping on read/write access.

Cc: Shameer Kolothum <skolothumtho@nvidia.com>
Cc: Alex Williamson <alex@shazbot.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Vikram Sethi <vsethi@nvidia.com>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Suggested-by: Alex Williamson <alex@shazbot.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251127170632.3477-7-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:09:26 -07:00
Ankit Agrawal
dfe765499a vfio/nvgrace-gpu: Inform devmem unmapped after reset
Introduce a new flag reset_done to notify that the GPU has just
been reset and the mapping to the GPU memory is zapped.

Implement the reset_done handler to set this new variable. It
will be used later in the patches to wait for the GPU memory
to be ready before doing any mapping or access.

Cc: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Suggested-by: Alex Williamson <alex@shazbot.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251127170632.3477-6-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:07:26 -07:00
Ankit Agrawal
7d055071d7 vfio/nvgrace-gpu: split the code to wait for GPU ready
Split the function that check for the GPU device being ready on
the probe.

Move the code to wait for the GPU to be ready through BAR0 register
reads to a separate function. This would help reuse the code.

This also fixes a bug where the return status in case of timeout
gets overridden by return from pci_enable_device. With the fix,
a timeout generate an error as initially intended.

Fixes: d85f69d520 ("vfio/nvgrace-gpu: Check the HBM training and C2C link status")

Reviewed-by: Zhi Wang <zhiw@nvidia.com>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251127170632.3477-5-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:07:25 -07:00
Ankit Agrawal
7f5764e179 vfio: use vfio_pci_core_setup_barmap to map bar in mmap
Remove code duplication in vfio_pci_core_mmap by calling
vfio_pci_core_setup_barmap to perform the bar mapping.

No functional change is intended.

Cc: Donald Dutile <ddutile@redhat.com>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Zhi Wang <zhiw@nvidia.com>
Suggested-by: Alex Williamson <alex@shazbot.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251127170632.3477-4-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:07:25 -07:00
Ankit Agrawal
9db65489b8 vfio/nvgrace-gpu: Add support for huge pfnmap
NVIDIA's Grace based systems have large device memory. The device
memory is mapped as VM_PFNMAP in the VMM VMA. The nvgrace-gpu
module could make use of the huge PFNMAP support added in mm [1].

To make use of the huge pfnmap support, fault/huge_fault ops
based mapping mechanism needs to be implemented. Currently nvgrace-gpu
module relies on remap_pfn_range to do the mapping during VM bootup.
Replace it to instead rely on fault and use vfio_pci_vmf_insert_pfn
to setup the mapping.

Moreover to enable huge pfnmap, nvgrace-gpu module is updated by
adding huge_fault ops implementation. The implementation establishes
mapping according to the order request. Note that if the PFN or the
VMA address is unaligned to the order, the mapping fallbacks to
the PTE level.

Link: https://lore.kernel.org/all/20240826204353.2228736-1-peterx@redhat.com/ [1]

Cc: Shameer Kolothum <skolothumtho@nvidia.com>
Cc: Alex Williamson <alex@shazbot.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Vikram Sethi <vsethi@nvidia.com>
Reviewed-by: Zhi Wang <zhiw@nvidia.com>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251127170632.3477-3-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:07:25 -07:00
Ankit Agrawal
9b92bc7554 vfio: refactor vfio_pci_mmap_huge_fault function
Refactor vfio_pci_mmap_huge_fault to take out the implementation
to map the VMA to the PTE/PMD/PUD as a separate function.

Export the new function to be used by nvgrace-gpu module.

Move the alignment check code to verify that pfn and VMA VA is
aligned to the page order to the header file and make it inline.

No functional change is intended.

Cc: Shameer Kolothum <skolothumtho@nvidia.com>
Cc: Alex Williamson <alex@shazbot.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251127170632.3477-2-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:07:25 -07:00
Alex Williamson
98693e0897 vfio/pci: Use RCU for error/request triggers to avoid circular locking
Thanks to a device generating an ACS violation during bus reset,
lockdep reported the following circular locking issue:

CPU0: SET_IRQS (MSI/X): holds igate, acquires memory_lock
CPU1: HOT_RESET: holds memory_lock, acquires pci_bus_sem
CPU2: AER: holds pci_bus_sem, acquires igate

This results in a potential 3-way deadlock.

Remove the pci_bus_sem->igate leg of the triangle by using RCU
to peek at the eventfd rather than locking it with igate.

Fixes: 3be3a074cf ("vfio-pci: Don't use device_lock around AER interrupt setup")
Signed-off-by: Alex Williamson <alex.williamson@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20251124223623.2770706-1-alex@shazbot.org
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:04:27 -07:00
Christian Brauner
5f3ea1c201
vfio: convert vfio_group_ioctl_get_device_fd() to FD_ADD()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-43-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:36 +01:00
Jason Gunthorpe
96ce2aeb15 vfio/pci: Add vfio_pci_dma_buf_iommufd_map()
This function is used to establish the "private interconnect" between the
VFIO DMABUF exporter and the iommufd DMABUF importer. This is intended to
be a temporary API until the core DMABUF interface is improved to natively
support a private interconnect and revocable negotiation.

This function should only be called by iommufd when trying to map a
DMABUF. For now iommufd will only support VFIO DMABUFs.

The following improvements are needed in the DMABUF API to generically
support more exporters with iommufd/kvm type importers that cannot use the
DMA API:

 1) Revoke semantics. VFIO needs to be able to prevent access to the MMIO
    during FLR, and so it will use dma_buf_move_notify() to prevent
    access. iommmufd does not support fault handling so it cannot
    implement the full move_notify. Instead if revoke is negotiated the
    exporter promises not to use move_notify() unless the importer can
    experiance failures. iommufd will unmap the dmabuf from the iommu page
    tables while it is revoked.

 2) Private interconnect negotiation. iommufd will only be able to map
    a "private interconnect" that provides a phys_addr_t and a
    struct p2pdma_provider * to describe the memory. It cannot use a DMA
    mapped scatterlist since it is directly calling iommu_map().

 3) NULL device during dma_buf_dynamic_attach(). Since iommufd doesn't use
    the DMA API it doesn't have a DMAable struct device to pass here.

Link: https://patch.msgid.link/r/1-v2-b2c110338e3f+5c2-iommufd_dmabuf_jgg@nvidia.com
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
Acked-by: Alex Williamson <alex@shazbot.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-11-25 11:30:15 -04:00
Alex Williamson
fa804aa4ac [v9] vfio/pci: Allow MMIO regions to be exported through dma-buf
https://lore.kernel.org/all/20251120-dmabuf-vfio-v9-0-d7f71607f371@nvidia.com
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmkf53sRHGFsZXhAc2hh
 emJvdC5vcmcACgkQI5ubbjuwiyJv5w//TdfL5p6yz8O9CxJCQrm0W6raqDx+LE7u
 MNyCktSdokkPKSz/ms10vgl9CGqpVzDHNlgmVBGkAFQaRYQKUgMryp1IQ0jlspz0
 Ee1zy6AtlMemAyL1bybnk6yvc2nh/xa3LHa4FJ6sgL3KKnt9KjpY4sGGV/KlNfJV
 lYs+20+NyNU1rgyPtkHcrCrcYTMkDumvHDsn51O8Zx12b++qkZuf3b+mcWNNlNam
 DJl58O9tio0ol5a4rf63BxgPROgEVqs4G4rSRelJqr6g8IIFttihplhyZ83af9sC
 jtV0NEqsWt0nrKZtg1N9IIBgfQho5eamB99J0cPU0dhZSKXzBwIalBUb6zKZeVVY
 QEN6ZLoynpKRpZ1bhe3EhduNE1LOm2+wOJ7s93gSAtvsSXHHGDf1cXHL5CzKMLwG
 76fkO6b0m60mwXyjTgEVjqE9GTcgZLc9SKI9GN303v53W5Y1BOtW9B4QyX9eEPui
 qqtSbqXMsYx95lJASkjwE+u2b33mFrks8UC7Xg1nZkJWLy6hDPbQ6AIcM0mHd1EE
 2UJNJAzeUtzr0Vd3B92RkT0BPY299XlUUp50In42/g5y6IMSO7R/jZXrQyEXEDez
 dsqedYJE0vzO07pfkRYh1TJtwIl0WfdNnqClOKSlolfSn0vzI5qz9RKkpHM5T8Ou
 FIIb7Bf3hvs=
 =li88
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v6.19-dma-buf-v9+' into v6.19/vfio/next

[v9] vfio/pci: Allow MMIO regions to be exported through dma-buf

https://lore.kernel.org/all/20251120-dmabuf-vfio-v9-0-d7f71607f371@nvidia.com

Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-20 21:20:00 -07:00
Jason Gunthorpe
5415d887db vfio/nvgrace: Support get_dmabuf_phys
Call vfio_pci_core_fill_phys_vec() with the proper physical ranges for the
synthetic BAR 2 and BAR 4 regions. Otherwise use the normal flow based on
the PCI bar.

This demonstrates a DMABUF that follows the region info report to only
allow mapping parts of the region that are mmapable. Since the BAR is
power of two sized and the "CXL" region is just page aligned the there can
be a padding region at the end that is not mmaped or passed into the
DMABUF.

The "CXL" ranges that are remapped into BAR 2 and BAR 4 areas are not PCI
MMIO, they actually run over the CXL-like coherent interconnect and for
the purposes of DMA behave identically to DRAM. We don't try to model this
distinction between true PCI BAR memory that takes a real PCI path and the
"CXL" memory that takes a different path in the p2p framework for now.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Reviewed-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-11-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-20 21:15:17 -07:00
Leon Romanovsky
5d74781ebc vfio/pci: Add dma-buf export support for MMIO regions
Add support for exporting PCI device MMIO regions through dma-buf,
enabling safe sharing of non-struct page memory with controlled
lifetime management. This allows RDMA and other subsystems to import
dma-buf FDs and build them into memory regions for PCI P2P operations.

The implementation provides a revocable attachment mechanism using
dma-buf move operations. MMIO regions are normally pinned as BARs
don't change physical addresses, but access is revoked when the VFIO
device is closed or a PCI reset is issued. This ensures kernel
self-defense against potentially hostile userspace.

Currently VFIO can take MMIO regions from the device's BAR and map
them into a PFNMAP VMA with special PTEs. This mapping type ensures
the memory cannot be used with things like pin_user_pages(), hmm, and
so on. In practice only the user process CPU and KVM can safely make
use of these VMA. When VFIO shuts down these VMAs are cleaned by
unmap_mapping_range() to prevent any UAF of the MMIO beyond driver
unbind.

However, VFIO type 1 has an insecure behavior where it uses
follow_pfnmap_*() to fish a MMIO PFN out of a VMA and program it back
into the IOMMU. This has a long history of enabling P2P DMA inside
VMs, but has serious lifetime problems by allowing a UAF of the MMIO
after the VFIO driver has been unbound.

Introduce DMABUF as a new safe way to export a FD based handle for the
MMIO regions. This can be consumed by existing DMABUF importers like
RDMA or DRM without opening an UAF. A following series will add an
importer to iommufd to obsolete the type 1 code and allow safe
UAF-free MMIO P2P in VM cases.

DMABUF has a built in synchronous invalidation mechanism called
move_notify. VFIO keeps track of all drivers importing its MMIO and
can invoke a synchronous invalidation callback to tell the importing
drivers to DMA unmap and forget about the MMIO pfns. This process is
being called revoke. This synchronous invalidation fully prevents any
lifecycle problems. VFIO will do this before unbinding its driver
ensuring there is no UAF of the MMIO beyond the driver lifecycle.

Further, VFIO has additional behavior to block access to the MMIO
during things like Function Level Reset. This is because some poor
platforms may experience a MCE type crash when touching MMIO of a PCI
device that is undergoing a reset. Today this is done by using
unmap_mapping_range() on the VMAs. Extend that into the DMABUF world
and temporarily revoke the MMIO from the DMABUF importers during FLR
as well. This will more robustly prevent an errant P2P from possibly
upsetting the platform.

A DMABUF FD is a preferred handle for MMIO compared to using something
like a pgmap because:
 - VFIO is supported, including its P2P feature, on archs that don't
   support pgmap
 - PCI devices have all sorts of BAR sizes, including ones smaller
   than a section so a pgmap cannot always be created
 - It is undesirable to waste a lot of memory for struct pages,
   especially for a case like a GPU with ~100GB of BAR size
 - We want a synchronous revoke semantic to support FLR with light
   hardware requirements

Use the P2P subsystem to help generate the DMA mapping. This is a
significant upgrade over the abuse of dma_map_resource() that has
historically been used by DMABUF exporters. Experience with an OOT
version of this patch shows that real systems do need this. This
approach deals with all the P2P scenarios:
 - Non-zero PCI bus_offset
 - ACS flags routing traffic to the IOMMU
 - ACS flags that bypass the IOMMU - though vfio noiommu is required
   to hit this.

There will be further work to formalize the revoke semantic in
DMABUF. For now this acts like a move_notify dynamic exporter where
importer fault handling will get a failure when they attempt to map.
This means that only fully restartable fault capable importers can
import the VFIO DMABUFs. A future revoke semantic should open this up
to more HW as the HW only needs to invalidate, not handle restartable
faults.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-10-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-20 21:12:19 -07:00
Leon Romanovsky
35c3503908 vfio/pci: Enable peer-to-peer DMA transactions by default
Make sure that all VFIO PCI devices have peer-to-peer capabilities
enables, so we would be able to export their MMIO memory through DMABUF,

VFIO has always supported P2P mappings with itself. VFIO type 1
insecurely reads PFNs directly out of a VMA's PTEs and programs them
into the IOMMU allowing any two VFIO devices to perform P2P to each
other.

All existing VMMs use this capability to export P2P into a VM where
the VM could setup any kind of DMA it likes. Projects like DPDK/SPDK
are also known to make use of this, though less frequently.

As a first step to more properly integrating VFIO with the P2P
subsystem unconditionally enable P2P support for VFIO PCI devices. The
struct p2pdma_provider will act has a handle to the P2P subsystem to
do things like DMA mapping.

While real PCI devices have to support P2P (they can't even tell if an
IOVA is P2P or not) there may be fake PCI devices that may trigger
some kind of catastrophic system failure. To date VFIO has never
tripped up on such a case, but if one is discovered the plan is to add
a PCI quirk and have pcim_p2pdma_init() fail. This will fully block
the broken device throughout any users of the P2P subsystem in the
kernel.

Thus P2P through DMABUF will follow the historical VFIO model and be
unconditionally enabled by vfio-pci.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-9-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-20 12:02:46 -07:00
Vivek Kasireddy
47d13c939d vfio/pci: Share the core device pointer while invoking feature functions
There is no need to share the main device pointer (struct vfio_device *)
with all the feature functions as they only need the core device
pointer. Therefore, extract the core device pointer once in the
caller (vfio_pci_core_ioctl_feature) and share it instead.

Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-8-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-20 12:02:40 -07:00
Vivek Kasireddy
64a5dedcff vfio: Export vfio device get and put registration helpers
These helpers are useful for managing additional references taken
on the device from other associated VFIO modules.

Original-patch-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-7-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-20 12:02:32 -07:00
Jason Gunthorpe
56c069307d vfio: Remove the get_region_info op
No driver uses it now, all are using get_region_info_caps().

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/22-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:06:41 -07:00
Jason Gunthorpe
dc10734610 vfio: Move the remaining drivers to get_region_info_caps
Remove the duplicate code and change info to a pointer. caps are not used.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/21-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:06:41 -07:00
Jason Gunthorpe
182c62861b vfio/platform: Convert to get_region_info_caps
Remove the duplicate code and change info to a pointer. caps are not used.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:06:41 -07:00
Jason Gunthorpe
1b0ecb5baf vfio/pci: Convert all PCI drivers to get_region_info_caps
Since the core function signature changes it has to flow up to all
drivers.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/19-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:06:40 -07:00
Jason Gunthorpe
775f726a74 vfio: Add get_region_info_caps op
This op does the copy to/from user for the info and can return back
a cap chain through a vfio_info_cap * result.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/15-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:05:02 -07:00
Jason Gunthorpe
f978595038 vfio: Require drivers to implement get_region_info
Remove the fallback through the ioctl callback, no drivers use this now.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/14-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:05:02 -07:00
Jason Gunthorpe
b9827eff6b vfio/cdx: Provide a get_region_info op
Change the signature of vfio_cdx_ioctl_get_region_info() and hook it to
the op.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/11-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:05:02 -07:00
Jason Gunthorpe
6cdae5d0c3 vfio/fsl: Provide a get_region_info op
Move it out of vfio_fsl_mc_ioctl() and re-indent it.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/10-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:05:02 -07:00
Jason Gunthorpe
d4635df279 vfio/platform: Provide a get_region_info op
Move it out of vfio_platform_ioctl() and re-indent it. Add it to all
platform drivers.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/9-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:05:02 -07:00
Jason Gunthorpe
f3fddb71dd vfio/pci: Fill in the missing get_region_info ops
Now that every variant driver provides a get_region_info op remove the
ioctl based dispatch from vfio_pci_core_ioctl().

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:05:02 -07:00