twx-linux

Author	SHA1	Message	Date
Georg Gottleuber	42385f9c4c	nvme-pci: Add TUXEDO IBP Gen9 to Samsung sleep quirk commit `11cb3529d1` upstream. On the TUXEDO InfinityBook Pro Gen9 Intel, a Samsung 990 Evo NVMe leads to a high power consumption in s2idle sleep (4 watts). This patch applies 'Force No Simple Suspend' quirk to achieve a sleep with a lower power consumption, typically around 1.2 watts. Signed-off-by: Georg Gottleuber <ggo@tuxedocomputers.com> Cc: stable@vger.kernel.org Signed-off-by: Werner Sembach <wse@tuxedocomputers.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-02-17 09:40:26 +01:00
Georg Gottleuber	9db27ba3a1	nvme-pci: Add TUXEDO InfinityFlex to Samsung sleep quirk commit `dbf2bb1a13` upstream. On the TUXEDO InfinityFlex, a Samsung 990 Evo NVMe leads to a high power consumption in s2idle sleep (4 watts). This patch applies 'Force No Simple Suspend' quirk to achieve a sleep with a lower power consumption, typically around 1.4 watts. Signed-off-by: Georg Gottleuber <ggo@tuxedocomputers.com> Cc: stable@vger.kernel.org Signed-off-by: Werner Sembach <wse@tuxedocomputers.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-02-17 09:40:26 +01:00
Daniel Wagner	40a29e8499	nvme-fc: use ctrl state getter [ Upstream commit `c8ed6cb5d3` ] Do not access the state variable directly, instead use proper synchronization so not stale data is read. Fixes: `e6e7f7ac03` ("nvme: ensure reset state check ordering") Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-02-17 09:40:12 +01:00
Daniel Wagner	7957863032	nvme: handle connectivity loss in nvme_set_queue_count [ Upstream commit `294b2b7516` ] When the set feature attempts fails with any NVME status code set in nvme_set_queue_count, the function still report success. Though the numbers of queues set to 0. This is done to support controllers in degraded state (the admin queue is still up and running but no IO queues). Though there is an exception. When nvme_set_features reports an host path error, nvme_set_queue_count should propagate this error as the connectivity is lost, which means also the admin queue is not working anymore. Fixes: `9a0be7abb6` ("nvme: refactor set_queue_count") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Daniel Wagner <wagi@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-02-17 09:40:11 +01:00
Jens Axboe	db996ed199	nvme: fix bogus kzalloc() return check in nvme_init_effects_log() [ Upstream commit `170e086ad3` ] nvme_init_effects_log() returns failure when kzalloc() is successful, which is obviously wrong and causes failures to boot. Correct the check. Fixes: `d4a95adeab` ("nvme: Add error path for xa_store in nvme_init_effects") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-02-08 09:51:43 +01:00
Keisuke Nishimura	525dc0f604	nvme: Add error path for xa_store in nvme_init_effects [ Upstream commit `d4a95adeab` ] The xa_store() may fail due to memory allocation failure because there is no guarantee that the index NVME_CSI_NVM is already used. This fix introduces a new function to handle the error path. Fixes: `cc115cbe12` ("nvme: always initialize known command effects") Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-02-08 09:51:43 +01:00
Keisuke Nishimura	48ef61d25e	nvme: Add error check for xa_store in nvme_get_effects_log [ Upstream commit `ac32057acc` ] The xa_store() may fail due to memory allocation failure because there is no guarantee that the index csi is already used. This fix adds an error check of the return value of xa_store() in nvme_get_effects_log(). Fixes: `1cf7a12e09` ("nvme: use an xarray to lookup the Commands Supported and Effects log") Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-02-08 09:51:43 +01:00
Luis Chamberlain	002b2efb11	nvmet: propagate npwg topology [ Upstream commit `b579d6fdc3` ] Ensure we propagate npwg to the target as well instead of assuming its the same logical blocks per physical block. This ensures devices with large IUs information properly propagated on the target. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-01-23 17:21:14 +01:00
Robert Beckett	347654387b	nvme-pci: 512 byte aligned dma pool segment quirk [ Upstream commit `ebefac5647` ] We initially introduced a quick fix limiting the queue depth to 1 as experimentation showed that it fixed data corruption on 64GB steamdecks. Further experimentation revealed corruption only happens when the last PRP data element aligns to the end of the page boundary. The device appears to treat this as a PRP chain to a new list instead of the data element that it actually is. This implementation is in violation of the spec. Encountering this errata with the Linux driver requires the host request a 128k transfer and coincidently be handed the last small pool dma buffer within a page. The QD1 quirk effectly works around this because the last data PRP always was at a 248 byte offset from the page start, so it never appeared at the end of the page, but comes at the expense of throttling IO and wasting the remainder of the PRP page beyond 256 bytes. Also to note, the MDTS on these devices is small enough that the "large" prp pool can hold enough PRP elements to never reach the end, so that pool is not a problem either. Introduce a new quirk to ensure the small pool is always aligned such that the last PRP element can't appear a the end of the page. This comes at the expense of wasting 256 bytes per small pool page allocated. Link: https://lore.kernel.org/linux-nvme/20241113043151.GA20077@lst.de/T/#u Fixes: `83bdfcbdbe` ("nvme-pci: qdepth 1 quirk") Cc: Paweł Anikiel <panikiel@google.com> Signed-off-by: Robert Beckett <bob.beckett@collabora.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-01-09 13:31:56 +01:00
Nilay Shroff	ddcc7d71be	Revert "nvme: make keep-alive synchronous operation" [ Upstream commit `8448828216` ] This reverts commit `d06923670b`. It was realized that the fix implemented to contain the race condition among the keep alive task and the fabric shutdown code path in the commit d06923670b5ia ("nvme: make keep-alive synchronous operation") is not optimal. The reason being keep-alive runs under the workqueue and making it synchronous would waste a workqueue context. Furthermore, we later found that the above race condition is a regression caused due to the changes implemented in commit `a54a93d0e3` ("nvme: move stopping keep-alive into nvme_uninit_ctrl()"). So we decided to revert the commit `d06923670b` ("nvme: make keep-alive synchronous operation") and then fix the regression. Link: https://lore.kernel.org/all/196f4013-3bbf-43ff-98b4-9cb2a96c20c2@grimberg.me/ Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-01-09 13:31:47 +01:00
Nilay Shroff	801acf741c	nvme: use helper nvme_ctrl_state in nvme_keep_alive_finish function [ Upstream commit `599d9f3a10` ] We no more need acquiring ctrl->lock before accessing the NVMe controller state and instead we can now use the helper nvme_ctrl_state. So replace the use of ctrl->lock from nvme_keep_alive_finish function with nvme_ctrl_state call. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Stable-dep-of: `8448828216` ("Revert "nvme: make keep-alive synchronous operation"") Signed-off-by: Sasha Levin <sashal@kernel.org>	2025-01-09 13:31:47 +01:00
Breno Leitao	1e20e4ff3a	nvme/multipath: Fix RCU list traversal to use SRCU primitive [ Upstream commit `5dd18f09ce` ] The code currently uses list_for_each_entry_rcu() while holding an SRCU lock, triggering false positive warnings with CONFIG_PROVE_RCU=y enabled: drivers/nvme/host/multipath.c:168 RCU-list traversed in non-reader section!! drivers/nvme/host/multipath.c:227 RCU-list traversed in non-reader section!! drivers/nvme/host/multipath.c:260 RCU-list traversed in non-reader section!! While the list is properly protected by SRCU lock, the code uses the wrong list traversal primitive. Replace list_for_each_entry_rcu() with list_for_each_entry_srcu() to correctly indicate SRCU-based protection and eliminate the false warning. Signed-off-by: Breno Leitao <leitao@debian.org> Fixes: `be647e2c76` ("nvme: use srcu for iterating namespace list") Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-12-09 10:32:56 +01:00
Hannes Reinecke	f0679539fa	nvme-multipath: avoid hang on inaccessible namespaces [ Upstream commit `3b97f5a05c` ] During repetitive namespace remapping operations on the target the namespace might have changed between the time the initial scan was performed, and partition scan was invoked by device_add_disk() in nvme_mpath_set_live(). We then end up with a stuck scanning process: [<0>] folio_wait_bit_common+0x12a/0x310 [<0>] filemap_read_folio+0x97/0xd0 [<0>] do_read_cache_folio+0x108/0x390 [<0>] read_part_sector+0x31/0xa0 [<0>] read_lba+0xc5/0x160 [<0>] efi_partition+0xd9/0x8f0 [<0>] bdev_disk_changed+0x23d/0x6d0 [<0>] blkdev_get_whole+0x78/0xc0 [<0>] bdev_open+0x2c6/0x3b0 [<0>] bdev_file_open_by_dev+0xcb/0x120 [<0>] disk_scan_partitions+0x5d/0x100 [<0>] device_add_disk+0x402/0x420 [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core] [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core] [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core] [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core] [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core] This happens when we have several paths, some of which are inaccessible, and the active paths are removed first. Then nvme_find_path() will requeue I/O in the ns_head (as paths are present), but the requeue list is never triggered as all remaining paths are inactive. This patch checks for NVME_NSHEAD_DISK_LIVE in nvme_available_path(), and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once the last path has been removed to properly terminate pending I/O. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org> Stable-dep-of: `5dd18f09ce` ("nvme/multipath: Fix RCU list traversal to use SRCU primitive") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-12-09 10:32:56 +01:00
Thomas Song	85b9f3e63e	nvme-multipath: implement "queue-depth" iopolicy [ Upstream commit `f227345f0a` ] The round-robin path selector is inefficient in cases where there is a difference in latency between paths. In the presence of one or more high latency paths the round-robin selector continues to use the high latency path equally. This results in a bias towards the highest latency path and can cause a significant decrease in overall performance as IOs pile on the highest latency path. This problem is acute with NVMe-oF controllers. The queue-depth path selector sends I/O down the path with the lowest number of requests in its request queue. Paths with lower latency will clear requests more quickly and have less requests queued compared to higher latency paths. The goal of this path selector is to make more use of lower latency paths which will bring down overall IO latency and increase throughput and performance. Signed-off-by: Thomas Song <tsong@purestorage.com> [emilne: commandeered patch developed by Thomas Song @ Pure Storage] Co-developed-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Ewan D. Milne <emilne@redhat.com> Co-developed-by: John Meneghini <jmeneghi@redhat.com> Signed-off-by: John Meneghini <jmeneghi@redhat.com> Link: https://lore.kernel.org/linux-nvme/20240509202929.831680-1-jmeneghi@redhat.com/ Tested-by: Marco Patalano <mpatalan@redhat.com> Tested-by: Jyoti Rani <jrani@purestorage.com> Tested-by: John Meneghini <jmeneghi@redhat.com> Reviewed-by: Randy Jennings <randyj@purestorage.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Stable-dep-of: `5dd18f09ce` ("nvme/multipath: Fix RCU list traversal to use SRCU primitive") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-12-09 10:32:56 +01:00
John Meneghini	a7071e2bd0	nvme-multipath: prepare for "queue-depth" iopolicy [ Upstream commit `3d7c2fd2ea` ] This patch prepares for the introduction of a new iopolicy by breaking up the nvme_find_path() code path into sub-routines. Signed-off-by: John Meneghini <jmeneghi@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Stable-dep-of: `5dd18f09ce` ("nvme/multipath: Fix RCU list traversal to use SRCU primitive") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-12-09 10:32:56 +01:00
Puranjay Mohan	6b42ded89b	nvme: fix metadata handling in nvme-passthrough commit `7c2fd76048` upstream. On an NVMe namespace that does not support metadata, it is possible to send an IO command with metadata through io-passthru. This allows issues like [1] to trigger in the completion code path. nvme_map_user_request() doesn't check if the namespace supports metadata before sending it forward. It also allows admin commands with metadata to be processed as it ignores metadata when bdev == NULL and may report success. Reject an IO command with metadata when the NVMe namespace doesn't support it and reject an admin command if it has metadata. [1] https://lore.kernel.org/all/mb61pcylvnym8.fsf@amazon.com/ Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Puranjay Mohan <pjy@amazon.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Keith Busch <kbusch@kernel.org> [ Minor changes to make it work on 6.6 ] Signed-off-by: Hagar Hemdan <hagarhem@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-12-09 10:32:39 +01:00
Christoph Hellwig	5198992912	nvme-pci: reverse request order in nvme_queue_rqs [ Upstream commit `beadf00885` ] blk_mq_flush_plug_list submits requests in the reverse order that they were submitted, which leads to a rather suboptimal I/O pattern especially in rotational devices. Fix this by rewriting nvme_queue_rqs so that it always pops the requests from the passed in request list, and then adds them to the head of a local submit list. This actually simplifies the code a bit as it removes the complicated list splicing, at the cost of extra updates of the rq_next pointer. As that should be cache hot anyway it should be an easy price to pay. Fixes: `d62cbcf62f` ("nvme: add support for mq_ops->queue_rqs()") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241113152050.157179-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-12-09 10:31:46 +01:00
Christoph Hellwig	cee3bff51a	nvme-pci: fix freeing of the HMB descriptor table [ Upstream commit `3c2fb1ca80` ] The HMB descriptor table is sized to the maximum number of descriptors that could be used for a given device, but __nvme_alloc_host_mem could break out of the loop earlier on memory allocation failure and end up using less descriptors than planned for, which leads to an incorrect size passed to dma_free_coherent. In practice this was not showing up because the number of descriptors tends to be low and the dma coherent allocator always allocates and frees at least a page. Fixes: `87ad72a59a` ("nvme-pci: implement host memory buffer support") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-12-09 10:31:44 +01:00
Keith Busch	f7d9a18572	nvme: apple: fix device reference counting [ Upstream commit `b9ecbfa455` ] Drivers must call nvme_uninit_ctrl after a successful nvme_init_ctrl. Split the allocation side out to make the error handling boundary easier to navigate. The apple driver had been doing this wrong, leaking the controller device memory on a tagset failure. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org> [ Resolve minor conflicts ] Signed-off-by: Bin Lan <bin.lan.cn@windriver.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-12-09 10:31:42 +01:00
Breno Leitao	5a526388d0	nvme/host: Fix RCU list traversal to use SRCU primitive [ Upstream commit `6d1c69945c` ] The code currently uses list_for_each_entry_rcu() while holding an SRCU lock, triggering false positive warnings with CONFIG_PROVE_RCU=y enabled: drivers/nvme/host/core.c:3770 RCU-list traversed in non-reader section!! While the list is properly protected by SRCU lock, the code uses the wrong list traversal primitive. Replace list_for_each_entry_rcu() with list_for_each_entry_srcu() to correctly indicate SRCU-based protection and eliminate the false warning. Fixes: `be647e2c76` ("nvme: use srcu for iterating namespace list") Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-17 15:08:57 +01:00
Nilay Shroff	1a1bcca5c9	nvme: make keep-alive synchronous operation [ Upstream commit `d06923670b` ] The nvme keep-alive operation, which executes at a periodic interval, could potentially sneak in while shutting down a fabric controller. This may lead to a race between the fabric controller admin queue destroy code path (invoked while shutting down controller) and hw/hctx queue dispatcher called from the nvme keep-alive async request queuing operation. This race could lead to the kernel crash shown below: Call Trace: autoremove_wake_function+0x0/0xbc (unreliable) __blk_mq_sched_dispatch_requests+0x114/0x24c blk_mq_sched_dispatch_requests+0x44/0x84 blk_mq_run_hw_queue+0x140/0x220 nvme_keep_alive_work+0xc8/0x19c [nvme_core] process_one_work+0x200/0x4e0 worker_thread+0x340/0x504 kthread+0x138/0x140 start_kernel_thread+0x14/0x18 While shutting down fabric controller, if nvme keep-alive request sneaks in then it would be flushed off. The nvme_keep_alive_end_io function is then invoked to handle the end of the keep-alive operation which decrements the admin->q_usage_counter and assuming this is the last/only request in the admin queue then the admin->q_usage_counter becomes zero. If that happens then blk-mq destroy queue operation (blk_mq_destroy_ queue()) which could be potentially running simultaneously on another cpu (as this is the controller shutdown code path) would forward progress and deletes the admin queue. So, now from this point onward we are not supposed to access the admin queue resources. However the issue here's that the nvme keep-alive thread running hw/hctx queue dispatch operation hasn't yet finished its work and so it could still potentially access the admin queue resource while the admin queue had been already deleted and that causes the above crash. This fix helps avoid the observed crash by implementing keep-alive as a synchronous operation so that we decrement admin->q_usage_counter only after keep-alive command finished its execution and returns the command status back up to its caller (blk_execute_rq()). This would ensure that fabric shutdown code path doesn't destroy the fabric admin queue until keep-alive request finished execution and also keep-alive thread is not running hw/hctx queue dispatch operation. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-17 15:08:57 +01:00
Nilay Shroff	e4f9fffbb1	nvme-loop: flush off pending I/O while shutting down loop controller [ Upstream commit `c199fac88f` ] While shutting down loop controller, we first quiesce the admin/IO queue, delete the admin/IO tag-set and then at last destroy the admin/IO queue. However it's quite possible that during the window between quiescing and destroying of the admin/IO queue, some admin/IO request might sneak in and if that happens then we could potentially encounter a hung task because shutdown operation can't forward progress until any pending I/O is flushed off. This commit helps ensure that before destroying the admin/IO queue, we unquiesce the admin/IO queue so that any outstanding requests, which are added after the admin/IO queue is quiesced, are now flushed to its completion. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-17 15:08:57 +01:00
Keith Busch	4a57f42e5e	nvme-multipath: defer partition scanning [ Upstream commit `1f021341ee` ] We need to suppress the partition scan from occuring within the controller's scan_work context. If a path error occurs here, the IO will wait until a path becomes available or all paths are torn down, but that action also occurs within scan_work, so it would deadlock. Defer the partion scan to a different context that does not block scan_work. Reported-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-17 15:08:57 +01:00
Greg Joyce	e04e648058	nvme: disable CC.CRIME (NVME_CC_CRIME) [ Upstream commit `0ce96a6708` ] Disable NVME_CC_CRIME so that CSTS.RDY indicates that the media is ready and able to handle commands without returning NVME_SC_ADMIN_COMMAND_MEDIA_NOT_READY. Signed-off-by: Greg Joyce <gjoyce@linux.ibm.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Tested-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-17 15:08:56 +01:00
Hannes Reinecke	975cb1d212	nvme: tcp: avoid race between queue_lock lock and destroy [ Upstream commit `782373ba27` ] Commit `76d54bf20c` ("nvme-tcp: don't access released socket during error recovery") added a mutex_lock() call for the queue->queue_lock in nvme_tcp_get_address(). However, the mutex_lock() races with mutex_destroy() in nvme_tcp_free_queue(), and causes the WARN below. DEBUG_LOCKS_WARN_ON(lock->magic != lock) WARNING: CPU: 3 PID: 34077 at kernel/locking/mutex.c:587 __mutex_lock+0xcf0/0x1220 Modules linked in: nvmet_tcp nvmet nvme_tcp nvme_fabrics iw_cm ib_cm ib_core pktcdvd nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables qrtr sunrpc ppdev 9pnet_virtio 9pnet pcspkr netfs parport_pc parport e1000 i2c_piix4 i2c_smbus loop fuse nfnetlink zram bochs drm_vram_helper drm_ttm_helper ttm drm_kms_helper xfs drm sym53c8xx floppy nvme scsi_transport_spi nvme_core nvme_auth serio_raw ata_generic pata_acpi dm_multipath qemu_fw_cfg [last unloaded: ib_uverbs] CPU: 3 UID: 0 PID: 34077 Comm: udisksd Not tainted 6.11.0-rc7 #319 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014 RIP: 0010:__mutex_lock+0xcf0/0x1220 Code: 08 84 d2 0f 85 c8 04 00 00 8b 15 ef b6 c8 01 85 d2 0f 85 78 f4 ff ff 48 c7 c6 20 93 ee af 48 c7 c7 60 91 ee af e8 f0 a7 6d fd <0f> 0b e9 5e f4 ff ff 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1 RSP: 0018:ffff88811305f760 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff88812c652058 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000001 RBP: ffff88811305f8b0 R08: 0000000000000001 R09: ffffed1075c36341 R10: ffff8883ae1b1a0b R11: 0000000000010498 R12: 0000000000000000 R13: 0000000000000000 R14: dffffc0000000000 R15: ffff88812c652058 FS: 00007f9713ae4980(0000) GS:ffff8883ae180000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fcd78483c7c CR3: 0000000122c38000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? __warn.cold+0x5b/0x1af ? __mutex_lock+0xcf0/0x1220 ? report_bug+0x1ec/0x390 ? handle_bug+0x3c/0x80 ? exc_invalid_op+0x13/0x40 ? asm_exc_invalid_op+0x16/0x20 ? __mutex_lock+0xcf0/0x1220 ? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp] ? __pfx___mutex_lock+0x10/0x10 ? __lock_acquire+0xd6a/0x59e0 ? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp] nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp] ? __pfx_nvme_tcp_get_address+0x10/0x10 [nvme_tcp] nvme_sysfs_show_address+0x81/0xc0 [nvme_core] dev_attr_show+0x42/0x80 ? __asan_memset+0x1f/0x40 sysfs_kf_seq_show+0x1f0/0x370 seq_read_iter+0x2cb/0x1130 ? rw_verify_area+0x3b1/0x590 ? __mutex_lock+0x433/0x1220 vfs_read+0x6a6/0xa20 ? lockdep_hardirqs_on+0x78/0x100 ? __pfx_vfs_read+0x10/0x10 ksys_read+0xf7/0x1d0 ? __pfx_ksys_read+0x10/0x10 ? __x64_sys_openat+0x105/0x1d0 do_syscall_64+0x93/0x180 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 ? __pfx_ksys_read+0x10/0x10 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 ? do_syscall_64+0x9f/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f9713f55cfa Code: 55 48 89 e5 48 83 ec 20 48 89 55 e8 48 89 75 f0 89 7d f8 e8 e8 74 f8 ff 48 8b 55 e8 48 8b 75 f0 41 89 c0 8b 7d f8 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 2e 44 89 c7 48 89 45 f8 e8 42 75 f8 ff 48 8b RSP: 002b:00007ffd7f512e70 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 000055c38f316859 RCX: 00007f9713f55cfa RDX: 0000000000000fff RSI: 00007ffd7f512eb0 RDI: 0000000000000011 RBP: 00007ffd7f512e90 R08: 0000000000000000 R09: 00000000ffffffff R10: 0000000000000000 R11: 0000000000000246 R12: 000055c38f317148 R13: 0000000000000000 R14: 00007f96f4004f30 R15: 000055c3b6b623c0 </TASK> The WARN is observed when the blktests test case nvme/014 is repeated with tcp transport. It is rare, and 200 times repeat is required to recreate in some test environments. To avoid the WARN, check the NVME_TCP_Q_LIVE flag before locking queue->queue_lock. The flag is cleared long time before the lock gets destroyed. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-17 15:08:56 +01:00
Vitaliy Shevtsov	c60af16e1d	nvmet-auth: assign dh_key to NULL after kfree_sensitive [ Upstream commit `d2f551b1f7` ] ctrl->dh_key might be used across multiple calls to nvmet_setup_dhgroup() for the same controller. So it's better to nullify it after release on error path in order to avoid double free later in nvmet_destroy_auth(). Found by Linux Verification Center (linuxtesting.org) with Svace. Fixes: `7a277c37d3` ("nvmet-auth: Diffie-Hellman key exchange support") Cc: stable@vger.kernel.org Signed-off-by: Vitaliy Shevtsov <v.shevtsov@maxima.ru> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-08 16:28:26 +01:00
Maurizio Lombardi	4ed32cc093	nvme-pci: fix race condition between reset and nvme_dev_disable() [ Upstream commit `26bc0a81f6` ] nvme_dev_disable() modifies the dev->online_queues field, therefore nvme_pci_update_nr_queues() should avoid racing against it, otherwise we could end up passing invalid values to blk_mq_update_nr_hw_queues(). WARNING: CPU: 39 PID: 61303 at drivers/pci/msi/api.c:347 pci_irq_get_affinity+0x187/0x210 Workqueue: nvme-reset-wq nvme_reset_work [nvme] RIP: 0010:pci_irq_get_affinity+0x187/0x210 Call Trace: <TASK> ? blk_mq_pci_map_queues+0x87/0x3c0 ? pci_irq_get_affinity+0x187/0x210 blk_mq_pci_map_queues+0x87/0x3c0 nvme_pci_map_queues+0x189/0x460 [nvme] blk_mq_update_nr_hw_queues+0x2a/0x40 nvme_reset_work+0x1be/0x2a0 [nvme] Fix the bug by locking the shutdown_lock mutex before using dev->online_queues. Give up if nvme_dev_disable() is running or if it has been executed already. Fixes: `949928c1c7` ("NVMe: Fix possible queue use after freed") Tested-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-01 01:58:25 +01:00
William Butler	e2c03a2c83	nvme-pci: set doorbell config before unquiescing [ Upstream commit `06c59d4270` ] During resets, if queues are unquiesced first, then the host can submit IOs to the controller using shadow doorbell logic but the controller won't be aware. This can lead to necessary MMIO doorbells from being not issued, causing requests to be delayed and timed-out. Signed-off-by: William Butler <wab@google.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Stable-dep-of: `26bc0a81f6` ("nvme-pci: fix race condition between reset and nvme_dev_disable()") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-11-01 01:58:25 +01:00
Hannes Reinecke	30b9bf4b41	nvme-multipath: system fails to create generic nvme device [ Upstream commit `63bcf9014e` ] NVME_NSHEAD_DISK_LIVE is a flag for struct nvme_ns_head, not nvme_ns. The current code has a typo causing NVME_NSHEAD_DISK_LIVE never to be cleared once device_add_disk_fails, causing the system never to create the 'generic' character device. Even several rescan attempts will change the situation and the system has to be rebooted to fix the issue. Fixes: `11384580e3` ("nvme-multipath: add error handling support for add_disk()") Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-10-04 16:29:38 +02:00
Keith Busch	ab51a98de8	nvme-pci: qdepth 1 quirk commit `83bdfcbdbe` upstream. Another device has been reported to be unreliable if we have more than one outstanding command. In this new case, data corruption may occur. Since we have two devices now needing this quirky behavior, make a generic quirk flag. The same Apple quirk is clearly not "temporary", so update the comment while moving it. Link: https://lore.kernel.org/linux-nvme/191d810a4e3.fcc6066c765804.973611676137075390@collabora.com/ Reported-by: Robert Beckett <bob.beckett@collabora.com> Reviewed-by: Christoph Hellwig hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Cc: "Gagniuc, Alexandru" <alexandru.gagniuc@hp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-09-30 16:25:14 +02:00
Keith Busch	05500a48d8	nvme-pci: allocate tagset on reset if necessary [ Upstream commit `6f01bdbfef` ] If a drive is unable to create IO queues on the initial probe, a subsequent reset will need to allocate the tagset if IO queue creation is successful. Without this, blk_mq_update_nr_hw_queues will crash on a bad pointer due to the invalid tagset. Fixes: `eac3ef2629` ("nvme-pci: split the initial probe from the rest path") Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-09-12 11:11:44 +02:00
Maurizio Lombardi	489f2913a6	nvmet-tcp: fix kernel crash if commands allocation fails [ Upstream commit `5572a55a6f` ] If the commands allocation fails in nvmet_tcp_alloc_cmds() the kernel crashes in nvmet_tcp_release_queue_work() because of a NULL pointer dereference. nvmet: failed to install queue 0 cntlid 1 ret 6 Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008 Fix the bug by setting queue->nr_cmds to zero in case nvmet_tcp_alloc_cmd() fails. Fixes: `872d26a391` ("nvmet-tcp: add NVMe over TCP target driver") Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-09-12 11:11:44 +02:00
Georg Gottleuber	77ee2eaee4	nvme-pci: Add sleep quirk for Samsung 990 Evo commit `61aa894e7a` upstream. On some TUXEDO platforms, a Samsung 990 Evo NVMe leads to a high power consumption in s2idle sleep (2-3 watts). This patch applies 'Force No Simple Suspend' quirk to achieve a sleep with a lower power consumption, typically around 0.5 watts. Signed-off-by: Georg Gottleuber <ggo@tuxedocomputers.com> Signed-off-by: Werner Sembach <wse@tuxedocomputers.com> Cc: <stable@vger.kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-09-12 11:11:25 +02:00
Keith Busch	9c375a9566	nvme: fix namespace removal list [ Upstream commit `ff0ffe5b7c` ] This function wants to move a subset of a list from one element to the tail into another list. It also needs to use the srcu synchronize instead of the regular rcu version. Do this one element at a time because that's the only to do it. Fixes: `be647e2c76` ("nvme: use srcu for iterating namespace list") Reported-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com> Tested-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-08-29 17:33:42 +02:00
Keith Busch	82f20194bf	nvme: use srcu for iterating namespace list [ Upstream commit `be647e2c76` ] The nvme pci driver synchronizes with all the namespace queues during a reset to ensure that there's no pending timeout work. Meanwhile the timeout work potentially iterates those same namespaces to freeze their queues. Each of those namespace iterations use the same read lock. If a write lock should somehow get between the synchronize and freeze steps, then forward progress is deadlocked. We had been relying on the nvme controller state machine to ensure the reset work wouldn't conflict with timeout work. That guarantee may be a bit fragile to rely on, so iterate the namespace lists without taking potentially circular locks, as reported by lockdep. Link: https://lore.kernel.org/all/20220930001943.zdbvolc3gkekfmcv@shindev/ Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-08-29 17:33:41 +02:00
Sagi Grimberg	73c50bd104	nvmet-rdma: fix possible bad dereference when freeing rsps [ Upstream commit `73964c1d07` ] It is possible that the host connected and saw a cm established event and started sending nvme capsules on the qp, however the ctrl did not yet see an established event. This is why the rsp_wait_list exists (for async handling of these cmds, we move them to a pending list). Furthermore, it is possible that the ctrl cm times out, resulting in a connect-error cm event. in this case we hit a bad deref [1] because in nvmet_rdma_free_rsps we assume that all the responses are in the free list. We are freeing the cmds array anyways, so don't even bother to remove the rsp from the free_list. It is also guaranteed that we are not racing anything when we are releasing the queue so no other context accessing this array should be running. [1]: -- Workqueue: nvmet-free-wq nvmet_rdma_free_queue_work [nvmet_rdma] [...] pc : nvmet_rdma_free_rsps+0x78/0xb8 [nvmet_rdma] lr : nvmet_rdma_free_queue_work+0x88/0x120 [nvmet_rdma] Call trace: nvmet_rdma_free_rsps+0x78/0xb8 [nvmet_rdma] nvmet_rdma_free_queue_work+0x88/0x120 [nvmet_rdma] process_one_work+0x1ec/0x4a0 worker_thread+0x48/0x490 kthread+0x158/0x160 ret_from_fork+0x10/0x18 -- Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-08-29 17:33:40 +02:00
Hannes Reinecke	4ff710fdf7	nvmet-tcp: do not continue for invalid icreq [ Upstream commit `0889d13b9e` ] When the length check for an icreq sqe fails we should not continue processing but rather return immediately as all other contents of that sqe cannot be relied on. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-08-29 17:33:39 +02:00
Keith Busch	5469f609be	nvme: clear caller pointer on identify failure [ Upstream commit `7e80eb792b` ] The memory allocated for the identification is freed on failure. Set it to NULL so the caller doesn't have a pointer to that freed address. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-08-29 17:33:38 +02:00
Daniel Wagner	5b7766ab91	nvmet-trace: avoid dereferencing pointer too early [ Upstream commit `0e716cec6f` ] The first command issued from the host to the target is the fabrics connect command. At this point, neither the target queue nor the controller have been allocated. But we already try to trace this command in nvmet_req_init. Reported by KASAN. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-08-29 17:33:32 +02:00
WangYuli	a47b54846a	nvme/pci: Add APST quirk for Lenovo N60z laptop commit `ab091ec536` upstream. There is a hardware power-saving problem with the Lenovo N60z board. When turn it on and leave it for 10 hours, there is a 20% chance that a nvme disk will not wake up until reboot. Link: https://lore.kernel.org/all/2B5581C46AC6E335+9c7a81f1-05fb-4fd0-9fbb-108757c21628@uniontech.com Signed-off-by: hmy <huanglin@uniontech.com> Signed-off-by: Wentao Guan <guanwentao@uniontech.com> Signed-off-by: WangYuli <wangyuli@uniontech.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-08-19 06:04:24 +02:00
Mikulas Patocka	c63b44fb33	block: change rq_integrity_vec to respect the iterator [ Upstream commit `cf546dd289` ] If we allocate a bio that is larger than NVMe maximum request size, attach integrity metadata to it and send it to the NVMe subsystem, the integrity metadata will be corrupted. Splitting the bio works correctly. The function bio_split will clone the bio, trim the iterator of the first bio and advance the iterator of the second bio. However, the function rq_integrity_vec has a bug - it returns the first vector of the bio's metadata and completely disregards the metadata iterator that was advanced when the bio was split. Thus, the second bio uses the same metadata as the first bio and this leads to metadata corruption. This commit changes rq_integrity_vec, so that it calls mp_bvec_iter_bvec instead of returning the first vector. mp_bvec_iter_bvec reads the iterator and uses it to build a bvec for the current position in the iterator. The "queue_max_integrity_segments(rq->q) > 1" check was removed, because the updated rq_integrity_vec function works correctly with multiple segments. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/49d1afaa-f934-6ed2-a678-e0d428c63a65@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-08-14 13:58:41 +02:00
Leon Romanovsky	77848b379e	nvme-pci: add missing condition check for existence of mapped data [ Upstream commit `c31fad1470` ] nvme_map_data() is called when request has physical segments, hence the nvme_unmap_data() should have same condition to avoid dereference. Fixes: `4aedb70543` ("nvme-pci: split metadata handling from nvme_map_data / nvme_unmap_data") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-08-03 08:54:41 +02:00
Bart Van Assche	8772be9670	nvme-pci: Fix the instructions for disabling power management [ Upstream commit `92fc2c469e` ] pcie_aspm=off tells the kernel not to modify the ASPM configuration. This setting does not guarantee that ASPM (Active State Power Management) is disabled. Hence add pcie_port_pm=off. This disables power management for all PCIe ports. This patch has been tested on a workstation with a Samsung SSD 970 EVO Plus NVMe SSD. Fixes: `4641a8e6e1` ("nvme-pci: add trouble shooting steps for timeouts") Cc: Keith Busch <kbusch@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-08-03 08:54:40 +02:00
Gaosheng Cui	908ce7f568	nvmet-auth: fix nvmet_auth hash error handling [ Upstream commit `89f58f96d1` ] If we fail to call nvme_auth_augmented_challenge, or fail to kmalloc for shash, we should free the memory allocation for challenge, so add err path out_free_challenge to fix the memory leak. Fixes: `7a277c37d3` ("nvmet-auth: Diffie-Hellman key exchange support") Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-08-03 08:53:22 +02:00
Boyang Yu	e7ba46682a	nvme: fix NVME_NS_DEAC may incorrectly identifying the disk as EXT_LBA. [ Upstream commit `9570a48847` ] The value of NVME_NS_DEAC is 3, which means NVME_NS_METADATA_SUPPORTED \| NVME_NS_EXT_LBAS. Provide a unique value for this feature flag. Fixes `1b96f862ec` ("nvme: implement the DEAC bit for the Write Zeroes command") Signed-off-by: Boyang Yu <yuboyang@dapustor.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-07-25 09:50:50 +02:00
Daniel Wagner	10967873b8	nvmet: always initialize cqe.result [ Upstream commit `cd0c1b8e04` ] The spec doesn't mandate that the first two double words (aka results) for the command queue entry need to be set to 0 when they are not used (not specified). Though, the target implemention returns 0 for TCP and FC but not for RDMA. Let's make RDMA behave the same and thus explicitly initializing the result field. This prevents leaking any data from the stack. Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-07-25 09:50:46 +02:00
Chunguang Xu	ae84383c96	nvme: avoid double free special payload [ Upstream commit `e5d574ab37` ] If a discard request needs to be retried, and that retry may fail before a new special payload is added, a double free will result. Clear the RQF_SPECIAL_LOAD when the request is cleaned. Signed-off-by: Chunguang Xu <chunguang.xu@shopee.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-07-25 09:50:46 +02:00
Sagi Grimberg	5502c1f1d0	nvmet: fix a possible leak when destroy a ctrl during qp establishment [ Upstream commit `c758b77d4a` ] In nvmet_sq_destroy we capture sq->ctrl early and if it is non-NULL we know that a ctrl was allocated (in the admin connect request handler) and we need to release pending AERs, clear ctrl->sqs and sq->ctrl (for nvme-loop primarily), and drop the final reference on the ctrl. However, a small window is possible where nvmet_sq_destroy starts (as a result of the client giving up and disconnecting) concurrently with the nvme admin connect cmd (which may be in an early stage). But before kill_and_confirm of sq->ref (i.e. the admin connect managed to get an sq live reference). In this case, sq->ctrl was allocated however after it was captured in a local variable in nvmet_sq_destroy. This prevented the final reference drop on the ctrl. Solve this by re-capturing the sq->ctrl after all inflight request has completed, where for sure sq->ctrl reference is final, and move forward based on that. This issue was observed in an environment with many hosts connecting multiple ctrls simoutanuosly, creating a delay in allocating a ctrl leading up to this race window. Reported-by: Alex Turin <alex@vastdata.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-07-11 12:49:21 +02:00
Kundan Kumar	b955b47905	nvme: adjust multiples of NVME_CTRL_PAGE_SIZE in offset [ Upstream commit `1bd293fcf3` ] bio_vec start offset may be relatively large particularly when large folio gets added to the bio. A bigger offset will result in avoiding the single-segment mapping optimization and end up using expensive mempool_alloc further. Rather than using absolute value, adjust bv_offset by NVME_CTRL_PAGE_SIZE while checking if segment can be fitted into one/two PRP entries. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-07-11 12:49:20 +02:00
Nilay Shroff	e6e1eda06b	nvme-multipath: find NUMA path only for online numa-node [ Upstream commit `d3a043733f` ] In current native multipath design when a shared namespace is created, we loop through each possible numa-node, calculate the NUMA distance of that node from each nvme controller and then cache the optimal IO path for future reference while sending IO. The issue with this design is that we may refer to the NUMA distance table for an offline node which may not be populated at the time and so we may inadvertently end up finding and caching a non-optimal path for IO. Then latter when the corresponding numa-node becomes online and hence the NUMA distance table entry for that node is created, ideally we should re-calculate the multipath node distance for the newly added node however that doesn't happen unless we rescan/reset the controller. So essentially, we may keep using non-optimal IO path for a node which is made online after namespace is created. This patch helps fix this issue ensuring that when a shared namespace is created, we calculate the multipath node distance for each online numa-node instead of each possible numa-node. Then latter when a node becomes online and we receive any IO on that newly added node, we would calculate the multipath node distance for newly added node but this time NUMA distance table would have been already populated for newly added node. Hence we would be able to correctly calculate the multipath node distance and choose the optimal path for the IO. Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-07-11 12:49:20 +02:00

1 2 3 4 5 ...

3299 Commits