twx-linux/net/core
Linus Torvalds 672dcda246 vfs-6.17-rc1.pidfs
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCiQAKCRCRxhvAZXjc
 orltAQDq3y1anYETz5/FD6P2gXY1W5hXdSm3EHHeacQ1JjTXvgEA2g1lWO7J4anf
 oOVE8aSvMow/FOjivLZBYmI65pkYJAE=
 =oDKB
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.17-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull pidfs updates from Christian Brauner:

 - persistent info

   Persist exit and coredump information independent of whether anyone
   currently holds a pidfd for the struct pid.

   The current scheme allocated pidfs dentries on-demand repeatedly.
   This scheme is reaching it's limits as it makes it impossible to pin
   information that needs to be available after the task has exited or
   coredumped and that should not be lost simply because the pidfd got
   closed temporarily. The next opener should still see the stashed
   information.

   This is also a prerequisite for supporting extended attributes on
   pidfds to allow attaching meta information to them.

   If someone opens a pidfd for a struct pid a pidfs dentry is allocated
   and stashed in pid->stashed. Once the last pidfd for the struct pid
   is closed the pidfs dentry is released and removed from pid->stashed.

   So if 10 callers create a pidfs dentry for the same struct pid
   sequentially, i.e., each closing the pidfd before the other creates a
   new one then a new pidfs dentry is allocated every time.

   Because multiple tasks acquiring and releasing a pidfd for the same
   struct pid can race with each another a task may still find a valid
   pidfs entry from the previous task in pid->stashed and reuse it. Or
   it might find a dead dentry in there and fail to reuse it and so
   stashes a new pidfs dentry. Multiple tasks may race to stash a new
   pidfs dentry but only one will succeed, the other ones will put their
   dentry.

   The current scheme aims to ensure that a pidfs dentry for a struct
   pid can only be created if the task is still alive or if a pidfs
   dentry already existed before the task was reaped and so exit
   information has been was stashed in the pidfs inode.

   That's great except that it's buggy. If a pidfs dentry is stashed in
   pid->stashed after pidfs_exit() but before __unhash_process() is
   called we will return a pidfd for a reaped task without exit
   information being available.

   The pidfds_pid_valid() check does not guard against this race as it
   doens't sync at all with pidfs_exit(). The pid_has_task() check might
   be successful simply because we're before __unhash_process() but
   after pidfs_exit().

   Introduce a new scheme where the lifetime of information associated
   with a pidfs entry (coredump and exit information) isn't bound to the
   lifetime of the pidfs inode but the struct pid itself.

   The first time a pidfs dentry is allocated for a struct pid a struct
   pidfs_attr will be allocated which will be used to store exit and
   coredump information.

   If all pidfs for the pidfs dentry are closed the dentry and inode can
   be cleaned up but the struct pidfs_attr will stick until the struct
   pid itself is freed. This will ensure minimal memory usage while
   persisting relevant information.

   The new scheme has various advantages. First, it allows to close the
   race where we end up handing out a pidfd for a reaped task for which
   no exit information is available. Second, it minimizes memory usage.
   Third, it allows to remove complex lifetime tracking via dentries
   when registering a struct pid with pidfs. There's no need to get or
   put a reference. Instead, the lifetime of exit and coredump
   information associated with a struct pid is bound to the lifetime of
   struct pid itself.

 - extended attributes

   Now that we have a way to persist information for pidfs dentries we
   can start supporting extended attributes on pidfds. This will allow
   userspace to attach meta information to tasks.

   One natural extension would be to introduce a custom pidfs.* extended
   attribute space and allow for the inheritance of extended attributes
   across fork() and exec().

   The first simple scheme will allow privileged userspace to set
   trusted extended attributes on pidfs inodes.

 - Allow autonomous pidfs file handles

   Various filesystems such as pidfs and drm support opening file
   handles without having to require a file descriptor to identify the
   filesystem. The filesystem are global single instances and can be
   trivially identified solely on the information encoded in the file
   handle.

   This makes it possible to not have to keep or acquire a sentinal file
   descriptor just to pass it to open_by_handle_at() to identify the
   filesystem. That's especially useful when such sentinel file
   descriptor cannot or should not be acquired.

   For pidfs this means a file handle can function as full replacement
   for storing a pid in a file. Instead a file handle can be stored and
   reopened purely based on the file handle.

   Such autonomous file handles can be opened with or without specifying
   a a file descriptor. If no proper file descriptor is used the
   FD_PIDFS_ROOT sentinel must be passed. This allows us to define
   further special negative fd sentinels in the future.

   Userspace can trivially test for support by trying to open the file
   handle with an invalid file descriptor.

 - Allow pidfds for reaped tasks with SCM_PIDFD messages

   This is a logical continuation of the earlier work to create pidfds
   for reaped tasks through the SO_PEERPIDFD socket option merged in
   923ea4d4482b ("Merge patch series "net, pidfs: enable handing out
   pidfds for reaped sk->sk_peer_pid"").

 - Two minor fixes:

    * Fold fs_struct->{lock,seq} into a seqlock

    * Don't bother with path_{get,put}() in unix_open_file()

* tag 'vfs-6.17-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (37 commits)
  don't bother with path_get()/path_put() in unix_open_file()
  fold fs_struct->{lock,seq} into a seqlock
  selftests: net: extend SCM_PIDFD test to cover stale pidfds
  af_unix: enable handing out pidfds for reaped tasks in SCM_PIDFD
  af_unix: stash pidfs dentry when needed
  af_unix/scm: fix whitespace errors
  af_unix: introduce and use scm_replace_pid() helper
  af_unix: introduce unix_skb_to_scm helper
  af_unix: rework unix_maybe_add_creds() to allow sleep
  selftests/pidfd: decode pidfd file handles withou having to specify an fd
  fhandle, pidfs: support open_by_handle_at() purely based on file handle
  uapi/fcntl: add FD_PIDFS_ROOT
  uapi/fcntl: add FD_INVALID
  fcntl/pidfd: redefine PIDFD_SELF_THREAD_GROUP
  uapi/fcntl: mark range as reserved
  fhandle: reflow get_path_anchor()
  pidfs: add pidfs_root_path() helper
  fhandle: rename to get_path_anchor()
  fhandle: hoist copy_from_user() above get_path_from_fd()
  fhandle: raise FILEID_IS_DIR in handle_type
  ...
2025-07-28 14:10:15 -07:00
..
bpf_sk_storage.c bpf: Remove unnecessary BTF lookups in bpf_sk_storage_tracing_allowed 2025-01-29 08:51:51 -08:00
datagram.c net: devmem: support single IOV with sendmsg 2025-05-26 10:00:48 +01:00
dev_addr_lists_test.c net: dev_addr_lists: move locking out of init/exit in kunit 2024-04-15 10:26:35 +01:00
dev_addr_lists.c net: ti: icssg-prueth: Add Multicast Filtering support for VLAN in MAC mode 2025-01-14 12:17:27 +01:00
dev_api.c net: core: Convert dev_set_mac_address_user() to use struct sockaddr_storage 2025-05-27 08:25:43 +02:00
dev_ioctl.c net: core: Convert dev_set_mac_address_user() to use struct sockaddr_storage 2025-05-27 08:25:43 +02:00
dev.c net: annotate data-races around cleanup_net_task 2025-06-05 08:02:26 -07:00
dev.h netdev: add "ops compat locking" helpers 2025-04-09 17:01:51 -07:00
devmem.c net: devmem: move list_add to net_devmem_bind_dmabuf. 2025-05-27 19:19:35 -07:00
devmem.h net: Fix net_devmem_bind_dmabuf for non-devmem configs 2025-05-30 19:23:36 -07:00
drop_monitor.c treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
dst_cache.c net: dst_cache: Use nested-BH locking for dst_cache::cache 2025-05-15 15:23:30 +02:00
dst.c net: decrease cached dst counters in dst_release 2025-04-03 13:05:07 +02:00
failover.c
fib_notifier.c net: do not acquire rtnl in fib_seq_sum() 2024-10-11 15:35:05 -07:00
fib_rules.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2025-04-17 12:26:50 -07:00
filter.c net: clear the dst when changing skb protocol 2025-06-11 17:02:29 -07:00
flow_dissector.c net: remove '__' from __skb_flow_get_ports() 2025-02-24 14:27:53 -08:00
flow_offload.c
gen_estimator.c treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
gen_stats.c
gro_cells.c net: move netdev_max_backlog to net_hotdata 2024-03-07 21:12:42 -08:00
gro.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2025-02-27 10:20:58 -08:00
gso.c net: introduce struct net_hotdata 2024-03-07 21:12:41 -08:00
hotdata.c net: introduce per netns packet chains 2025-03-24 13:58:22 -07:00
hwbm.c
ieee8021q_helpers.c net: add IEEE 802.1q specific helpers 2024-05-08 10:35:09 +01:00
link_watch.c net: hold instance lock during NETDEV_CHANGE 2025-04-07 11:13:39 -07:00
lock_debug.c netdev: fix the locking for netdev notifications 2025-04-17 18:55:14 -07:00
lwt_bpf.c bpf: lwtunnel: Prepare bpf_lwt_xmit_reroute() to future .flowi4_tos conversion. 2024-11-14 19:07:49 -08:00
lwtunnel.c inet: Remove rtnl_is_held arg of lwtunnel_valid_encap_type(_attr)?(). 2025-05-20 19:18:24 -07:00
Makefile net: rename rtnl_net_debug to lock_debug 2025-04-03 15:32:08 -07:00
mp_dmabuf_devmem.h memory-provider: dmabuf devmem memory provider 2024-09-11 20:44:31 -07:00
neighbour.c treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
net_namespace.c netns: use stable inode number for initial mount ns 2025-06-11 11:59:08 +02:00
net_test.c pfcp: always set pfcp metadata 2024-04-01 10:49:28 +01:00
net-procfs.c net: add data-race annotations in softnet_seq_show() 2025-04-08 12:30:55 -07:00
net-sysfs.c net: designate queue counts as "double ops protected" by instance lock 2025-03-25 10:06:49 -07:00
net-sysfs.h
net-traces.c move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
netclassid_cgroup.c
netdev_rx_queue.c net: avoid false positive warnings in __net_mp_close_rxq() 2025-04-04 07:35:38 -07:00
netdev-genl-gen.c net: devmem: TCP tx netlink api 2025-05-13 11:12:48 +02:00
netdev-genl-gen.h net: devmem: TCP tx netlink api 2025-05-13 11:12:48 +02:00
netdev-genl.c net: devmem: move list_add to net_devmem_bind_dmabuf. 2025-05-27 19:19:35 -07:00
netevent.c
netmem_priv.h page_pool: Track DMA-mapped pages and unmap them when destroying the pool 2025-04-14 16:30:29 -07:00
netpoll.c net: netpoll: Initialize UDP checksum field before checksumming 2025-06-23 13:14:51 -07:00
netprio_cgroup.c
of_net.c
page_pool_priv.h net: page_pool: don't try to stash the napi id 2025-01-27 14:37:41 -08:00
page_pool_user.c net: use napi_id_valid helper 2025-02-17 16:43:04 -08:00
page_pool.c page_pool: Fix use-after-free in page_pool_recycle_in_ring 2025-05-28 19:19:36 -07:00
pktgen.c net: pktgen: fix code style (WARNING: Prefer strscpy over strcpy) 2025-04-17 13:02:41 +02:00
ptp_classifier.c
request_sock.c tcp: make sure init the accept_queue's spinlocks once 2024-01-19 21:13:25 -08:00
rtnetlink.c net: prevent a NULL deref in rtnl_create_link() 2025-06-05 08:03:00 -07:00
scm.c af_unix: enable handing out pidfds for reaped tasks in SCM_PIDFD 2025-07-04 09:32:35 +02:00
secure_seq.c net: Retire DCCP socket. 2025-04-11 18:58:10 -07:00
selftests.c net: selftests: fix TCP packet checksum 2025-06-26 10:50:49 +02:00
skb_fault_injection.c net: Implement fault injection forcing skb reallocation 2024-11-12 12:05:33 +01:00
skbuff.c net: netmem: fix skb_ensure_writable with unreadable skbs 2025-06-17 15:48:20 -07:00
skmsg.c bpf, sockmap: Avoid using sk_socket after free when sending 2025-05-22 16:16:37 -07:00
sock_destructor.h
sock_diag.c net: Retire DCCP socket. 2025-04-11 18:58:10 -07:00
sock_map.c BPF fixes: 2025-02-20 15:37:17 -08:00
sock_reuseport.c net: core: annotate socks of struct sock_reuseport with __counted_by 2024-08-02 17:16:59 -07:00
sock.c Fix sock_exceed_buf_limit not being triggered in __sk_mem_raise_allocated 2025-05-28 19:07:53 -07:00
stream.c
sysctl_net_core.c net: rps: remove kfree_rcu_mightsleep() use 2025-04-08 12:30:55 -07:00
timestamping.c net: Add the possibility to support a selected hwtstamp in netdevice 2024-12-16 12:51:40 +00:00
tso.c move asm/unaligned.h to linux/unaligned.h 2024-10-02 17:23:23 -04:00
utils.c net: Fix checksum update for ILA adj-transport 2025-05-30 19:53:51 -07:00
xdp.c xsk: add missing virtual address conversion for page 2025-05-27 11:46:47 +02:00