twx-linux/include/uapi/linux
Daniel Borkmann e420bed025 bpf: Add fd-based tcx multi-prog infra with link support
This work refactors and adds a lightweight extension ("tcx") to the tc BPF
ingress and egress data path side, allowing BPF program management based on
fds via the bpf() syscall through the newly added generic multi-prog API.
The main goal behind this work, which we presented at LPC [0] last year and
updated at LSF/MM/BPF this year [3], is to support the long-awaited BPF link
functionality for tc BPF programs, which allows for a model of safe ownership
and program detachment.

Given the rise in tc BPF users in cloud native environments, this becomes
necessary to avoid hard-to-debug incidents either through stale leftover
programs or 3rd party applications accidentally stepping on each other's toes.
As a recap, a BPF link represents the attachment of a BPF program to a BPF
hook point. The BPF link holds a single reference to keep the BPF program alive.
Moreover, hook points do not reference a BPF link, only the application's
fd or pinning does. A BPF link holds meta-data specific to attachment and
implements operations for link creation, (atomic) BPF program update,
detachment and introspection. The motivation for BPF links for tc BPF programs
is multi-fold, for example:

  - From Meta: "It's especially important for applications that are deployed
    fleet-wide and that don't "control" hosts they are deployed to. If such
    application crashes and no one notices and does anything about that, BPF
    program will keep running draining resources or even just, say, dropping
    packets. We at FB had outages due to such permanent BPF attachment
    semantics. With fd-based BPF link we are getting a framework, which allows
    safe, auto-detachable behavior by default, unless application explicitly
    opts in by pinning the BPF link." [1]

  - From the Cilium side: the tc BPF programs we attach to host-facing veth
    devices and phys devices build the core datapath for Kubernetes Pods, and
    they implement forwarding, load-balancing, policy, EDT-management, etc.
    within BPF. Currently there is no concept of 'safe' ownership, e.g. we've
    recently experienced hard-to-debug issues in a user's staging environment
    where another Kubernetes application using tc BPF attached to the same
    prio/handle of cls_bpf, accidentally wiping all Cilium-based BPF programs
    from underneath it. The goal is to establish a clear/safe ownership model
    via links which cannot accidentally be overridden. [0,2]

BPF links for tc can co-exist with non-link attachments, and the semantics are
in line also with XDP links: BPF links cannot replace other BPF links, BPF
links cannot replace non-BPF links, non-BPF links cannot replace BPF links and
lastly only non-BPF links can replace non-BPF links. In the case of Cilium,
this would solve the mentioned issue of a safe ownership model, as 3rd party
applications would not be able to accidentally wipe Cilium programs, even if
they are not BPF link aware.
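
The replacement rule above reduces to a single condition. As a minimal sketch
(the enum and helper names are hypothetical and are not kernel code), it could
be written as:

```c
#include <stdbool.h>

/* Hypothetical illustration only; these names do not exist in the kernel. */
enum attach_kind {
	ATTACH_NON_LINK,	/* classic, non-link attachment */
	ATTACH_LINK,		/* attachment owned by a BPF link */
};

/* May an incoming attachment of kind 'incoming' replace 'existing'? */
static bool can_replace(enum attach_kind existing, enum attach_kind incoming)
{
	/* Only a non-link attachment may replace another non-link attachment. */
	return existing == ATTACH_NON_LINK && incoming == ATTACH_NON_LINK;
}
```

Any combination involving a link on either side is rejected, which is what
protects a link-based owner from applications that are not BPF link aware.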

Earlier attempts [4] have tried to integrate BPF links into core tc machinery
to solve cls_bpf, which has been intrusive to the generic tc kernel API with
extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could
be wiped from the qdisc also. Locking a tc BPF program in place this way gets
into layering hacks, given that the two object models are vastly different.

We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF
attach API, so that the BPF link implementation blends in naturally similar to
other link types which are fd-based and without the need for changing core tc
internal APIs. BPF programs for tc can then be successively migrated from classic
cls_bpf to the new tc BPF link without needing to change the program's source
code; only the BPF loader mechanics for attaching need to change.

For the current tc framework, there is no change in behavior with this change
and neither does this change touch on tc core kernel APIs. The gist of this
patch is that the ingress and egress hooks have a lightweight, qdisc-less
extension for BPF to attach its tc BPF programs, in other words, a minimal
entry point for tc BPF. The name tcx was suggested in the discussion of
earlier revisions of this work as a good fit, and to more easily differentiate
between the classic cls_bpf attachment and the fd-based one.

For the ingress and egress tcx points, the device holds a cache-friendly array
with program pointers which is separated from control plane (slow-path) data.
Earlier versions of this work used priority to determine ordering and expression
of dependencies, similar to classic tc, but this was challenged on the grounds
that something more future-proof with a better user experience is required.
Hence this resulted in the design and development of the generic
attach/detach/query API for multi-progs. See the prior patch with its discussion
on the API design. tcx is the first user and we plan to integrate others later;
for example, one candidate is multi-prog support for XDP, which would benefit
and have the same 'look and feel' from an API perspective.

The goal with tcx is to have maximum compatibility with existing tc BPF programs,
so they don't need to be rewritten specifically. Compatibility to call into
classic tcf_classify() is also provided in order to allow successive migration
or both to cleanly co-exist where needed, given it's all one logical tc layer and
the tcx plus classic tc cls/act build one logical overall processing pipeline.

tcx supports simplified return codes: TCX_NEXT, which is non-terminating (go
to next program), and the terminating ones TCX_PASS, TCX_DROP and TCX_REDIRECT.
The fd-based API is behind a static key, so that when unused the code is also
not entered. The struct tcx_entry's program array is currently static, but
could be made dynamic if necessary at some point in the future. The a/b pair
swap design has been chosen so that for detachment there are no allocations
which otherwise could fail.
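
To illustrate the return code semantics and the a/b pair swap, here is a small
userspace sketch. It is not the kernel implementation: the struct and function
names, the fixed capacity and the function-pointer "programs" are all
hypothetical stand-ins, and the synchronization a real fast path needs is
omitted. The numeric enum values are meant to mirror the uapi header and should
be treated as illustrative.

```c
#include <stddef.h>

/* Return codes named in the text; numeric values are illustrative. */
enum tcx_action { TCX_NEXT = -1, TCX_PASS = 0, TCX_DROP = 2, TCX_REDIRECT = 7 };

#define SKETCH_MAX_PROGS 4		/* hypothetical fixed capacity */

typedef int (*prog_fn)(void *pkt);	/* stand-in for a BPF program */

struct prog_array {
	prog_fn progs[SKETCH_MAX_PROGS];
	size_t count;
};

/* Two preallocated arrays; only the active one is read when running. */
struct prog_pair {
	struct prog_array a, b;
	struct prog_array *active;
};

static void pair_init(struct prog_pair *p)
{
	p->a.count = 0;
	p->b.count = 0;
	p->active = &p->a;
}

static struct prog_array *pair_inactive(struct prog_pair *p)
{
	return p->active == &p->a ? &p->b : &p->a;
}

/* Attach at the tail: build the new state in the inactive array, then swap. */
static int pair_attach(struct prog_pair *p, prog_fn fn)
{
	struct prog_array *cur = p->active, *next = pair_inactive(p);

	if (cur->count == SKETCH_MAX_PROGS)
		return -1;		/* array is full */
	*next = *cur;
	next->progs[next->count++] = fn;
	p->active = next;
	return 0;
}

/* Detach: copy everything except fn into the inactive array, then swap.
 * No memory is allocated, so this cannot fail for lack of memory. */
static int pair_detach(struct prog_pair *p, prog_fn fn)
{
	struct prog_array *cur = p->active, *next = pair_inactive(p);
	size_t i;

	next->count = 0;
	for (i = 0; i < cur->count; i++)
		if (cur->progs[i] != fn)
			next->progs[next->count++] = cur->progs[i];
	if (next->count == cur->count)
		return -1;		/* fn was not attached */
	p->active = next;
	return 0;
}

/* Run programs in order; TCX_NEXT continues, anything else terminates. */
static int run_progs(const struct prog_array *arr, void *pkt)
{
	size_t i;

	for (i = 0; i < arr->count; i++) {
		int ret = arr->progs[i](pkt);

		if (ret != TCX_NEXT)
			return ret;
	}
	return TCX_NEXT;		/* no program gave a terminating verdict */
}

/* Example programs for trying the sketch out. */
static int prog_next(void *pkt) { (void)pkt; return TCX_NEXT; }
static int prog_drop(void *pkt) { (void)pkt; return TCX_DROP; }
static int prog_pass(void *pkt) { (void)pkt; return TCX_PASS; }
```

With prog_next, prog_drop and prog_pass attached in that order, run_progs()
stops at prog_drop. After detaching prog_drop, the same call reaches prog_pass.
What the caller does when every program returns TCX_NEXT is outside this sketch.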

The work has been tested with the tc-testing selftest suite, which passes in
full, as well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB.

Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
of this work.

  [0] https://lpc.events/event/16/contributions/1353/
  [1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com
  [2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
  [3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
  [4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 10:07:27 -07:00
..
android
byteorder
caif
can can: uapi: move CAN_RAW_FILTER_MAX definition to raw.h 2023-06-22 09:44:28 +02:00
cifs
dvb media: dvb: bump DVB API version 2023-05-14 16:05:28 +01:00
genwqe
hdlc
hsi
iio
isdn
misc
mmc
netfilter netfilter: nf_tables: Introduce NFT_MSG_GETSETELEM_RESET 2023-06-26 08:05:57 +02:00
netfilter_arp
netfilter_bridge
netfilter_ipv4
netfilter_ipv6
nfsd NFSD: Handle new xprtsec= export option 2023-04-27 18:49:24 -04:00
raid
sched
spi spi: add SPI_MOSI_IDLE_LOW mode bit 2023-05-30 15:20:08 +01:00
sunrpc
surface_aggregator
tc_act
tc_ematch
usb hardening fixes for v6.5-rc1 2023-07-08 12:08:39 -07:00
a.out.h
acct.h
acrn.h
adb.h
adfs_fs.h
affs_hardblocks.h block: change all __u32 annotations to __be32 in affs_hardblocks.h 2023-06-20 14:28:17 -06:00
agpgart.h
aio_abi.h
am437x-vpfe.h
amt.h
apm_bios.h
arcfb.h
arm_sdei.h
aspeed-lpc-ctrl.h
aspeed-p2a-ctrl.h
aspeed-video.h
atalk.h
atm_eni.h
atm_he.h
atm_idt77105.h
atm_nicstar.h
atm_tcp.h
atm_zatm.h
atm.h
atmapi.h
atmarp.h
atmbr2684.h
atmclip.h
atmdev.h
atmioc.h
atmlec.h
atmmpc.h
atmppp.h
atmsap.h
atmsvc.h
audit.h
auto_dev-ioctl.h autofs: use flexible array in ioctl structure 2023-05-30 16:42:00 -07:00
auto_fs4.h
auto_fs.h
auxvec.h
ax25.h
batadv_packet.h
batman_adv.h
baycom.h
bcm933xx_hcs.h
bfs_fs.h
binfmts.h
blkpg.h
blktrace_api.h
blkzoned.h
bpf_common.h
bpf_perf_event.h
bpf.h bpf: Add fd-based tcx multi-prog infra with link support 2023-07-19 10:07:27 -07:00
bpfilter.h
bpqether.h
bsg.h
bt-bmc.h
btf.h
btrfs_tree.h
btrfs.h
cachefiles.h
can.h can: uapi: move CAN_RAW_FILTER_MAX definition to raw.h 2023-06-22 09:44:28 +02:00
capability.h capability: erase checker warnings about struct __user_cap_data_struct 2023-06-06 17:05:54 -04:00
capi.h
cciss_defs.h
cciss_ioctl.h
ccs.h
cdrom.h
cec-funcs.h
cec.h
cfm_bridge.h
cgroupstats.h
chio.h
close_range.h
cn_proc.h
coda.h
coff.h
comedi.h
connector.h
const.h uapi/linux/const.h: prefer ISO-friendly __typeof__ 2023-04-18 16:39:34 -07:00
coresight-stm.h
counter.h counter: i8254: Introduce the Intel 8254 interface library module 2023-06-08 10:11:17 -04:00
cramfs_fs.h
cryptouser.h
cuda.h
cxl_mem.h cxl/mbox: Deprecate poison commands 2023-04-22 14:41:30 -07:00
cyclades.h
cycx_cfm.h
dcbnl.h
dccp.h
devlink.h
dlm_device.h
dlm_plock.h
dlm.h
dlmconstants.h
dm-ioctl.h
dm-log-userspace.h
dma-buf.h
dma-heap.h
dns_resolver.h
dqblk_xfs.h
dw100.h
edd.h
efs_fs_sb.h
elf-em.h
elf-fdpic.h
elf.h RISC-V Patches for the 6.5 Merge Window, Part 1 2023-06-30 09:37:26 -07:00
errno.h
errqueue.h
erspan.h
ethtool_netlink.h net: ethtool: correct MAX attribute value for stats 2023-06-12 08:50:48 +01:00
ethtool.h
eventfd.h eventfd: add a uapi header for eventfd userspace APIs 2023-06-15 14:55:15 +02:00
eventpoll.h
ext4.h ext4: Add a uapi header for ext4 userspace APIs 2023-04-19 23:39:42 -04:00
f2fs.h
fadvise.h
falloc.h
fanotify.h
fb.h
fcntl.h exportfs: allow exporting non-decodeable file handles to userspace 2023-05-25 13:16:57 +02:00
fd.h
fdreg.h
fib_rules.h
fiemap.h
filter.h
firewire-cdev.h firewire: fix warnings to generate UAPI documentation 2023-06-06 07:54:00 +09:00
firewire-constants.h
fou.h
fpga-dfl.h
fs.h
fscrypt.h
fsi.h
fsl_hypervisor.h
fsl_mc.h
fsmap.h
fsverity.h
fuse.h
futex.h
gameport.h
gen_stats.h
genetlink.h
gfs2_ondisk.h
gpio.h
gsmmux.h
gtp.h
handshake.h net/handshake: Enable the SNI extension to work properly 2023-05-24 22:05:24 -07:00
hash_info.h
hdlc.h
hdlcdrv.h
hdreg.h
hid.h
hiddev.h
hidraw.h
hpet.h
hsr_netlink.h
hw_breakpoint.h
hyperv.h
i2c-dev.h
i2c.h
i2o-dev.h
i8k.h
icmp.h
icmpv6.h
idxd.h
if_addr.h
if_addrlabel.h
if_alg.h
if_arcnet.h
if_arp.h
if_bonding.h
if_bridge.h bridge: vlan: Allow setting VLAN neighbor suppression state 2023-04-21 08:25:50 +01:00
if_cablemodem.h
if_eql.h
if_ether.h
if_fc.h
if_fddi.h
if_hippi.h
if_infiniband.h
if_link.h net: vxlan: Add nolocalbypass option to vxlan. 2023-05-13 17:02:33 +01:00
if_ltalk.h
if_macsec.h
if_packet.h net/packet: support mergeable feature of virtio 2023-04-21 12:01:58 +01:00
if_phonet.h
if_plip.h
if_ppp.h
if_pppol2tp.h
if_pppox.h
if_slip.h
if_team.h
if_tun.h
if_tunnel.h
if_vlan.h
if_x25.h
if_xdp.h xsk: introduce XSK_USE_SG bind flag for xsk socket 2023-07-19 09:56:48 -07:00
if.h
ife.h
igmp.h
ila.h
in6.h
in_route.h
in.h ipv{4,6}/raw: fix output xfrm lookup wrt protocol 2023-05-23 15:38:59 +02:00
inet_diag.h
inotify.h
input-event-codes.h
input.h
io_uring.h nvme: improved uring polling 2023-06-28 16:09:41 -06:00
ioam6_genl.h
ioam6_iptunnel.h
ioam6.h
ioctl.h
iommu.h
iommufd.h
ioprio.h scsi: block: Improve ioprio value validity checks 2023-06-16 12:04:30 -04:00
ip6_tunnel.h
ip_vs.h
ip.h
ipc.h
ipmi_bmc.h
ipmi_msgdefs.h
ipmi_ssif_bmc.h
ipmi.h
ipsec.h
ipv6_route.h
ipv6.h
irqnr.h
iso_fs.h
isst_if.h
ivtv.h
ivtvfb.h
jffs2.h
joystick.h
kcm.h
kcmp.h
kcov.h
kd.h
kdev_t.h
kernel-page-flags.h
kernel.h
kernelcapi.h
kexec.h
keyboard.h
keyctl.h
kfd_ioctl.h drm/amdkfd: bump kfd ioctl minor version for event age availability 2023-06-15 11:37:55 -04:00
kfd_sysfs.h drm/amdkfd: display debug capabilities 2023-06-09 12:34:45 -04:00
kvm_para.h
kvm.h Common KVM changes for 6.5: 2023-07-01 07:07:55 -04:00
l2tp.h
landlock.h
libc-compat.h
limits.h
lirc.h
llc.h
loadpin.h
loop.h
lp.h
lwtunnel.h
magic.h
major.h
map_to_7segment.h
map_to_14segment.h
matroxfb.h
max2175.h
mctp.h
mdio.h net: mdio: add clause 73 to ethtool conversion helper 2023-05-24 09:13:22 -07:00
media-bus-format.h
media.h media: uapi: Use unsigned int values for assigning bits in u32 fields 2023-05-25 16:21:22 +02:00
mei_uuid.h
mei.h
membarrier.h
memfd.h
mempolicy.h
mii.h
minix_fs.h
mman.h cachestat: implement cachestat syscall 2023-06-09 16:25:16 -07:00
mmtimer.h
module.h
mount.h fs: allow to mount beneath top mount 2023-05-19 04:30:22 +02:00
mpls_iptunnel.h
mpls.h
mptcp.h mptcp: introduce MPTCP_FULL_INFO getsockopt 2023-06-21 22:45:57 -07:00
mqueue.h
mroute6.h
mroute.h
mrp_bridge.h
msdos_fs.h
msg.h
mtio.h
nbd-netlink.h
nbd.h uapi nbd: add cookie alias to handle 2023-04-27 19:15:11 -06:00
ncsi.h
ndctl.h
neighbour.h
net_dropmon.h
net_namespace.h
net_tstamp.h
net.h
netconf.h
netdev.h xsk: add new netlink attribute dedicated for ZC max frags 2023-07-19 09:56:49 -07:00
netdevice.h
netfilter_arp.h
netfilter_bridge.h
netfilter_ipv4.h
netfilter_ipv6.h
netfilter.h
netlink_diag.h
netlink.h
netrom.h
nexthop.h
nfc.h
nfs2.h
nfs3.h
nfs4_mount.h
nfs4.h
nfs_fs.h
nfs_idmap.h
nfs_mount.h
nfs.h
nfsacl.h
nilfs2_api.h
nilfs2_ondisk.h
nitro_enclaves.h
nl80211-vnd-intel.h
nl80211.h wifi: nl80211/reg: add no-EHT regulatory flag 2023-06-21 14:01:29 +02:00
nsfs.h
nubus.h
nvme_ioctl.h
nvram.h
omap3isp.h
omapfb.h
oom.h
openat2.h
openvswitch.h net: openvswitch: add support for l4 symmetric hashing 2023-06-12 09:46:30 +01:00
packet_diag.h
param.h
parport.h
patchkey.h
pci_regs.h PCI: Add PCI_EXT_CAP_ID_PL_32GT define 2023-05-31 16:34:38 -05:00
pci.h
pcitest.h
perf_event.h
personality.h
pfkeyv2.h
pfrut.h
pg.h
phantom.h
phonet.h
pidfd.h
pkt_cls.h net: flower: add support for matching cfm fields 2023-06-12 17:01:45 -07:00
pkt_sched.h net/sched: taprio: add netlink reporting for offload statistics counters 2023-05-31 10:00:30 +01:00
pktcdvd.h pktcdvd: Get rid of custom printing macros 2023-06-07 14:26:09 -06:00
pmu.h
poll.h
posix_acl_xattr.h
posix_acl.h
posix_types.h
ppdev.h
ppp_defs.h
ppp-comp.h
ppp-ioctl.h
pps.h
pr.h
prctl.h riscv: Add prctl controls for userspace vector management 2023-06-08 07:16:53 -07:00
psample.h
psci.h
psp-sev.h
ptp_clock.h ptp: Add .getmaxphase callback to ptp_clock_info 2023-06-20 09:02:33 +01:00
ptrace.h
qemu_fw_cfg.h
qnx4_fs.h
qnxtypes.h
qrtr.h
quota.h
radeonfb.h
random.h
rds.h
reboot.h
reiserfs_fs.h
reiserfs_xattr.h
remoteproc_cdev.h
resource.h
rfkill.h
rio_cm_cdev.h
rio_mport_cdev.h
rkisp1-config.h
romfs_fs.h
rose.h
route.h
rpl_iptunnel.h
rpl.h
rpmsg_types.h
rpmsg.h
rseq.h
rtc.h
rtnetlink.h
rxrpc.h
scc.h
sched.h
scif_ioctl.h
screen_info.h
sctp.h
seccomp.h
securebits.h
sed-opal.h sed-opal: geometry feature reporting command 2023-04-19 14:07:13 -06:00
seg6_genl.h
seg6_hmac.h
seg6_iptunnel.h
seg6_local.h
seg6.h
selinux_netlink.h
sem.h
serial_core.h
serial_reg.h
serial.h
serio.h
sev-guest.h
shm.h
signal.h
signalfd.h
smc_diag.h
smc.h
smiapp.h
snmp.h
sock_diag.h
socket.h
sockios.h
sonet.h
sonypi.h
sound.h
soundcard.h
stat.h
stddef.h
stm.h
string.h
suspend_ioctls.h
swab.h
switchtec_ioctl.h
sync_file.h
synclink.h
sysctl.h
sysinfo.h
target_core_user.h
taskstats.h
tcp_metrics.h
tcp.h
tdx-guest.h
tee.h
termios.h
thermal.h
time_types.h
time.h
timerfd.h
times.h
timex.h
tiocl.h
tipc_config.h
tipc_netlink.h
tipc_sockets_diag.h
tipc.h
tls.h
toshiba.h
tps6594_pfsm.h misc: tps6594-pfsm: Add driver for TI TPS6594 PFSM 2023-06-15 13:41:53 +02:00
tty_flags.h
tty.h
types.h types: Introduce [us]128 2023-06-05 09:36:35 +02:00
ublk_cmd.h ublk: add control command of UBLK_U_CMD_GET_FEATURES 2023-06-04 08:34:14 -06:00
udf_fs_i.h
udmabuf.h
udp.h
uhid.h
uinput.h
uio.h
uleds.h
ultrasound.h
um_timetravel.h
un.h
unistd.h
unix_diag.h
usbdevice_fs.h
usbip.h
user_events.h
userfaultfd.h
userio.h
utime.h
utsname.h
uuid.h
uvcvideo.h
v4l2-common.h
v4l2-controls.h media: Add AV1 uAPI 2023-06-09 16:13:01 +01:00
v4l2-dv-timings.h
v4l2-mediabus.h
v4l2-subdev.h
vbox_err.h
vbox_vmmdev_types.h
vboxguest.h
vdpa.h
vduse.h
veth.h
vfio_ccw.h
vfio_zdev.h
vfio.h VFIO updates for v6.5-rc1 2023-06-30 15:22:09 -07:00
vhost_types.h vhost: allow userspace to create workers 2023-07-03 12:15:14 -04:00
vhost.h vhost: Allow worker switching while work is queueing 2023-07-03 12:15:14 -04:00
videodev2.h media: Add NV15_4L4 pixel format 2023-06-09 16:14:40 +01:00
virtio_9p.h
virtio_balloon.h
virtio_blk.h
virtio_bt.h
virtio_config.h virtio: add VIRTIO_F_NOTIFICATION_DATA feature support 2023-04-21 03:02:35 -04:00
virtio_console.h
virtio_crypto.h
virtio_fs.h
virtio_gpio.h
virtio_gpu.h
virtio_i2c.h
virtio_ids.h
virtio_input.h
virtio_iommu.h
virtio_mem.h
virtio_mmio.h
virtio_net.h
virtio_pci.h
virtio_pcidev.h
virtio_pmem.h
virtio_ring.h
virtio_rng.h
virtio_scmi.h
virtio_scsi.h
virtio_snd.h
virtio_types.h
virtio_vsock.h
vm_sockets_diag.h
vm_sockets.h
vmcore.h
vsockmon.h
vt.h
vtpm_proxy.h
wait.h
watch_queue.h
watchdog.h
wireguard.h
wireless.h uapi: wireless: Replace zero-length array with flexible-array member 2023-05-28 19:07:48 -06:00
wmi.h
wwan.h
x25.h
xattr.h
xdp_diag.h
xfrm.h
xilinx-v4l2-controls.h
zorro_ids.h
zorro.h