twx-linux/include/net
Jussi Maki 9e2ee5c7e7 net, bonding: Add XDP support to the bonding driver
XDP is implemented in the bonding driver by transparently delegating
the XDP program loading, removal and xmit operations to the bonding
slave devices. The overall goal of this work is that XDP programs
can be attached to a bond device *without* any further changes (or
awareness) necessary to the program itself, meaning the same XDP
program can be attached to a native device but also a bonding device.

Semantics of XDP_TX when attached to a bond are equivalent in such
setting to the case when a tc/BPF program would be attached to the
bond, meaning transmitting the packet out of the bond itself using one
of the bond's configured xmit methods to select a slave device (rather
than XDP_TX on the slave itself). Handling of XDP_TX to transmit
using the configured bonding mechanism is therefore implemented by
rewriting the BPF program return value in bpf_prog_run_xdp. To avoid
performance impact this check is guarded by a static key, which is
incremented when a XDP program is loaded onto a bond device. This
approach was chosen to avoid changes to drivers implementing XDP. If
the slave device does not match the receive device, then XDP_REDIRECT
is transparently used to perform the redirection in order to have
the network driver release the packet from its RX ring. The bonding
driver hashing functions have been refactored to allow reuse with
xdp_buff's to avoid code duplication.

The motivation for this change is to enable use of bonding (and
802.3ad) in hairpinning L4 load-balancers such as [1] implemented with
XDP and also to transparently support bond devices for projects that
use XDP given most modern NICs have dual port adapters. An alternative
to this approach would be to implement 802.3ad in user-space and
implement the bonding load-balancing in the XDP program itself, but
is rather a cumbersome endeavor in terms of slave device management
(e.g. by watching netlink) and requires separate programs for native
vs bond cases for the orchestrator. A native in-kernel implementation
overcomes these issues and provides more flexibility.

Below are benchmark results done on two machines with 100Gbit
Intel E810 (ice) NIC and with 32-core 3970X on sending machine, and
16-core 3950X on receiving machine. 64 byte packets were sent with
pktgen-dpdk at full rate. Two issues [2, 3] were identified with the
ice driver, so the tests were performed with iommu=off and patch [2]
applied. Additionally the bonding round robin algorithm was modified
to use per-cpu tx counters as high CPU load (50% vs 10%) and high rate
of cache misses were caused by the shared rr_tx_counter (see patch
2/3). The statistics were collected using "sar -n dev -u 1 10". On top
of that, for ice, further work is in progress on improving the XDP_TX
numbers [4].

 -----------------------|  CPU  |--| rxpck/s |--| txpck/s |----
 without patch (1 dev):
   XDP_DROP:              3.15%      48.6Mpps
   XDP_TX:                3.12%      18.3Mpps     18.3Mpps
   XDP_DROP (RSS):        9.47%      116.5Mpps
   XDP_TX (RSS):          9.67%      25.3Mpps     24.2Mpps
 -----------------------
 with patch, bond (1 dev):
   XDP_DROP:              3.14%      46.7Mpps
   XDP_TX:                3.15%      13.9Mpps     13.9Mpps
   XDP_DROP (RSS):        10.33%     117.2Mpps
   XDP_TX (RSS):          10.64%     25.1Mpps     24.0Mpps
 -----------------------
 with patch, bond (2 devs):
   XDP_DROP:              6.27%      92.7Mpps
   XDP_TX:                6.26%      17.6Mpps     17.5Mpps
   XDP_DROP (RSS):       11.38%      117.2Mpps
   XDP_TX (RSS):         14.30%      28.7Mpps     27.4Mpps
 --------------------------------------------------------------

RSS: Receive Side Scaling, e.g. the packets were sent to a range of
destination IPs.

  [1]: https://cilium.io/blog/2021/05/20/cilium-110#standalonelb
  [2]: https://lore.kernel.org/bpf/20210601113236.42651-1-maciej.fijalkowski@intel.com/T/#t
  [3]: https://lore.kernel.org/bpf/CAHn8xckNXci+X_Eb2WMv4uVYjO2331UWB2JLtXr_58z0Av8+8A@mail.gmail.com/
  [4]: https://lore.kernel.org/bpf/20210805230046.28715-1-maciej.fijalkowski@intel.com/T/#t

Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20210731055738.16820-4-joamaki@gmail.com
2021-08-09 23:20:14 +02:00
..
9p 9p: apply review requests for fid refcounting 2020-11-19 17:21:34 +01:00
bluetooth Bluetooth: Fix Set Extended (Scan Response) Data 2021-06-26 07:12:44 +02:00
caif net: remove the caif_hsi driver 2021-07-01 13:19:48 -07:00
iucv net/af_iucv: don't track individual TX skbs for TRANS_HIPER sockets 2021-01-28 20:36:21 -08:00
netfilter netfilter: conntrack: nf_ct_gre_keymap_flush() removal 2021-07-02 02:07:01 +02:00
netns mctp: Allow per-netns default networks 2021-07-29 15:06:50 +01:00
nfc nfc: hci: pass callback data param as pointer in nci_request() 2021-08-02 15:11:37 +01:00
phonet
sctp sctp: send pmtu probe only if packet loss in Search Complete state 2021-07-25 23:06:02 +01:00
tc_act net/sched: act_vlan: Fix modify to allow 0 2021-06-01 16:54:42 -07:00
6lowpan.h
act_api.h net_sched: refactor TC action init API 2021-08-02 10:24:38 +01:00
addrconf.h net: bridge: mcast: fix broken length + header check for MRDv6 Adv. 2021-04-27 14:02:06 -07:00
af_ieee802154.h
af_rxrpc.h afs: Don't truncate iter during data fetch 2021-04-23 10:17:26 +01:00
af_unix.h af_unix: Implement unix_dgram_bpf_recvmsg() 2021-07-15 18:17:50 -07:00
af_vsock.h af_vsock: rest of SEQPACKET support 2021-06-11 13:32:46 -07:00
ah.h
arp.h
atmclip.h
ax25.h
ax88796.h
bareudp.h bareudp: Reverted support to enable & disable rx metadata collection 2020-07-21 18:30:47 -07:00
bond_3ad.h
bond_alb.h bonding/alb: Add helper functions to get the xmit slave 2020-05-01 12:15:37 -07:00
bond_options.h
bonding.h net, bonding: Add XDP support to the bonding driver 2021-08-09 23:20:14 +02:00
bpf_sk_storage.h bpf: struct sock is declared twice in bpf_sk_storage header 2021-03-26 17:43:55 +01:00
busy_poll.h net: annotate data race around sk_ll_usec 2021-07-01 11:23:50 -07:00
calipso.h
cfg80211-wext.h
cfg80211.h Lots of changes: 2021-06-28 13:06:12 -07:00
cfg802154.h
checksum.h csum_and_copy_to_iter(): massage into form closer to csum_and_copy_from_iter() 2021-06-10 11:45:14 -04:00
cipso_ipv4.h cipso: Remove unused inline functions 2020-07-15 07:45:24 -07:00
cls_cgroup.h
codel_impl.h
codel_qdisc.h
codel.h
compat.h compat: always include linux/compat.h from net/compat.h 2020-11-23 13:31:54 -08:00
datalink.h
dcbevent.h
dcbnl.h
devlink.h devlink: Allocate devlink directly in requested net namespace 2021-07-30 13:16:38 -07:00
dn_dev.h
dn_fib.h net: convert fib_treeref from int to refcount_t 2021-07-30 15:33:24 +02:00
dn_neigh.h
dn_nsp.h
dn_route.h
dn.h
dsa.h net: dsa: remove the struct packet_type argument from dsa_device_ops::rcv() 2021-08-02 15:13:15 +01:00
dsfield.h
dst_cache.h
dst_metadata.h net: validate lwtstate->data before returning from skb_tunnel_info() 2021-07-09 13:55:53 -07:00
dst_ops.h net/dst: use a smaller percpu_counter batch for dst entries accounting 2020-05-08 21:33:33 -07:00
dst.h sk_buff: track dst status in slow_gro 2021-07-29 12:18:11 +01:00
erspan.h erspan: Add type I version 0 support. 2020-05-05 13:23:29 -07:00
esp.h
espintcp.h
ethoc.h
failover.h
fib_notifier.h
fib_rules.h fib: use indirect call wrappers in the most common fib_rules_ops 2020-07-28 17:42:31 -07:00
firewire.h
flow_dissector.h flow_dissector: constify raw input data argument 2021-03-14 14:46:32 -07:00
flow_offload.h flow_offload: action should not be NULL when it is referenced 2021-06-28 14:24:06 -07:00
flow.h flow: remove spi key from flowi struct 2021-04-19 12:25:11 +02:00
fou.h
fq_impl.h net/fq_impl: do not maintain a backlog-sorted list of flows 2021-01-21 13:33:45 +01:00
fq.h net/fq_impl: do not maintain a backlog-sorted list of flows 2021-01-21 13:33:45 +01:00
garp.h
gen_stats.h
genetlink.h mptcp: avoid lock_fast usage in accept path 2021-02-12 16:31:46 -08:00
geneve.h
gre.h ip_gre: add csum offload support for gre header 2021-01-29 20:39:14 -08:00
gro_cells.h
gro.h gro: add combined call_gro_receive() + INDIRECT_CALL_INET() helper 2021-03-18 19:51:12 -07:00
gtp.h
gue.h GUE: Fix a typo 2020-06-22 21:12:44 -07:00
hwbm.h
icmp.h ipv6: ICMPV6: add response to ICMPV6 RFC 8335 PROBE messages 2021-06-28 14:29:45 -07:00
ieee80211_radiotap.h mac80211: add radiotap flag to assure frames are not reordered 2020-11-06 11:01:01 +01:00
ieee802154_netdev.h
if_inet6.h mld: add mc_lock for protecting per-interface mld data 2021-03-26 15:14:56 -07:00
ife.h
ila.h
inet6_connection_sock.h
inet6_hashtables.h
inet_common.h bpf: Allow rewriting to ports under ip_unprivileged_port_start 2021-01-27 18:18:15 -08:00
inet_connection_sock.h tcp: change ICSK_CA_PRIV_SIZE definition 2021-06-29 11:54:36 -07:00
inet_ecn.h inet_ecn: Use csum16_add() helper for IP_ECN_set_* helpers 2020-12-14 18:38:58 -08:00
inet_frag.h inet: frags: batch fqdir destroy works 2020-12-12 15:08:54 -08:00
inet_hashtables.h tcp: seq_file: Replace listening_hash with lhash2 2021-07-23 16:44:57 -07:00
inet_sock.h inet: remove inet_sk_copy_descendant() 2020-08-26 07:33:19 -07:00
inet_timewait_sock.h
inetpeer.h
ioam6.h ipv6: ioam: Support for IOAM injection with lwtunnels 2021-07-21 08:14:33 -07:00
ip6_checksum.h tcp: remove indirect calls for icsk->icsk_af_ops->send_check 2020-06-20 17:47:53 -07:00
ip6_fib.h IPv6: Add "offload failed" indication to routes 2021-02-08 16:47:03 -08:00
ip6_route.h net: ipv6: introduce ip6_dst_mtu_maybe_forward 2021-07-21 08:22:02 -07:00
ip6_tunnel.h
ip_fib.h net: convert fib_treeref from int to refcount_t 2021-07-30 15:33:24 +02:00
ip_tunnels.h ip_tunnel: use ndo_siocdevprivate 2021-07-27 20:11:44 +01:00
ip_vs.h netfilter: move handlers to net/ip_vs.h 2021-02-04 18:37:57 -08:00
ip.h net: ipv4: Consolidate ipv4_mtu and ip_dst_mtu_maybe_forward 2021-07-21 08:22:03 -07:00
ipcomp.h
ipconfig.h
ipv6_frag.h ipv6: Remove dependency of ipv6_frag_thdr_truncated on ipv6 module 2020-11-19 10:49:50 -08:00
ipv6_stubs.h ipv6: add ipv6_dev_find to stubs 2021-03-30 13:29:39 -07:00
ipv6.h ipv6: Add a sysctl to control multipath hash fields 2021-05-18 13:27:32 -07:00
ipx.h
iw_handler.h
kcm.h
l3mdev.h l3mdev: add infrastructure for table to VRF mapping 2020-06-20 17:22:22 -07:00
lag.h
lapb.h net: lapb: Make "lapb_t1timer_running" able to detect an already running timer 2021-03-23 14:14:50 -07:00
lib80211.h
llc_c_ac.h
llc_c_ev.h
llc_c_st.h
llc_conn.h
llc_if.h
llc_pdu.h net: llc: fix skb_over_panic 2021-07-27 13:05:56 +01:00
llc_s_ac.h
llc_s_ev.h
llc_s_st.h
llc_sap.h
llc.h
lwtunnel.h
mac80211.h mac80211: Switch to a virtual time-based airtime scheduler 2021-06-23 18:12:00 +02:00
mac802154.h
macsec.h net: macsec: fix the length used to copy the key for offloading 2021-06-24 12:41:12 -07:00
mctp.h mctp: Allow per-netns default networks 2021-07-29 15:06:50 +01:00
mctpdevice.h mctp: Add neighbour implementation 2021-07-29 15:06:50 +01:00
mip6.h
mld.h mld: add new workqueues for process mld events 2021-03-26 15:14:56 -07:00
mpls_iptunnel.h
mpls.h net: Make mpls_entry_encode() available for generic users 2020-05-29 21:20:20 -07:00
mptcp.h mptcp: avoid processing packet if a subflow reset 2021-07-09 18:38:53 -07:00
mrp.h
ncsi.h
ndisc.h ipv6: ndisc: adjust ndisc_ifinfo_sysctl_change prototype 2020-08-24 06:40:07 -07:00
neighbour.h net: Exempt multicast addresses from five-second neighbor lifetime 2020-11-13 14:24:39 -08:00
net_failover.h
net_namespace.h mctp: Add initial routing framework 2021-07-29 15:06:50 +01:00
net_ratelimit.h
netevent.h
netlabel.h
netlink.h net: netlink: add the case when nlh is NULL 2021-07-27 11:43:50 +01:00
netprio_cgroup.h
netrom.h
nexthop.h nexthop: Rename artifacts related to legacy multipath nexthop groups 2021-03-28 17:53:39 -07:00
nl802154.h
nsh.h
p8022.h
page_pool.h page_pool: Allow drivers to hint on SKB recycling 2021-06-07 14:11:47 -07:00
pie.h
ping.h
pkt_cls.h net_sched: refactor TC action init API 2021-08-02 10:24:38 +01:00
pkt_sched.h net: sched: fix tx action rescheduling issue during deactivation 2021-05-14 15:05:46 -07:00
pptp.h
protocol.h net: Remove the member netns_ok 2021-05-17 15:29:35 -07:00
psample.h psample: Add additional metadata attributes 2021-03-14 15:00:43 -07:00
psnap.h
raw.h
rawv6.h
red.h sch_red: fix off-by-one checks in red_check_params() 2021-03-25 17:40:43 -07:00
regulatory.h net/wireless: regulatory.h: drop duplicate word in comment 2020-07-31 09:24:23 +02:00
request_sock.h tcp: bpf: Optionally store mac header in TCP_SAVE_SYN 2020-08-24 14:35:00 -07:00
rose.h
route.h lsm,selinux: pass flowi_common instead of flowi to the LSM hooks 2020-11-23 18:36:21 -05:00
rpl.h net: ipv6: Use struct_size() helper and kcalloc() 2020-06-23 20:27:09 -07:00
rsi_91x.h
rtnetlink.h rtnetlink: add alloc() method to rtnl_link_ops 2021-06-12 13:16:45 -07:00
rtnh.h
sch_generic.h net_sched: refactor TC action init API 2021-08-02 10:24:38 +01:00
scm.h fs: Move __scm_install_fd() to __receive_fd() 2020-07-13 11:03:44 -07:00
secure_seq.h
seg6_hmac.h
seg6_local.h
seg6.h seg6: fix seg6_validate_srh() to avoid slab-out-of-bounds 2020-06-04 15:39:32 -07:00
selftests.h net: selftest: fix build issue if INET is disabled 2021-04-28 14:06:45 -07:00
slhc_vj.h
smc.h net/smc: introduce CHID callback for ISM devices 2020-09-28 15:19:03 -07:00
snmp.h
sock_reuseport.h tcp: Add reuseport_migrate_sock() to select a new listener. 2021-06-15 18:01:05 +02:00
sock.h skbuff: allow 'slow_gro' for skb carring sock reference 2021-07-29 12:18:12 +01:00
Space.h
stp.h
strparser.h
switchdev.h net: switchdev: remove stray semicolon in switchdev_handle_fdb_del_to_device shim 2021-07-21 08:00:16 -07:00
tcp_states.h
tcp.h Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next 2021-07-31 11:23:26 -07:00
timewait_sock.h
tipc.h
tls_toe.h
tls.h net: sock: introduce sk_error_report 2021-06-29 11:28:21 -07:00
transp_v6.h tcp: move ipv4_specific to tcp include file 2020-06-23 20:10:15 -07:00
tso.h net: tso: cache transport header length 2020-06-18 20:46:23 -07:00
tun_proto.h
udp_tunnel.h udp: call udp_encap_enable for v6 sockets when enabling encap 2021-02-04 18:37:14 -08:00
udp.h skmsg: Pass psock pointer to ->psock_update_sk_prot() 2021-04-12 17:34:27 +02:00
udplite.h
vsock_addr.h
vxlan.h net: sched: only keep the available bits when setting vxlan md->gbp 2020-09-14 16:49:39 -07:00
wext.h
x25.h
x25device.h
xdp_priv.h
xdp_sock_drv.h xsk: Introduce batched Tx descriptor interfaces 2020-11-17 22:07:40 +01:00
xdp_sock.h xdp: Add proper __rcu annotations to redirect map entries 2021-06-24 19:41:15 +02:00
xdp.h bpf: Add function for XDP meta data length check 2021-07-07 19:51:12 -07:00
xfrm.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-06-29 15:45:27 -07:00
xsk_buff_pool.h xsk: Fix missing validation for skb and unaligned mode 2021-06-18 16:57:19 +02:00