pwru: 一款基于 eBPF 的细粒度网络数据包排查工具

在 eBPF(Extended Berkeley Packet Filter)中,pwru 的主要功能是进行性能分析,它可以帮助理解系统中各种内核和用户空间任务的行为。这是通过在内核中执行 eBPF 程序并收集有关系统行为的数据来实现的。

Posted by 陈谭军 on Saturday, November 11, 2023 | 阅读 |,阅读约 10 分钟

介绍

Linux 中的网络是个很复杂的问题!!!我们常常会遇见以下问题。

  • 网络命名空间(network)使网络变得更加复杂;
  • 当一个数据包丢失时,作为网络工程师,如何确切知道问题出在Linux内核的哪个地方;
  • 一旦融合eBPF和XDP网络程序,如何确切知道需要查看的所有可能的代码路径?
  • 如何知道你不知道的事情?

pwru 是 Cilium 推出的基于 eBPF 开发的网络数据包排查工具,它提供了更细粒度的网络数据包排查方案。本文将介绍 pwru 的使用方法和经典场景,并介绍其实现原理。https://github.com/cilium/pwru 是一种基于eBPF的工具,用于在Linux内核中追踪网络数据包,并具有高级过滤能力。它允许对内核状态进行细粒度的跟踪,并可以用来调试网络连接性问题。

部署与安装

pwru 需要 5.3+ 版本的内核才能运行,对于 –output-skb,需要 5.9+ 版本的内核,对于 –backend=kprobe-multi,需要 5.18+ 版本的内核。 debugfs 必须挂载在 /sys/kernel/debug 下,如果该文件夹为空,可以使用以下命令进行挂载:

mount -t debugfs none /sys/kernel/debug

需要以下的内核配置:

OptionBackendNote
CONFIG_DEBUG_INFO_BTF=ybothavailable since >= 5.3
CONFIG_KPROBES=yboth
CONFIG_PERF_EVENTS=yboth
CONFIG_BPF=yboth
CONFIG_BPF_SYSCALL=yboth
CONFIG_FUNCTION_TRACER=ykprobe-multi/sys/kernel/debug/tracing/available_filter_functions
CONFIG_FPROBE=ykprobe-multiavailable since >= 5.18

可以从以下页面 https://github.com/cilium/pwru/releases 下载x86_64和arm64可执行文件。

root@instance-820epr0w:~/tanjunchen# ./pwru --help
Usage: ./pwru [options] [pcap-filter]
    Available pcap-filter: see "man 7 pcap-filter"
    Available options:
      --all-kmods                 attach to all available kernel modules
      --backend string            Tracing backend('kprobe', 'kprobe-multi'). Will auto-detect if not specified.
      --filter-func string        filter kernel functions to be probed by name (exact match, supports RE2 regular expression)
      --filter-ifname string      filter skb ifname in --filter-netns (if not specified, use current netns)
      --filter-mark uint32        filter skb mark
      --filter-netns string       filter netns ("/proc/<pid>/ns/net", "inode:<inode>")
      --filter-trace-tc           trace TC bpf progs
      --filter-track-skb          trace a packet even if it does not match given filters (e.g., after NAT or tunnel decapsulation)
  -h, --help                      display this message and exit
      --kernel-btf string         specify kernel BTF file
      --kmods strings             list of kernel modules names to attach to
      --output-file string        write traces to file
      --output-limit-lines uint   exit the program after the number of events has been received/printed
      --output-meta               print skb metadata
      --output-skb                print skb
      --output-stack              print stack
      --output-tuple              print L4 tuple
      --timestamp string          print timestamp per skb ("current", "relative", "absolute", "none") (default "none")
      --version                   show pwru version and exit

更多的详情可参考 https://github.com/cilium/pwru

使用示例

环境基础信息如下所示:

root@instance-820epr0w:~/tanjunchen# uname -a
Linux instance-820epr0w 5.15.0-72-generic #79-Ubuntu SMP Wed Apr 19 08:22:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

以下示例展示修改 netfilter 表规则后,如何定位 curl 请求的数据包被丢弃。我们要测试的 iptables 规则如下所示:

iptables -t filter -I OUTPUT 1 -m tcp --proto tcp --dst 1.1.1.1/32 -j DROP
iptables -t filter -D OUTPUT -m tcp --proto tcp --dst 1.1.1.1/32 -j DROP

没有添加 iptables 规则前,pwru 的输出如下所示:

root@instance-820epr0w:~/tanjunchen# ./pwru 'dst host 1.1.1.1 and tcp and dst port 80'
2024/03/18 11:19:07 Attaching kprobes (via kprobe)...
1477 / 1477 [---------------------------------------------------------------------------------------------------] 100.00% 80 p/s
2024/03/18 11:19:26 Attached (ignored 0)
2024/03/18 11:19:26 Listening for events..
               SKB    CPU          PROCESS                     FUNC
0xff18015deb5c36e0      2  [curl(3378123)]             ip_local_out
0xff18015deb5c36e0      2  [curl(3378123)]           __ip_local_out
0xff18015deb5c36e0      2  [curl(3378123)]             nf_hook_slow
0xff18015deb5c36e0      2  [curl(3378123)]                ip_output
0xff18015deb5c36e0      2  [curl(3378123)]             nf_hook_slow
0xff18015deb5c36e0      2  [curl(3378123)]  apparmor_ipv4_postroute
0xff18015deb5c36e0      2  [curl(3378123)]         ip_finish_output
0xff18015deb5c36e0      2  [curl(3378123)] __cgroup_bpf_run_filter_skb
0xff18015deb5c36e0      2  [curl(3378123)]       __ip_finish_output
0xff18015deb5c36e0      2  [curl(3378123)]        ip_finish_output2
0xff18015deb5c36e0      2  [curl(3378123)]           dev_queue_xmit
0xff18015deb5c36e0      2  [curl(3378123)]         __dev_queue_xmit
0xff18015deb5c36e0      2  [curl(3378123)]       qdisc_pkt_len_init
0xff18015deb5c36e0      2  [curl(3378123)]      netdev_core_pick_tx
0xff18015deb5c36e0      2  [curl(3378123)]           netdev_pick_tx
0xff18015deb5c36e0      2  [curl(3378123)]      __get_xps_queue_idx
0xff18015deb5c36e0      2  [curl(3378123)]          sch_direct_xmit
0xff18015deb5c36e0      2  [curl(3378123)]   validate_xmit_skb_list
0xff18015deb5c36e0      2  [curl(3378123)]        validate_xmit_skb
0xff18015deb5c36e0      2  [curl(3378123)]       netif_skb_features
0xff18015deb5c36e0      2  [curl(3378123)]  passthru_features_check
0xff18015deb5c36e0      2  [curl(3378123)]     skb_network_protocol
0xff18015deb5c36e0      2  [curl(3378123)]       validate_xmit_vlan
0xff18015deb5c36e0      2  [curl(3378123)]  skb_csum_hwoffload_help
0xff18015deb5c36e0      2  [curl(3378123)]       validate_xmit_xfrm
0xff18015deb5c36e0      2  [curl(3378123)]      dev_hard_start_xmit
0xff18015deb5c36e0      2  [curl(3378123)]   skb_clone_tx_timestamp
0xff18015deb5c36e0      2  [curl(3378123)]             skb_to_sgvec
0xff18015deb5c36e0      2  [curl(3378123)]           __skb_to_sgvec
0xff18015deb5c36e0      4     [<empty>(0)]         napi_consume_skb
0xff18015deb5c36e0      4     [<empty>(0)]   skb_release_head_state
0xff18015deb5c36e0      4     [<empty>(0)]                tcp_wfree
0xff18015deb5c36e0      4     [<empty>(0)]         skb_release_data
0xff18015deb5c36e0      4     [<empty>(0)]             kfree_skbmem
0xff18015e9c940100      4     [<empty>(0)]             ip_local_out
0xff18015e9c940100      4     [<empty>(0)]           __ip_local_out
0xff18015e9c940100      4     [<empty>(0)]             nf_hook_slow
0xff18015e9c940100      4     [<empty>(0)]                ip_output
0xff18015e9c940100      4     [<empty>(0)]             nf_hook_slow
0xff18015e9c940100      4     [<empty>(0)]  apparmor_ipv4_postroute
0xff18015e9c940100      4     [<empty>(0)]         ip_finish_output
0xff18015e9c940100      4     [<empty>(0)] __cgroup_bpf_run_filter_skb
0xff18015e9c940100      4     [<empty>(0)]       __ip_finish_output
0xff18015e9c940100      4     [<empty>(0)]        ip_finish_output2
0xff18015e9c940100      4     [<empty>(0)]           dev_queue_xmit
0xff18015e9c940100      4     [<empty>(0)]         __dev_queue_xmit
0xff18015e9c940100      4     [<empty>(0)]       qdisc_pkt_len_init
0xff18015e9c940100      4     [<empty>(0)]      netdev_core_pick_tx
0xff18015e9c940100      4     [<empty>(0)]           netdev_pick_tx
0xff18015e9c940100      4     [<empty>(0)]      __get_xps_queue_idx
0xff18015e9c940100      4     [<empty>(0)]          sch_direct_xmit
0xff18015e9c940100      4     [<empty>(0)]   validate_xmit_skb_list
0xff18015e9c940100      4     [<empty>(0)]        validate_xmit_skb
0xff18015e9c940100      4     [<empty>(0)]       netif_skb_features
0xff18015e9c940100      4     [<empty>(0)]  passthru_features_check
0xff18015e9c940100      4     [<empty>(0)]     skb_network_protocol
0xff18015e9c940100      4     [<empty>(0)]       validate_xmit_vlan
0xff18015e9c940100      4     [<empty>(0)]  skb_csum_hwoffload_help
0xff18015e9c940100      4     [<empty>(0)]       validate_xmit_xfrm
0xff18015e9c940100      4     [<empty>(0)]      dev_hard_start_xmit
0xff18015e9c940100      4     [<empty>(0)]   skb_clone_tx_timestamp
0xff18015e9c940100      4     [<empty>(0)]             skb_to_sgvec
0xff18015e9c940100      4     [<empty>(0)]           __skb_to_sgvec
0xff18015e9c940100      1     [<empty>(0)]         napi_consume_skb
0xff18015e9c940100      1     [<empty>(0)]   skb_release_head_state
0xff18015e9c940100      1     [<empty>(0)]             __sock_wfree
0xff18015e9c940100      1     [<empty>(0)]         skb_release_data
0xff18015e9c940100      1     [<empty>(0)]            skb_free_head
2024/03/18 11:19:36 Received signal, exiting program..

在设置 iptables 规则之后,输出如下所示:

root@instance-820epr0w:~/tanjunchen# iptables -t filter -I OUTPUT 1 -m tcp --proto tcp --dst 1.1.1.1/32 -j DROP
root@instance-820epr0w:~/tanjunchen# ./pwru 'dst host 1.1.1.1 and tcp and dst port 80'
2024/03/18 11:22:13 Attaching kprobes (via kprobe)...
1477 / 1477 [---------------------------------------------------------------------------------------------------] 100.00% 80 p/s
2024/03/18 11:22:31 Attached (ignored 0)
2024/03/18 11:22:31 Listening for events..
               SKB    CPU          PROCESS                     FUNC
0xff180162daa02ae0      0  [curl(3381789)]             ip_local_out
0xff180162daa02ae0      0  [curl(3381789)]           __ip_local_out
0xff180162daa02ae0      0  [curl(3381789)]             nf_hook_slow
0xff180162daa02ae0      0  [curl(3381789)] kfree_skb_reason(SKB_DROP_REASON_NETFILTER_DROP)
0xff180162daa02ae0      0  [curl(3381789)]   skb_release_head_state
0xff180162daa02ae0      0  [curl(3381789)]                tcp_wfree
0xff180162daa02ae0      0  [curl(3381789)]         skb_release_data
0xff180162daa02ae0      0  [curl(3381789)]             kfree_skbmem
0xff180162daa02ae0      0     [<empty>(0)]              __skb_clone
0xff180162daa02ae0      0     [<empty>(0)]        __copy_skb_header
0xff180162daa02ae0      0     [<empty>(0)]             ip_local_out
0xff180162daa02ae0      0     [<empty>(0)]           __ip_local_out
0xff180162daa02ae0      0     [<empty>(0)]             nf_hook_slow
0xff180162daa02ae0      0     [<empty>(0)] kfree_skb_reason(SKB_DROP_REASON_NETFILTER_DROP)
0xff180162daa02ae0      0     [<empty>(0)]   skb_release_head_state
0xff180162daa02ae0      0     [<empty>(0)]                tcp_wfree
0xff180162daa02ae0      0     [<empty>(0)]         skb_release_data
0xff180162daa02ae0      0     [<empty>(0)]             kfree_skbmem
0xff180162daa02ae0      0     [<empty>(0)]              __skb_clone
0xff180162daa02ae0      0     [<empty>(0)]        __copy_skb_header
0xff180162daa02ae0      0     [<empty>(0)]             ip_local_out
0xff180162daa02ae0      0     [<empty>(0)]           __ip_local_out
0xff180162daa02ae0      0     [<empty>(0)]             nf_hook_slow
0xff180162daa02ae0      0     [<empty>(0)] kfree_skb_reason(SKB_DROP_REASON_NETFILTER_DROP)
0xff180162daa02ae0      0     [<empty>(0)]   skb_release_head_state
0xff180162daa02ae0      0     [<empty>(0)]                tcp_wfree
0xff180162daa02ae0      0     [<empty>(0)]         skb_release_data
0xff180162daa02ae0      0     [<empty>(0)]             kfree_skbmem
^C2024/03/18 11:22:43 Received signal, exiting program..
2024/03/18 11:22:43 Detaching kprobes...

在 nf_hook_slow 处看到 kfree_skb_reason(SKB_DROP_REASON_NETFILTER_DROP) 函数被调用,出现了丢包。 上述输出内容中,有很多重点的内核函数,如下所示:

  1. ip_local_out和__ip_local_out:这两个函数主要用于处理本地生成的IP数据包,它们将数据包传递给路由子系统以确定数据包的目标,然后将数据包加入到输出队列等待传输。
  2. nf_hook_slow:这是netfilter框架中的一个函数,它用于调用在网络钩子上注册的所有函数,以便将数据包传递给它们进行处理。这是实现防火墙功能的关键部分。
  3. ip_output:这个函数用于处理需要通过网络接口发送的IP数据包。
  4. apparmor_ipv4_postroute:这是AppArmor安全模块的一部分,用于处理路由后的IPv4数据包。
  5. ip_finish_output:这个函数负责将路由后的数据包发送到网络接口。
  6. __cgroup_bpf_run_filter_skb:这个函数用于运行eBPF程序,处理sk_buff结构的网络数据包。
  7. dev_queue_xmit和__dev_queue_xmit:这两个函数负责将数据包加入到设备的发送队列。
  8. qdisc_pkt_len_init:这个函数用于初始化数据包的队列规则。
  9. netdev_core_pick_tx和netdev_pick_tx:这两个函数用于在多队列网络设备中选择适合发送数据包的队列。
  10. __get_xps_queue_idx:此函数用于获取XPS队列索引。
  11. sch_direct_xmit:此函数用于直接发送数据包。
  12. validate_xmit_skb_list和validate_xmit_skb:这些函数用于验证要发送的数据包的各种特性和条件。
  13. netif_skb_features:这个函数用于获取网络设备支持的数据包特性。
  14. skb_network_protocol:此函数用于获取数据包的网络协议。
  15. dev_hard_start_xmit:此函数用于开始对数据包的硬件传输。
  16. skb_clone_tx_timestamp:此函数用于复制数据包的发送时间戳。
  17. skb_to_sgvec和__skb_to_sgvec:这两个函数用于将数据包从sk_buff结构转换为scatter/gather向量,用于DMA数据传输。
  18. napi_consume_skb:此函数用于在数据包处理完成后释放sk_buff结构。
  19. skb_release_head_state和skb_release_data:这两个函数用于释放sk_buff结构的头部和数据部分。
  20. tcp_wfree:此函数用于在TCP数据包发送后释放其内存。
  21. kfree_skbmem:此函数用于释放sk_buff结构占用的内存。
  22. kfree_skb_reason(SKB_DROP_REASON_NETFILTER_DROP):这个函数用于释放由于被Netfilter(Linux内核网络过滤模块)丢弃的网络数据包所占用的内存资源。SKB_DROP_REASON_NETFILTER_DROP是一个常量,它指示网络数据包是由于Netfilter而被丢弃的。
  23. skb_release_head_state:这个函数用于释放网络数据包的头部状态。在网络数据包被处理后,协议栈会调用这个函数来释放数据包头部状态所占用的资源。
  24. tcp_wfree:这个函数用于在TCP数据包被发送后释放其内存。在TCP数据包被成功发送后,协议栈会调用这个函数来释放数据包所占用的内存资源。
  25. skb_release_data:这个函数用于释放网络数据包的数据部分。在数据包被处理后,协议栈会调用这个函数来释放数据部分所占用的资源。

Kernel 5.15 内核 nf_hook_slow 的源码如下所示:(kfree_skb_reason 是在 5.10 版本的内核中被引入的)。

/* Returns 1 if okfn() needs to be executed by the caller,
 * -EPERM for NF_DROP, 0 otherwise.  Caller must hold rcu_read_lock. */
int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state,
		 const struct nf_hook_entries *e, unsigned int s)
{
	unsigned int verdict;
	int ret;

	for (; s < e->num_hook_entries; s++) {
		verdict = nf_hook_entry_hookfn(&e->hooks[s], skb, state);
		switch (verdict & NF_VERDICT_MASK) {
		case NF_ACCEPT:
			break;
		case NF_DROP:
            // 内存 kfree_skb_reason 
			kfree_skb_reason(skb,
					 SKB_DROP_REASON_NETFILTER_DROP);
			ret = NF_DROP_GETERR(verdict);
			if (ret == 0)
				ret = -EPERM;
			return ret;
		case NF_QUEUE:
			ret = nf_queue(skb, state, s, verdict);
			if (ret == 1)
				continue;
			return ret;
		default:
			/* Implicit handling for NF_STOLEN, as well as any other
			 * non conventional verdicts.
			 */
			return 0;
		}
	}

	return 1;
}

Kernel 5.10 内核 nf_hook_slow 的源码如下所示:

/* Returns 1 if okfn() needs to be executed by the caller,
 * -EPERM for NF_DROP, 0 otherwise.  Caller must hold rcu_read_lock. */
int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state,
		 const struct nf_hook_entries *e, unsigned int s)
{
	unsigned int verdict;
	int ret;

	for (; s < e->num_hook_entries; s++) {
		verdict = nf_hook_entry_hookfn(&e->hooks[s], skb, state);
		switch (verdict & NF_VERDICT_MASK) {
		case NF_ACCEPT:
			break;
		case NF_DROP:
            // 内存 kfree_skb  
			kfree_skb(skb);
			ret = NF_DROP_GETERR(verdict);
			if (ret == 0)
				ret = -EPERM;
			return ret;
		case NF_QUEUE:
			ret = nf_queue(skb, state, s, verdict);
			if (ret == 1)
				continue;
			return ret;
		default:
			/* Implicit handling for NF_STOLEN, as well as any other
			 * non conventional verdicts.
			 */
			return 0;
		}
	}
	return 1;
}

原理与源码

pwru 本质上是向 kprobe 注册了一些 eBPF 代码,根据 pwru 传入的参数可以更新 eBPF Map,改变限制条件,从而更新输出。在 FilterCfg 里面设置了过滤的 IP 地址和协议等条件。

type FilterCfg struct {
	FilterNetns   uint32
	FilterMark    uint32
	FilterIfindex uint32

	// TODO: if there are more options later, then you can consider using a bit map
	OutputMeta  uint8
	OutputTuple uint8
	OutputSkb   uint8
	OutputStack uint8

	IsSet    byte
	TrackSkb byte
}

会根据 pwru 传入的参数更新过滤条件。

func GetConfig(flags *Flags) (cfg FilterCfg, err error) {
	cfg = FilterCfg{
		FilterMark: flags.FilterMark,
		IsSet:      1,
	}
	if flags.OutputSkb {
		cfg.OutputSkb = 1
	}
	if flags.OutputMeta {
		cfg.OutputMeta = 1
	}
	if flags.OutputTuple {
		cfg.OutputTuple = 1
	}
	if flags.OutputStack {
		cfg.OutputStack = 1
	}
	if flags.FilterTrackSkb {
		cfg.TrackSkb = 1
	}

	netnsID, ns, err := parseNetns(flags.FilterNetns)
	if err != nil {
		err = fmt.Errorf("Failed to retrieve netns %s: %w", flags.FilterNetns, err)
		return
	}
	if flags.FilterIfname != "" || flags.FilterNetns != "" {
		cfg.FilterNetns = netnsID
	}
	if cfg.FilterIfindex, err = parseIfindex(flags.FilterIfname, ns); err != nil {
		return
	}
	return
}

eBPF 内核态的代码如下所示:

// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
/* Copyright Martynas Pumputis */
/* Copyright Authors of Cilium */

#include "vmlinux.h"
#include "bpf/bpf_helpers.h"
#include "bpf/bpf_core_read.h"
#include "bpf/bpf_tracing.h"
#include "bpf/bpf_ipv6.h"

#define PRINT_SKB_STR_SIZE    2048

#define ETH_P_IP              0x800
#define ETH_P_IPV6            0x86dd

const static bool TRUE = true;

volatile const static __u64 BPF_PROG_ADDR = 0;

union addr {
	u32 v4addr;
	struct {
		u64 d1;
		u64 d2;
	} v6addr;
} __attribute__((packed));

struct skb_meta {
	u32 netns;
	u32 mark;
	u32 ifindex;
	u32 len;
	u32 mtu;
	u16 protocol;
	u16 pad;
} __attribute__((packed));

struct tuple {
	union addr saddr;
	union addr daddr;
	u16 sport;
	u16 dport;
	u16 l3_proto;
	u8 l4_proto;
	u8 pad;
} __attribute__((packed));

u64 print_skb_id = 0;

struct event_t {
	u32 pid;
	u32 type;
	u64 addr;
	u64 skb_addr;
	u64 ts;
	typeof(print_skb_id) print_skb_id;
	struct skb_meta meta;
	struct tuple tuple;
	s64 print_stack_id;
	u64 param_second;
	u32 cpu_id;
} __attribute__((packed));

#define MAX_QUEUE_ENTRIES 10000
struct {
	__uint(type, BPF_MAP_TYPE_QUEUE);
	__type(value, struct event_t);
	__uint(max_entries, MAX_QUEUE_ENTRIES);
} events SEC(".maps");

#define MAX_TRACK_SIZE 1024
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__type(key, __u64);
	__type(value, bool);
	__uint(max_entries, MAX_TRACK_SIZE);
} skb_addresses SEC(".maps");

struct config {
	u32 netns;
	u32 mark;
	u32 ifindex;
	u8 output_meta;
	u8 output_tuple;
	u8 output_skb;
	u8 output_stack;
	u8 is_set;
	u8 track_skb;
} __attribute__((packed));

static volatile const struct config CFG;
#define cfg (&CFG)

#define MAX_STACK_DEPTH 50
struct {
	__uint(type, BPF_MAP_TYPE_STACK_TRACE);
	__uint(max_entries, 256);
	__uint(key_size, sizeof(u32));
	__uint(value_size, MAX_STACK_DEPTH * sizeof(u64));
} print_stack_map SEC(".maps");

#ifdef OUTPUT_SKB
struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 256);
	__type(key, u32);
	__type(value, char[PRINT_SKB_STR_SIZE]);
} print_skb_map SEC(".maps");
#endif

static __always_inline u32
get_netns(struct sk_buff *skb) {
	u32 netns = BPF_CORE_READ(skb, dev, nd_net.net, ns.inum);

	// if skb->dev is not initialized, try to get ns from sk->__sk_common.skc_net.net->ns.inum
	if (netns == 0)	{
		struct sock *sk = BPF_CORE_READ(skb, sk);
		if (sk != NULL)	{
			netns = BPF_CORE_READ(sk, __sk_common.skc_net.net, ns.inum);
		}
	}

	return netns;
}

static __always_inline bool
filter_meta(struct sk_buff *skb) {
	if (cfg->netns && get_netns(skb) != cfg->netns) {
			return false;
	}
	if (cfg->mark && BPF_CORE_READ(skb, mark) != cfg->mark) {
		return false;
	}
	if (cfg->ifindex != 0 && BPF_CORE_READ(skb, dev, ifindex) != cfg->ifindex) {
		return false;
	}
	return true;
}

static __noinline bool
filter_pcap_ebpf_l3(void *_skb, void *__skb, void *___skb, void *data, void* data_end)
{
	return data != data_end && _skb == __skb && __skb == ___skb;
}

static __always_inline bool
filter_pcap_l3(struct sk_buff *skb)
{
	void *skb_head = BPF_CORE_READ(skb, head);
	void *data = skb_head + BPF_CORE_READ(skb, network_header);
	void *data_end = skb_head + BPF_CORE_READ(skb, tail);
	return filter_pcap_ebpf_l3((void *)skb, (void *)skb, (void *)skb, data, data_end);
}

static __noinline bool
filter_pcap_ebpf_l2(void *_skb, void *__skb, void *___skb, void *data, void* data_end)
{
	return data != data_end && _skb == __skb && __skb == ___skb;
}

static __always_inline bool
filter_pcap_l2(struct sk_buff *skb)
{
	void *skb_head = BPF_CORE_READ(skb, head);
	void *data = skb_head + BPF_CORE_READ(skb, mac_header);
	void *data_end = skb_head + BPF_CORE_READ(skb, tail);
	return filter_pcap_ebpf_l2((void *)skb, (void *)skb, (void *)skb, data, data_end);
}

static __always_inline bool
filter_pcap(struct sk_buff *skb) {
	if (BPF_CORE_READ(skb, mac_len) == 0)
		return filter_pcap_l3(skb);
	return filter_pcap_l2(skb);
}

static __always_inline bool
filter(struct sk_buff *skb) {
	return filter_pcap(skb) && filter_meta(skb);
}

static __always_inline void
set_meta(struct sk_buff *skb, struct skb_meta *meta) {
	meta->netns = get_netns(skb);
	meta->mark = BPF_CORE_READ(skb, mark);
	meta->len = BPF_CORE_READ(skb, len);
	meta->protocol = BPF_CORE_READ(skb, protocol);
	meta->ifindex = BPF_CORE_READ(skb, dev, ifindex);
	meta->mtu = BPF_CORE_READ(skb, dev, mtu);
}

static __always_inline void
set_tuple(struct sk_buff *skb, struct tuple *tpl) {
	void *skb_head = BPF_CORE_READ(skb, head);
	u16 l3_off = BPF_CORE_READ(skb, network_header);
	u16 l4_off;

	struct iphdr *l3_hdr = (struct iphdr *) (skb_head + l3_off);
	u8 ip_vsn = BPF_CORE_READ_BITFIELD_PROBED(l3_hdr, version);

	if (ip_vsn == 4) {
		struct iphdr *ip4 = (struct iphdr *) l3_hdr;
		BPF_CORE_READ_INTO(&tpl->saddr, ip4, saddr);
		BPF_CORE_READ_INTO(&tpl->daddr, ip4, daddr);
		tpl->l4_proto = BPF_CORE_READ(ip4, protocol);
		tpl->l3_proto = ETH_P_IP;
		l4_off = l3_off + BPF_CORE_READ_BITFIELD_PROBED(ip4, ihl) * 4;

	} else if (ip_vsn == 6) {
		struct ipv6hdr *ip6 = (struct ipv6hdr *) l3_hdr;
		BPF_CORE_READ_INTO(&tpl->saddr, ip6, saddr);
		BPF_CORE_READ_INTO(&tpl->daddr, ip6, daddr);
		tpl->l4_proto = BPF_CORE_READ(ip6, nexthdr); // TODO: ipv6 l4 protocol
		tpl->l3_proto = ETH_P_IPV6;
		l4_off = l3_off + ipv6_hdrlen(ip6);
	}

	if (tpl->l4_proto == IPPROTO_TCP) {
		struct tcphdr *tcp = (struct tcphdr *) (skb_head + l4_off);
		tpl->sport= BPF_CORE_READ(tcp, source);
		tpl->dport= BPF_CORE_READ(tcp, dest);
	} else if (tpl->l4_proto == IPPROTO_UDP) {
		struct udphdr *udp = (struct udphdr *) (skb_head + l4_off);
		tpl->sport= BPF_CORE_READ(udp, source);
		tpl->dport= BPF_CORE_READ(udp, dest);
	}
}

static __always_inline void
set_skb_btf(struct sk_buff *skb, typeof(print_skb_id) *event_id) {
#ifdef OUTPUT_SKB
	static struct btf_ptr p = {};
	typeof(print_skb_id) id;
	char *str;

	p.type_id = bpf_core_type_id_kernel(struct sk_buff);
	p.ptr = skb;
	id = __sync_fetch_and_add(&print_skb_id, 1) % 256;

	str = bpf_map_lookup_elem(&print_skb_map, (u32 *) &id);
	if (!str) {
		return;
	}

	if (bpf_snprintf_btf(str, PRINT_SKB_STR_SIZE, &p, sizeof(p), 0) < 0) {
		return;
	}

	*event_id = id;
#endif
}

static __always_inline void
set_output(void *ctx, struct sk_buff *skb, struct event_t *event) {
	if (cfg->output_meta) {
		set_meta(skb, &event->meta);
	}

	if (cfg->output_tuple) {
		set_tuple(skb, &event->tuple);
	}

	if (cfg->output_skb) {
		set_skb_btf(skb, &event->print_skb_id);
	}

	if (cfg->output_stack) {
		event->print_stack_id = bpf_get_stackid(ctx, &print_stack_map, BPF_F_FAST_STACK_CMP);
	}
}

static __noinline bool
handle_everything(struct sk_buff *skb, void *ctx, struct event_t *event) {
	bool tracked = false;
	u64 skb_addr = (u64) skb;

	if (cfg->is_set) {
		if (cfg->track_skb && bpf_map_lookup_elem(&skb_addresses, &skb_addr)) {
			tracked = true;
			goto cont;
		}

		if (!filter(skb)) {
			return false;
		}

cont:
		set_output(ctx, skb, event);
	}

	if (cfg->track_skb && !tracked) {
		bpf_map_update_elem(&skb_addresses, &skb_addr, &TRUE, BPF_ANY);
	}

	event->pid = bpf_get_current_pid_tgid() >> 32;
	event->ts = bpf_ktime_get_ns();
	event->cpu_id = bpf_get_smp_processor_id();

	return true;
}

static __always_inline int
kprobe_skb(struct sk_buff *skb, struct pt_regs *ctx, bool has_get_func_ip) {
	struct event_t event = {};

	if (!handle_everything(skb, ctx, &event))
		return BPF_OK;

	event.skb_addr = (u64) skb;
	event.addr = has_get_func_ip ? bpf_get_func_ip(ctx) : PT_REGS_IP(ctx);
	event.param_second = PT_REGS_PARM2(ctx);
	bpf_map_push_elem(&events, &event, BPF_EXIST);

	return BPF_OK;
}

#ifdef HAS_KPROBE_MULTI
#define PWRU_KPROBE_TYPE "kprobe.multi"
#define PWRU_HAS_GET_FUNC_IP true
#else
#define PWRU_KPROBE_TYPE "kprobe"
#define PWRU_HAS_GET_FUNC_IP false
#endif /* HAS_KPROBE_MULTI */

#define PWRU_ADD_KPROBE(X)                                                     \
  SEC(PWRU_KPROBE_TYPE "/skb-" #X)                                             \
  int kprobe_skb_##X(struct pt_regs *ctx) {                                    \
    struct sk_buff *skb = (struct sk_buff *) PT_REGS_PARM##X(ctx);             \
    return kprobe_skb(skb, ctx, PWRU_HAS_GET_FUNC_IP);                         \
  }

PWRU_ADD_KPROBE(1)
PWRU_ADD_KPROBE(2)
PWRU_ADD_KPROBE(3)
PWRU_ADD_KPROBE(4)
PWRU_ADD_KPROBE(5)

#undef PWRU_KPROBE
#undef PWRU_HAS_GET_FUNC_IP
#undef PWRU_KPROBE_TYPE

SEC("kprobe/skb_lifetime_termination")
int kprobe_skb_lifetime_termination(struct pt_regs *ctx) {
	u64 skb = (u64) PT_REGS_PARM1(ctx);

	bpf_map_delete_elem(&skb_addresses, &skb);

	return BPF_OK;
}

static __always_inline int
track_skb_clone(u64 old, u64 new) {
	if (bpf_map_lookup_elem(&skb_addresses, &old))
		bpf_map_update_elem(&skb_addresses, &new, &TRUE, BPF_ANY);

	return BPF_OK;
}

SEC("fexit/skb_clone")
int BPF_PROG(fexit_skb_clone, u64 old, gfp_t mask, u64 new) {
	return track_skb_clone(old, new);
}

SEC("fexit/skb_copy")
int BPF_PROG(fexit_skb_copy, u64 old, gfp_t mask, u64 new) {
	return track_skb_clone(old, new);
}

SEC("fentry/tc")
int BPF_PROG(fentry_tc, struct sk_buff *skb) {
	struct event_t event = {};

	if (!handle_everything(skb, ctx, &event))
		return BPF_OK;

	event.skb_addr = (u64) skb;
	event.addr = BPF_PROG_ADDR;
	bpf_map_push_elem(&events, &event, BPF_EXIST);

	return BPF_OK;
}

char __license[] SEC("license") = "Dual BSD/GPL";

我们在上述的代码中看到配置执行了 bpf_map_lookup_elem,然后执行 filter 过滤功能。 pwru 借助 github.com/cilium/ebpf/cmd/bpf2go 进行编译,build.go 源码如下所示:

// SPDX-License-Identifier: Apache-2.0
// Copyright (C) 2021 Authors of Cilium */

//go:generate sh -c "echo Generating for $TARGET_GOARCH"
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go -target $TARGET_GOARCH -cc clang -no-strip KProbePWRU ./bpf/kprobe_pwru.c -- -DOUTPUT_SKB -I./bpf/headers -Wno-address-of-packed-member
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go -target $TARGET_GOARCH -cc clang -no-strip KProbeMultiPWRU ./bpf/kprobe_pwru.c -- -DOUTPUT_SKB -DHAS_KPROBE_MULTI -I./bpf/headers -Wno-address-of-packed-member
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go -target $TARGET_GOARCH -cc clang -no-strip KProbePWRUWithoutOutputSKB ./bpf/kprobe_pwru.c -- -I./bpf/headers -Wno-address-of-packed-member
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go -target $TARGET_GOARCH -cc clang -no-strip KProbeMultiPWRUWithoutOutputSKB ./bpf/kprobe_pwru.c -- -D HAS_KPROBE_MULTI -I./bpf/headers -Wno-address-of-packed-member

package main

用户态使用 perf-event 输出 Map 中的数据如下所示:

func main(){
    printSkbMap := coll.Maps["print_skb_map"]
	printStackMap := coll.Maps["print_stack_map"]
	output, err := pwru.NewOutput(&flags, printSkbMap, printStackMap, addr2name, useKprobeMulti, btfSpec)
	if err != nil {
		log.Fatalf("Failed to create outputer: %s", err)
	}
	defer output.Close()
	output.PrintHeader()

	if !flags.OutputJson {
		output.PrintHeader()
	}

	defer func() {
		select {
		case <-ctx.Done():
			log.Println("Received signal, exiting program..")
		default:
			log.Printf("Printed %d events, exiting program..\n", flags.OutputLimitLines)
		}
	}()
}

参考

  1. https://fosdem.org/2024/events/attachments/fosdem-2024-2364-packet-where-are-you-an-ebpf-based-tool-for-diagnosing-linux-networking/slides/22481/FOSDEM_2024_Packet_Where_aRe_You_HGLqMLe.pdf
  2. https://github.com/cilium/pwru?tab=readme-ov-file