介绍
Linux 中的网络是个很复杂的问题!!!我们常常会遇见以下问题。
- 网络命名空间(network)使网络变得更加复杂;
- 当一个数据包丢失时,作为网络工程师,如何确切知道问题出在Linux内核的哪个地方;
- 一旦融合eBPF和XDP网络程序,如何确切知道需要查看的所有可能的代码路径?
- 如何知道你不知道的事情?
pwru 是 Cilium 推出的基于 eBPF 开发的网络数据包排查工具,它提供了更细粒度的网络数据包排查方案。本文将介绍 pwru 的使用方法和经典场景,并介绍其实现原理。https://github.com/cilium/pwru 是一种基于eBPF的工具,用于在Linux内核中追踪网络数据包,并具有高级过滤能力。它允许对内核状态进行细粒度的跟踪,并可以用来调试网络连接性问题。
部署与安装
pwru 需要 5.3+ 版本的内核才能运行,对于 –output-skb,需要 5.9+ 版本的内核,对于 –backend=kprobe-multi,需要 5.18+ 版本的内核。 debugfs 必须挂载在 /sys/kernel/debug 下,如果该文件夹为空,可以使用以下命令进行挂载:
mount -t debugfs none /sys/kernel/debug
需要以下的内核配置:
Option | Backend | Note |
---|---|---|
CONFIG_DEBUG_INFO_BTF=y | both | available since >= 5.3 |
CONFIG_KPROBES=y | both | |
CONFIG_PERF_EVENTS=y | both | |
CONFIG_BPF=y | both | |
CONFIG_BPF_SYSCALL=y | both | |
CONFIG_FUNCTION_TRACER=y | kprobe-multi | /sys/kernel/debug/tracing/available_filter_functions |
CONFIG_FPROBE=y | kprobe-multi | available since >= 5.18 |
可以从以下页面 https://github.com/cilium/pwru/releases 下载x86_64和arm64可执行文件。
root@instance-820epr0w:~/tanjunchen# ./pwru --help
Usage: ./pwru [options] [pcap-filter]
Available pcap-filter: see "man 7 pcap-filter"
Available options:
--all-kmods attach to all available kernel modules
--backend string Tracing backend('kprobe', 'kprobe-multi'). Will auto-detect if not specified.
--filter-func string filter kernel functions to be probed by name (exact match, supports RE2 regular expression)
--filter-ifname string filter skb ifname in --filter-netns (if not specified, use current netns)
--filter-mark uint32 filter skb mark
--filter-netns string filter netns ("/proc/<pid>/ns/net", "inode:<inode>")
--filter-trace-tc trace TC bpf progs
--filter-track-skb trace a packet even if it does not match given filters (e.g., after NAT or tunnel decapsulation)
-h, --help display this message and exit
--kernel-btf string specify kernel BTF file
--kmods strings list of kernel modules names to attach to
--output-file string write traces to file
--output-limit-lines uint exit the program after the number of events has been received/printed
--output-meta print skb metadata
--output-skb print skb
--output-stack print stack
--output-tuple print L4 tuple
--timestamp string print timestamp per skb ("current", "relative", "absolute", "none") (default "none")
--version show pwru version and exit
更多的详情可参考 https://github.com/cilium/pwru 。
使用示例
环境基础信息如下所示:
root@instance-820epr0w:~/tanjunchen# uname -a
Linux instance-820epr0w 5.15.0-72-generic #79-Ubuntu SMP Wed Apr 19 08:22:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
以下示例展示修改 netfilter 表规则后,如何定位 curl 请求的数据包被丢弃。我们要测试的 iptables 规则如下所示:
iptables -t filter -I OUTPUT 1 -m tcp --proto tcp --dst 1.1.1.1/32 -j DROP
iptables -t filter -D OUTPUT -m tcp --proto tcp --dst 1.1.1.1/32 -j DROP
没有添加 iptables 规则前,pwru 的输出如下所示:
root@instance-820epr0w:~/tanjunchen# ./pwru 'dst host 1.1.1.1 and tcp and dst port 80'
2024/03/18 11:19:07 Attaching kprobes (via kprobe)...
1477 / 1477 [---------------------------------------------------------------------------------------------------] 100.00% 80 p/s
2024/03/18 11:19:26 Attached (ignored 0)
2024/03/18 11:19:26 Listening for events..
SKB CPU PROCESS FUNC
0xff18015deb5c36e0 2 [curl(3378123)] ip_local_out
0xff18015deb5c36e0 2 [curl(3378123)] __ip_local_out
0xff18015deb5c36e0 2 [curl(3378123)] nf_hook_slow
0xff18015deb5c36e0 2 [curl(3378123)] ip_output
0xff18015deb5c36e0 2 [curl(3378123)] nf_hook_slow
0xff18015deb5c36e0 2 [curl(3378123)] apparmor_ipv4_postroute
0xff18015deb5c36e0 2 [curl(3378123)] ip_finish_output
0xff18015deb5c36e0 2 [curl(3378123)] __cgroup_bpf_run_filter_skb
0xff18015deb5c36e0 2 [curl(3378123)] __ip_finish_output
0xff18015deb5c36e0 2 [curl(3378123)] ip_finish_output2
0xff18015deb5c36e0 2 [curl(3378123)] dev_queue_xmit
0xff18015deb5c36e0 2 [curl(3378123)] __dev_queue_xmit
0xff18015deb5c36e0 2 [curl(3378123)] qdisc_pkt_len_init
0xff18015deb5c36e0 2 [curl(3378123)] netdev_core_pick_tx
0xff18015deb5c36e0 2 [curl(3378123)] netdev_pick_tx
0xff18015deb5c36e0 2 [curl(3378123)] __get_xps_queue_idx
0xff18015deb5c36e0 2 [curl(3378123)] sch_direct_xmit
0xff18015deb5c36e0 2 [curl(3378123)] validate_xmit_skb_list
0xff18015deb5c36e0 2 [curl(3378123)] validate_xmit_skb
0xff18015deb5c36e0 2 [curl(3378123)] netif_skb_features
0xff18015deb5c36e0 2 [curl(3378123)] passthru_features_check
0xff18015deb5c36e0 2 [curl(3378123)] skb_network_protocol
0xff18015deb5c36e0 2 [curl(3378123)] validate_xmit_vlan
0xff18015deb5c36e0 2 [curl(3378123)] skb_csum_hwoffload_help
0xff18015deb5c36e0 2 [curl(3378123)] validate_xmit_xfrm
0xff18015deb5c36e0 2 [curl(3378123)] dev_hard_start_xmit
0xff18015deb5c36e0 2 [curl(3378123)] skb_clone_tx_timestamp
0xff18015deb5c36e0 2 [curl(3378123)] skb_to_sgvec
0xff18015deb5c36e0 2 [curl(3378123)] __skb_to_sgvec
0xff18015deb5c36e0 4 [<empty>(0)] napi_consume_skb
0xff18015deb5c36e0 4 [<empty>(0)] skb_release_head_state
0xff18015deb5c36e0 4 [<empty>(0)] tcp_wfree
0xff18015deb5c36e0 4 [<empty>(0)] skb_release_data
0xff18015deb5c36e0 4 [<empty>(0)] kfree_skbmem
0xff18015e9c940100 4 [<empty>(0)] ip_local_out
0xff18015e9c940100 4 [<empty>(0)] __ip_local_out
0xff18015e9c940100 4 [<empty>(0)] nf_hook_slow
0xff18015e9c940100 4 [<empty>(0)] ip_output
0xff18015e9c940100 4 [<empty>(0)] nf_hook_slow
0xff18015e9c940100 4 [<empty>(0)] apparmor_ipv4_postroute
0xff18015e9c940100 4 [<empty>(0)] ip_finish_output
0xff18015e9c940100 4 [<empty>(0)] __cgroup_bpf_run_filter_skb
0xff18015e9c940100 4 [<empty>(0)] __ip_finish_output
0xff18015e9c940100 4 [<empty>(0)] ip_finish_output2
0xff18015e9c940100 4 [<empty>(0)] dev_queue_xmit
0xff18015e9c940100 4 [<empty>(0)] __dev_queue_xmit
0xff18015e9c940100 4 [<empty>(0)] qdisc_pkt_len_init
0xff18015e9c940100 4 [<empty>(0)] netdev_core_pick_tx
0xff18015e9c940100 4 [<empty>(0)] netdev_pick_tx
0xff18015e9c940100 4 [<empty>(0)] __get_xps_queue_idx
0xff18015e9c940100 4 [<empty>(0)] sch_direct_xmit
0xff18015e9c940100 4 [<empty>(0)] validate_xmit_skb_list
0xff18015e9c940100 4 [<empty>(0)] validate_xmit_skb
0xff18015e9c940100 4 [<empty>(0)] netif_skb_features
0xff18015e9c940100 4 [<empty>(0)] passthru_features_check
0xff18015e9c940100 4 [<empty>(0)] skb_network_protocol
0xff18015e9c940100 4 [<empty>(0)] validate_xmit_vlan
0xff18015e9c940100 4 [<empty>(0)] skb_csum_hwoffload_help
0xff18015e9c940100 4 [<empty>(0)] validate_xmit_xfrm
0xff18015e9c940100 4 [<empty>(0)] dev_hard_start_xmit
0xff18015e9c940100 4 [<empty>(0)] skb_clone_tx_timestamp
0xff18015e9c940100 4 [<empty>(0)] skb_to_sgvec
0xff18015e9c940100 4 [<empty>(0)] __skb_to_sgvec
0xff18015e9c940100 1 [<empty>(0)] napi_consume_skb
0xff18015e9c940100 1 [<empty>(0)] skb_release_head_state
0xff18015e9c940100 1 [<empty>(0)] __sock_wfree
0xff18015e9c940100 1 [<empty>(0)] skb_release_data
0xff18015e9c940100 1 [<empty>(0)] skb_free_head
2024/03/18 11:19:36 Received signal, exiting program..
在设置 iptables 规则之后,输出如下所示:
root@instance-820epr0w:~/tanjunchen# iptables -t filter -I OUTPUT 1 -m tcp --proto tcp --dst 1.1.1.1/32 -j DROP
root@instance-820epr0w:~/tanjunchen# ./pwru 'dst host 1.1.1.1 and tcp and dst port 80'
2024/03/18 11:22:13 Attaching kprobes (via kprobe)...
1477 / 1477 [---------------------------------------------------------------------------------------------------] 100.00% 80 p/s
2024/03/18 11:22:31 Attached (ignored 0)
2024/03/18 11:22:31 Listening for events..
SKB CPU PROCESS FUNC
0xff180162daa02ae0 0 [curl(3381789)] ip_local_out
0xff180162daa02ae0 0 [curl(3381789)] __ip_local_out
0xff180162daa02ae0 0 [curl(3381789)] nf_hook_slow
0xff180162daa02ae0 0 [curl(3381789)] kfree_skb_reason(SKB_DROP_REASON_NETFILTER_DROP)
0xff180162daa02ae0 0 [curl(3381789)] skb_release_head_state
0xff180162daa02ae0 0 [curl(3381789)] tcp_wfree
0xff180162daa02ae0 0 [curl(3381789)] skb_release_data
0xff180162daa02ae0 0 [curl(3381789)] kfree_skbmem
0xff180162daa02ae0 0 [<empty>(0)] __skb_clone
0xff180162daa02ae0 0 [<empty>(0)] __copy_skb_header
0xff180162daa02ae0 0 [<empty>(0)] ip_local_out
0xff180162daa02ae0 0 [<empty>(0)] __ip_local_out
0xff180162daa02ae0 0 [<empty>(0)] nf_hook_slow
0xff180162daa02ae0 0 [<empty>(0)] kfree_skb_reason(SKB_DROP_REASON_NETFILTER_DROP)
0xff180162daa02ae0 0 [<empty>(0)] skb_release_head_state
0xff180162daa02ae0 0 [<empty>(0)] tcp_wfree
0xff180162daa02ae0 0 [<empty>(0)] skb_release_data
0xff180162daa02ae0 0 [<empty>(0)] kfree_skbmem
0xff180162daa02ae0 0 [<empty>(0)] __skb_clone
0xff180162daa02ae0 0 [<empty>(0)] __copy_skb_header
0xff180162daa02ae0 0 [<empty>(0)] ip_local_out
0xff180162daa02ae0 0 [<empty>(0)] __ip_local_out
0xff180162daa02ae0 0 [<empty>(0)] nf_hook_slow
0xff180162daa02ae0 0 [<empty>(0)] kfree_skb_reason(SKB_DROP_REASON_NETFILTER_DROP)
0xff180162daa02ae0 0 [<empty>(0)] skb_release_head_state
0xff180162daa02ae0 0 [<empty>(0)] tcp_wfree
0xff180162daa02ae0 0 [<empty>(0)] skb_release_data
0xff180162daa02ae0 0 [<empty>(0)] kfree_skbmem
^C2024/03/18 11:22:43 Received signal, exiting program..
2024/03/18 11:22:43 Detaching kprobes...
在 nf_hook_slow 处看到 kfree_skb_reason(SKB_DROP_REASON_NETFILTER_DROP) 函数被调用,出现了丢包。 上述输出内容中,有很多重点的内核函数,如下所示:
- ip_local_out和__ip_local_out:这两个函数主要用于处理本地生成的IP数据包,它们将数据包传递给路由子系统以确定数据包的目标,然后将数据包加入到输出队列等待传输。
- nf_hook_slow:这是netfilter框架中的一个函数,它用于调用在网络钩子上注册的所有函数,以便将数据包传递给它们进行处理。这是实现防火墙功能的关键部分。
- ip_output:这个函数用于处理需要通过网络接口发送的IP数据包。
- apparmor_ipv4_postroute:这是AppArmor安全模块的一部分,用于处理路由后的IPv4数据包。
- ip_finish_output:这个函数负责将路由后的数据包发送到网络接口。
- __cgroup_bpf_run_filter_skb:这个函数用于运行eBPF程序,处理sk_buff结构的网络数据包。
- dev_queue_xmit和__dev_queue_xmit:这两个函数负责将数据包加入到设备的发送队列。
- qdisc_pkt_len_init:这个函数用于初始化数据包的队列规则。
- netdev_core_pick_tx和netdev_pick_tx:这两个函数用于在多队列网络设备中选择适合发送数据包的队列。
- __get_xps_queue_idx:此函数用于获取XPS队列索引。
- sch_direct_xmit:此函数用于直接发送数据包。
- validate_xmit_skb_list和validate_xmit_skb:这些函数用于验证要发送的数据包的各种特性和条件。
- netif_skb_features:这个函数用于获取网络设备支持的数据包特性。
- skb_network_protocol:此函数用于获取数据包的网络协议。
- dev_hard_start_xmit:此函数用于开始对数据包的硬件传输。
- skb_clone_tx_timestamp:此函数用于复制数据包的发送时间戳。
- skb_to_sgvec和__skb_to_sgvec:这两个函数用于将数据包从sk_buff结构转换为scatter/gather向量,用于DMA数据传输。
- napi_consume_skb:此函数用于在数据包处理完成后释放sk_buff结构。
- skb_release_head_state和skb_release_data:这两个函数用于释放sk_buff结构的头部和数据部分。
- tcp_wfree:此函数用于在TCP数据包发送后释放其内存。
- kfree_skbmem:此函数用于释放sk_buff结构占用的内存。
- kfree_skb_reason(SKB_DROP_REASON_NETFILTER_DROP):这个函数用于释放由于被Netfilter(Linux内核网络过滤模块)丢弃的网络数据包所占用的内存资源。SKB_DROP_REASON_NETFILTER_DROP是一个常量,它指示网络数据包是由于Netfilter而被丢弃的。
- skb_release_head_state:这个函数用于释放网络数据包的头部状态。在网络数据包被处理后,协议栈会调用这个函数来释放数据包头部状态所占用的资源。
- tcp_wfree:这个函数用于在TCP数据包被发送后释放其内存。在TCP数据包被成功发送后,协议栈会调用这个函数来释放数据包所占用的内存资源。
- skb_release_data:这个函数用于释放网络数据包的数据部分。在数据包被处理后,协议栈会调用这个函数来释放数据部分所占用的资源。
Kernel 5.15 内核 nf_hook_slow 的源码如下所示:(kfree_skb_reason 是在 5.10 版本的内核中被引入的)。
/* Returns 1 if okfn() needs to be executed by the caller,
* -EPERM for NF_DROP, 0 otherwise. Caller must hold rcu_read_lock. */
int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state,
const struct nf_hook_entries *e, unsigned int s)
{
unsigned int verdict;
int ret;
for (; s < e->num_hook_entries; s++) {
verdict = nf_hook_entry_hookfn(&e->hooks[s], skb, state);
switch (verdict & NF_VERDICT_MASK) {
case NF_ACCEPT:
break;
case NF_DROP:
// 内存 kfree_skb_reason
kfree_skb_reason(skb,
SKB_DROP_REASON_NETFILTER_DROP);
ret = NF_DROP_GETERR(verdict);
if (ret == 0)
ret = -EPERM;
return ret;
case NF_QUEUE:
ret = nf_queue(skb, state, s, verdict);
if (ret == 1)
continue;
return ret;
default:
/* Implicit handling for NF_STOLEN, as well as any other
* non conventional verdicts.
*/
return 0;
}
}
return 1;
}
Kernel 5.10 内核 nf_hook_slow 的源码如下所示:
/* Returns 1 if okfn() needs to be executed by the caller,
* -EPERM for NF_DROP, 0 otherwise. Caller must hold rcu_read_lock. */
int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state,
const struct nf_hook_entries *e, unsigned int s)
{
unsigned int verdict;
int ret;
for (; s < e->num_hook_entries; s++) {
verdict = nf_hook_entry_hookfn(&e->hooks[s], skb, state);
switch (verdict & NF_VERDICT_MASK) {
case NF_ACCEPT:
break;
case NF_DROP:
// 内存 kfree_skb
kfree_skb(skb);
ret = NF_DROP_GETERR(verdict);
if (ret == 0)
ret = -EPERM;
return ret;
case NF_QUEUE:
ret = nf_queue(skb, state, s, verdict);
if (ret == 1)
continue;
return ret;
default:
/* Implicit handling for NF_STOLEN, as well as any other
* non conventional verdicts.
*/
return 0;
}
}
return 1;
}
原理与源码
pwru 本质上是向 kprobe 注册了一些 eBPF 代码,根据 pwru 传入的参数可以更新 eBPF Map,改变限制条件,从而更新输出。在 FilterCfg 里面设置了过滤的 IP 地址和协议等条件。
type FilterCfg struct {
FilterNetns uint32
FilterMark uint32
FilterIfindex uint32
// TODO: if there are more options later, then you can consider using a bit map
OutputMeta uint8
OutputTuple uint8
OutputSkb uint8
OutputStack uint8
IsSet byte
TrackSkb byte
}
会根据 pwru 传入的参数更新过滤条件。
func GetConfig(flags *Flags) (cfg FilterCfg, err error) {
cfg = FilterCfg{
FilterMark: flags.FilterMark,
IsSet: 1,
}
if flags.OutputSkb {
cfg.OutputSkb = 1
}
if flags.OutputMeta {
cfg.OutputMeta = 1
}
if flags.OutputTuple {
cfg.OutputTuple = 1
}
if flags.OutputStack {
cfg.OutputStack = 1
}
if flags.FilterTrackSkb {
cfg.TrackSkb = 1
}
netnsID, ns, err := parseNetns(flags.FilterNetns)
if err != nil {
err = fmt.Errorf("Failed to retrieve netns %s: %w", flags.FilterNetns, err)
return
}
if flags.FilterIfname != "" || flags.FilterNetns != "" {
cfg.FilterNetns = netnsID
}
if cfg.FilterIfindex, err = parseIfindex(flags.FilterIfname, ns); err != nil {
return
}
return
}
eBPF 内核态的代码如下所示:
// SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
/* Copyright Martynas Pumputis */
/* Copyright Authors of Cilium */
#include "vmlinux.h"
#include "bpf/bpf_helpers.h"
#include "bpf/bpf_core_read.h"
#include "bpf/bpf_tracing.h"
#include "bpf/bpf_ipv6.h"
#define PRINT_SKB_STR_SIZE 2048
#define ETH_P_IP 0x800
#define ETH_P_IPV6 0x86dd
const static bool TRUE = true;
volatile const static __u64 BPF_PROG_ADDR = 0;
union addr {
u32 v4addr;
struct {
u64 d1;
u64 d2;
} v6addr;
} __attribute__((packed));
struct skb_meta {
u32 netns;
u32 mark;
u32 ifindex;
u32 len;
u32 mtu;
u16 protocol;
u16 pad;
} __attribute__((packed));
struct tuple {
union addr saddr;
union addr daddr;
u16 sport;
u16 dport;
u16 l3_proto;
u8 l4_proto;
u8 pad;
} __attribute__((packed));
u64 print_skb_id = 0;
struct event_t {
u32 pid;
u32 type;
u64 addr;
u64 skb_addr;
u64 ts;
typeof(print_skb_id) print_skb_id;
struct skb_meta meta;
struct tuple tuple;
s64 print_stack_id;
u64 param_second;
u32 cpu_id;
} __attribute__((packed));
#define MAX_QUEUE_ENTRIES 10000
struct {
__uint(type, BPF_MAP_TYPE_QUEUE);
__type(value, struct event_t);
__uint(max_entries, MAX_QUEUE_ENTRIES);
} events SEC(".maps");
#define MAX_TRACK_SIZE 1024
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__type(key, __u64);
__type(value, bool);
__uint(max_entries, MAX_TRACK_SIZE);
} skb_addresses SEC(".maps");
struct config {
u32 netns;
u32 mark;
u32 ifindex;
u8 output_meta;
u8 output_tuple;
u8 output_skb;
u8 output_stack;
u8 is_set;
u8 track_skb;
} __attribute__((packed));
static volatile const struct config CFG;
#define cfg (&CFG)
#define MAX_STACK_DEPTH 50
struct {
__uint(type, BPF_MAP_TYPE_STACK_TRACE);
__uint(max_entries, 256);
__uint(key_size, sizeof(u32));
__uint(value_size, MAX_STACK_DEPTH * sizeof(u64));
} print_stack_map SEC(".maps");
#ifdef OUTPUT_SKB
struct {
__uint(type, BPF_MAP_TYPE_ARRAY);
__uint(max_entries, 256);
__type(key, u32);
__type(value, char[PRINT_SKB_STR_SIZE]);
} print_skb_map SEC(".maps");
#endif
static __always_inline u32
get_netns(struct sk_buff *skb) {
u32 netns = BPF_CORE_READ(skb, dev, nd_net.net, ns.inum);
// if skb->dev is not initialized, try to get ns from sk->__sk_common.skc_net.net->ns.inum
if (netns == 0) {
struct sock *sk = BPF_CORE_READ(skb, sk);
if (sk != NULL) {
netns = BPF_CORE_READ(sk, __sk_common.skc_net.net, ns.inum);
}
}
return netns;
}
static __always_inline bool
filter_meta(struct sk_buff *skb) {
if (cfg->netns && get_netns(skb) != cfg->netns) {
return false;
}
if (cfg->mark && BPF_CORE_READ(skb, mark) != cfg->mark) {
return false;
}
if (cfg->ifindex != 0 && BPF_CORE_READ(skb, dev, ifindex) != cfg->ifindex) {
return false;
}
return true;
}
static __noinline bool
filter_pcap_ebpf_l3(void *_skb, void *__skb, void *___skb, void *data, void* data_end)
{
return data != data_end && _skb == __skb && __skb == ___skb;
}
static __always_inline bool
filter_pcap_l3(struct sk_buff *skb)
{
void *skb_head = BPF_CORE_READ(skb, head);
void *data = skb_head + BPF_CORE_READ(skb, network_header);
void *data_end = skb_head + BPF_CORE_READ(skb, tail);
return filter_pcap_ebpf_l3((void *)skb, (void *)skb, (void *)skb, data, data_end);
}
static __noinline bool
filter_pcap_ebpf_l2(void *_skb, void *__skb, void *___skb, void *data, void* data_end)
{
return data != data_end && _skb == __skb && __skb == ___skb;
}
static __always_inline bool
filter_pcap_l2(struct sk_buff *skb)
{
void *skb_head = BPF_CORE_READ(skb, head);
void *data = skb_head + BPF_CORE_READ(skb, mac_header);
void *data_end = skb_head + BPF_CORE_READ(skb, tail);
return filter_pcap_ebpf_l2((void *)skb, (void *)skb, (void *)skb, data, data_end);
}
static __always_inline bool
filter_pcap(struct sk_buff *skb) {
if (BPF_CORE_READ(skb, mac_len) == 0)
return filter_pcap_l3(skb);
return filter_pcap_l2(skb);
}
static __always_inline bool
filter(struct sk_buff *skb) {
return filter_pcap(skb) && filter_meta(skb);
}
static __always_inline void
set_meta(struct sk_buff *skb, struct skb_meta *meta) {
meta->netns = get_netns(skb);
meta->mark = BPF_CORE_READ(skb, mark);
meta->len = BPF_CORE_READ(skb, len);
meta->protocol = BPF_CORE_READ(skb, protocol);
meta->ifindex = BPF_CORE_READ(skb, dev, ifindex);
meta->mtu = BPF_CORE_READ(skb, dev, mtu);
}
static __always_inline void
set_tuple(struct sk_buff *skb, struct tuple *tpl) {
void *skb_head = BPF_CORE_READ(skb, head);
u16 l3_off = BPF_CORE_READ(skb, network_header);
u16 l4_off;
struct iphdr *l3_hdr = (struct iphdr *) (skb_head + l3_off);
u8 ip_vsn = BPF_CORE_READ_BITFIELD_PROBED(l3_hdr, version);
if (ip_vsn == 4) {
struct iphdr *ip4 = (struct iphdr *) l3_hdr;
BPF_CORE_READ_INTO(&tpl->saddr, ip4, saddr);
BPF_CORE_READ_INTO(&tpl->daddr, ip4, daddr);
tpl->l4_proto = BPF_CORE_READ(ip4, protocol);
tpl->l3_proto = ETH_P_IP;
l4_off = l3_off + BPF_CORE_READ_BITFIELD_PROBED(ip4, ihl) * 4;
} else if (ip_vsn == 6) {
struct ipv6hdr *ip6 = (struct ipv6hdr *) l3_hdr;
BPF_CORE_READ_INTO(&tpl->saddr, ip6, saddr);
BPF_CORE_READ_INTO(&tpl->daddr, ip6, daddr);
tpl->l4_proto = BPF_CORE_READ(ip6, nexthdr); // TODO: ipv6 l4 protocol
tpl->l3_proto = ETH_P_IPV6;
l4_off = l3_off + ipv6_hdrlen(ip6);
}
if (tpl->l4_proto == IPPROTO_TCP) {
struct tcphdr *tcp = (struct tcphdr *) (skb_head + l4_off);
tpl->sport= BPF_CORE_READ(tcp, source);
tpl->dport= BPF_CORE_READ(tcp, dest);
} else if (tpl->l4_proto == IPPROTO_UDP) {
struct udphdr *udp = (struct udphdr *) (skb_head + l4_off);
tpl->sport= BPF_CORE_READ(udp, source);
tpl->dport= BPF_CORE_READ(udp, dest);
}
}
static __always_inline void
set_skb_btf(struct sk_buff *skb, typeof(print_skb_id) *event_id) {
#ifdef OUTPUT_SKB
static struct btf_ptr p = {};
typeof(print_skb_id) id;
char *str;
p.type_id = bpf_core_type_id_kernel(struct sk_buff);
p.ptr = skb;
id = __sync_fetch_and_add(&print_skb_id, 1) % 256;
str = bpf_map_lookup_elem(&print_skb_map, (u32 *) &id);
if (!str) {
return;
}
if (bpf_snprintf_btf(str, PRINT_SKB_STR_SIZE, &p, sizeof(p), 0) < 0) {
return;
}
*event_id = id;
#endif
}
static __always_inline void
set_output(void *ctx, struct sk_buff *skb, struct event_t *event) {
if (cfg->output_meta) {
set_meta(skb, &event->meta);
}
if (cfg->output_tuple) {
set_tuple(skb, &event->tuple);
}
if (cfg->output_skb) {
set_skb_btf(skb, &event->print_skb_id);
}
if (cfg->output_stack) {
event->print_stack_id = bpf_get_stackid(ctx, &print_stack_map, BPF_F_FAST_STACK_CMP);
}
}
static __noinline bool
handle_everything(struct sk_buff *skb, void *ctx, struct event_t *event) {
bool tracked = false;
u64 skb_addr = (u64) skb;
if (cfg->is_set) {
if (cfg->track_skb && bpf_map_lookup_elem(&skb_addresses, &skb_addr)) {
tracked = true;
goto cont;
}
if (!filter(skb)) {
return false;
}
cont:
set_output(ctx, skb, event);
}
if (cfg->track_skb && !tracked) {
bpf_map_update_elem(&skb_addresses, &skb_addr, &TRUE, BPF_ANY);
}
event->pid = bpf_get_current_pid_tgid() >> 32;
event->ts = bpf_ktime_get_ns();
event->cpu_id = bpf_get_smp_processor_id();
return true;
}
static __always_inline int
kprobe_skb(struct sk_buff *skb, struct pt_regs *ctx, bool has_get_func_ip) {
struct event_t event = {};
if (!handle_everything(skb, ctx, &event))
return BPF_OK;
event.skb_addr = (u64) skb;
event.addr = has_get_func_ip ? bpf_get_func_ip(ctx) : PT_REGS_IP(ctx);
event.param_second = PT_REGS_PARM2(ctx);
bpf_map_push_elem(&events, &event, BPF_EXIST);
return BPF_OK;
}
#ifdef HAS_KPROBE_MULTI
#define PWRU_KPROBE_TYPE "kprobe.multi"
#define PWRU_HAS_GET_FUNC_IP true
#else
#define PWRU_KPROBE_TYPE "kprobe"
#define PWRU_HAS_GET_FUNC_IP false
#endif /* HAS_KPROBE_MULTI */
#define PWRU_ADD_KPROBE(X) \
SEC(PWRU_KPROBE_TYPE "/skb-" #X) \
int kprobe_skb_##X(struct pt_regs *ctx) { \
struct sk_buff *skb = (struct sk_buff *) PT_REGS_PARM##X(ctx); \
return kprobe_skb(skb, ctx, PWRU_HAS_GET_FUNC_IP); \
}
PWRU_ADD_KPROBE(1)
PWRU_ADD_KPROBE(2)
PWRU_ADD_KPROBE(3)
PWRU_ADD_KPROBE(4)
PWRU_ADD_KPROBE(5)
#undef PWRU_KPROBE
#undef PWRU_HAS_GET_FUNC_IP
#undef PWRU_KPROBE_TYPE
SEC("kprobe/skb_lifetime_termination")
int kprobe_skb_lifetime_termination(struct pt_regs *ctx) {
u64 skb = (u64) PT_REGS_PARM1(ctx);
bpf_map_delete_elem(&skb_addresses, &skb);
return BPF_OK;
}
static __always_inline int
track_skb_clone(u64 old, u64 new) {
if (bpf_map_lookup_elem(&skb_addresses, &old))
bpf_map_update_elem(&skb_addresses, &new, &TRUE, BPF_ANY);
return BPF_OK;
}
SEC("fexit/skb_clone")
int BPF_PROG(fexit_skb_clone, u64 old, gfp_t mask, u64 new) {
return track_skb_clone(old, new);
}
SEC("fexit/skb_copy")
int BPF_PROG(fexit_skb_copy, u64 old, gfp_t mask, u64 new) {
return track_skb_clone(old, new);
}
SEC("fentry/tc")
int BPF_PROG(fentry_tc, struct sk_buff *skb) {
struct event_t event = {};
if (!handle_everything(skb, ctx, &event))
return BPF_OK;
event.skb_addr = (u64) skb;
event.addr = BPF_PROG_ADDR;
bpf_map_push_elem(&events, &event, BPF_EXIST);
return BPF_OK;
}
char __license[] SEC("license") = "Dual BSD/GPL";
我们在上述的代码中看到配置执行了 bpf_map_lookup_elem,然后执行 filter 过滤功能。 pwru 借助 github.com/cilium/ebpf/cmd/bpf2go 进行编译,build.go 源码如下所示:
// SPDX-License-Identifier: Apache-2.0
// Copyright (C) 2021 Authors of Cilium */
//go:generate sh -c "echo Generating for $TARGET_GOARCH"
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go -target $TARGET_GOARCH -cc clang -no-strip KProbePWRU ./bpf/kprobe_pwru.c -- -DOUTPUT_SKB -I./bpf/headers -Wno-address-of-packed-member
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go -target $TARGET_GOARCH -cc clang -no-strip KProbeMultiPWRU ./bpf/kprobe_pwru.c -- -DOUTPUT_SKB -DHAS_KPROBE_MULTI -I./bpf/headers -Wno-address-of-packed-member
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go -target $TARGET_GOARCH -cc clang -no-strip KProbePWRUWithoutOutputSKB ./bpf/kprobe_pwru.c -- -I./bpf/headers -Wno-address-of-packed-member
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go -target $TARGET_GOARCH -cc clang -no-strip KProbeMultiPWRUWithoutOutputSKB ./bpf/kprobe_pwru.c -- -D HAS_KPROBE_MULTI -I./bpf/headers -Wno-address-of-packed-member
package main
用户态使用 perf-event 输出 Map 中的数据如下所示:
func main(){
printSkbMap := coll.Maps["print_skb_map"]
printStackMap := coll.Maps["print_stack_map"]
output, err := pwru.NewOutput(&flags, printSkbMap, printStackMap, addr2name, useKprobeMulti, btfSpec)
if err != nil {
log.Fatalf("Failed to create outputer: %s", err)
}
defer output.Close()
output.PrintHeader()
if !flags.OutputJson {
output.PrintHeader()
}
defer func() {
select {
case <-ctx.Done():
log.Println("Received signal, exiting program..")
default:
log.Printf("Printed %d events, exiting program..\n", flags.OutputLimitLines)
}
}()
}
参考
- https://fosdem.org/2024/events/attachments/fosdem-2024-2364-packet-where-are-you-an-ebpf-based-tool-for-diagnosing-linux-networking/slides/22481/FOSDEM_2024_Packet_Where_aRe_You_HGLqMLe.pdf
- https://github.com/cilium/pwru?tab=readme-ov-file