io_uring API

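Motivation

I came to this topic while looking into ScyllaDB, whose speed comes largely from efficient I/O. ScyllaDB does not actually use io_uring, but that hardly matters: had io_uring been proposed before Seastar, the execution framework ScyllaDB is built on, it would probably have been adopted.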

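Background

After nginx asynchronously submits a file read to the kernel, the kernel tells the I/O device to carry out the operation on its own, so the nginx process can keep making full use of the CPU. In addition, when many read requests pile up in the I/O device's queue, the kernel's "elevator algorithm" comes into play and lowers the cost of randomly reading disk sectors.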

I/O models

[Figure: all kinds of I/O]

libaio: native AIO implemented in the Linux kernel
POSIX AIO: AIO implemented in glibc

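Shortcomings of AIO

AIO is still far from perfect and has many defects of its own. Taking nginx as the example again: nginx currently uses AIO only for reading files. A normal file write usually returns as soon as the data reaches memory (the page cache), which is very efficient, whereas AIO supports only direct writes, so switching writes to AIO would make them noticeably slower.

Only direct I/O is supported. With AIO, files must be opened with O_DIRECT, so the file system cache cannot be used to cache the I/O requests. There are also alignment restrictions: because the disk is accessed directly, the size of every buffer written must be a multiple of the file system block size, and the buffer must be aligned to the memory page size. These restrictions rule AIO out in many scenarios.
It can still block, so the semantics are incomplete. Even when the application wants the system to perform I/O asynchronously, the call may still block in practice. io_getevents(2) calls read_events to read AIO completion events; read_events waits on aio_read_events through wait_event_interruptible_hrtimeout, and if the condition does not hold (the events have not completed) it calls __wait_event_hrtimeout and goes to sleep (user space can set a maximum wait time).
Copying is expensive. Each I/O submission copies 64+8 bytes and each I/O completion copies 32 bytes, 104 bytes in total. Whether this is acceptable depends on the size of a single I/O: for large I/Os the cost is negligible, but with many small I/Os the copying has a noticeable impact.
The API is unfriendly. Each I/O needs at least two system calls (submit and wait-for-completion), and completion events have to be consumed very carefully to avoid losing events.
System calls are expensive. As a consequence of the previous point, io_submit/io_getevents cause considerable system call overhead. On machines affected by Spectre/Meltdown (for example Meltdown, CVE-2017-5754), system call performance drops sharply once the mitigations are enabled. In storage workloads, frequent system calls therefore hurt performance significantly.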
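The alignment restriction in the first item can be made concrete with a small check. This is only a sketch: `demo_direct_io_ok` is a hypothetical helper, and the alignment value passed in is an assumption (the real requirement depends on the device and file system).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true when the buffer address, the length and the
 * file offset are all multiples of `align`, an assumed power-of-two
 * block size such as 512 or 4096. */
bool demo_direct_io_ok(const void *buf, size_t len, uint64_t offset,
                       size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);  /* power of two */

    size_t mask = align - 1;
    return ((uintptr_t)buf & mask) == 0   /* buffer address aligned */
        && (len & mask) == 0              /* whole number of blocks */
        && (offset & mask) == 0;          /* file offset aligned */
}
```

A caller would typically allocate the buffer with an aligned allocator and run a check like this before issuing a direct read or write.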

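The io_uring interface

User-space interface:

io_uring is built on just three system calls exposed to user space:

(1) io_uring_setup: creates a new io_uring context; the kernel and the application exchange messages through a memory region they share.

(2) io_uring_enter: submits requests and reaps completions.

(3) io_uring_register: registers buffers shared between user space and the kernel.

The first two system calls are enough to use io_uring.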

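Data structures

[Figure: SQ, CQ and the SQEs array]

Between the SQ and the CQ there is an array called SQEs. Its purpose is to let the ring buffer submit requests that are not contiguous in memory: the kernel completes requests in no fixed order, so the slots in SQEs that are free for new requests may be scattered.

Each SQ entry stores an index into the SQEs array rather than the request itself; the actual requests live only in the SQEs array. (CQ entries, by contrast, are complete io_uring_cqe structures, as the struct io_rings definition further down shows.) A batch of requests that are not contiguous in SQEs can therefore be submitted at once.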
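The indirection can be sketched with a tiny single-threaded model. All `demo_` names are hypothetical stand-ins, the ring size is an arbitrary assumption, and the real structures also need the memory ordering that this sketch leaves out.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_ENTRIES 8u                 /* assumed ring size, power of two */

struct demo_sqe {                       /* stand-in for struct io_uring_sqe */
    uint8_t  opcode;
    uint64_t user_data;
};

struct demo_sq {
    uint32_t head, tail;                /* free-running counters */
    uint32_t array[DEMO_ENTRIES];       /* ring of indices into sqes[] */
    struct demo_sqe sqes[DEMO_ENTRIES]; /* the requests themselves */
};

/* Producer side: publish the request stored in slot `sqe_idx`. */
void demo_sq_push(struct demo_sq *sq, uint32_t sqe_idx)
{
    assert(sqe_idx < DEMO_ENTRIES);
    assert(sq->tail - sq->head < DEMO_ENTRIES);   /* ring not full */
    sq->array[sq->tail & (DEMO_ENTRIES - 1)] = sqe_idx;
    sq->tail++;
}

/* Consumer side: fetch the next request through the index ring. */
const struct demo_sqe *demo_sq_pop(struct demo_sq *sq)
{
    assert(sq->head != sq->tail);                 /* ring not empty */
    uint32_t idx = sq->array[sq->head & (DEMO_ENTRIES - 1)];
    sq->head++;
    return &sq->sqes[idx];
}
```

Pushing slots 5 and 2 in that order makes the consumer see the request in slot 5 first, even though the two slots are neither adjacent nor in ascending order.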

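[Figure: SQ and CQ]

Because the memory regions described above are all allocated by the kernel, a user program cannot access them directly. During initialization, the setup interface returns an fd for these regions, and the application calls mmap on that fd to share the memory with the kernel. The returned parameters describe where each of the three regions sits inside the shared memory, so that the user-space program can access them.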

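Implementation ideas

Solving the "system calls are expensive" problem

The question to ask is whether every operation really needs its own system call. If the work of many system calls can be folded into a bounded number of calls, the system call cost becomes a constant instead of growing with the number of requests.

Solving the "copying is expensive" problem

Memory is copied on submission and completion because the application and the kernel communicate by copying data. Avoiding that means rethinking how the two communicate. Communication does not have to involve copying: with existing techniques the application and the kernel can share memory and use it to talk to each other, following a producer-consumer model.

The best way to get zero-copy between user space and the kernel is a memory-mapped region shared by both sides: user space writes data into it and notifies the kernel to use it, or the kernel fills it in and user space uses it. This calls for a pair of shared ring buffers between the application and the kernel.

The shared ring buffer design brings the following benefits:

No memory copies between the application and the kernel when submitting and completing requests
With the advanced SQPOLL feature, the application does not need to make system calls
Lock-free operation: synchronization relies on memory ordering, and moving a few head and tail pointers is enough for fast interaction.

One ring passes data from the user to the kernel and the other passes data from the kernel to the user; one side only reads and the other only writes.
- In the submission queue (SQ), the application is the producer of I/O submissions and the kernel is the consumer.
- In the completion queue (CQ), the kernel is the producer of completion events and the application is the consumer.

The kernel controls the head of the SQ ring and the tail of the CQ ring; the application controls the tail of the SQ ring and the head of the CQ ring.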
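The ownership rule (one side writes only the tail, the other writes only the head) is what makes a lock-free single-producer, single-consumer ring possible. Below is a minimal sketch using C11 atomics; the `demo_` names and the ring size are assumptions, and this is a generic illustration of the technique rather than the kernel's implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 4u                    /* assumed size, power of two */
#define RING_MASK    (RING_ENTRIES - 1)

static_assert((RING_ENTRIES & RING_MASK) == 0, "size must be a power of two");

struct demo_ring {
    _Atomic uint32_t head;                 /* written only by the consumer */
    _Atomic uint32_t tail;                 /* written only by the producer */
    uint64_t slots[RING_ENTRIES];
};

/* Producer: returns false when the ring is full. */
bool demo_ring_produce(struct demo_ring *r, uint64_t value)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail - head == RING_ENTRIES)
        return false;
    r->slots[tail & RING_MASK] = value;
    /* release: the slot write is visible before the new tail is */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer: returns false when the ring is empty. */
bool demo_ring_consume(struct demo_ring *r, uint64_t *value)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head == tail)
        return false;
    *value = r->slots[head & RING_MASK];
    /* release: the slot has been read before it is handed back */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

For the SQ the application plays the producer role and the kernel the consumer role; for the CQ the roles are swapped.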

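[Figure: io_uring ring buffers]

So what data does each of them hold?
Suppose buffer A is written by user space and read by the kernel: I/O request data is written into it and the kernel is then notified to read it. Suppose buffer B is written by the kernel and read by user space: its job is to return completion status, reporting whether an entry in buffer A succeeded or failed, along with related information.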
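How an entry in buffer B points back at a request from buffer A can be sketched as follows. The three fields of `demo_cqe` mirror `struct io_uring_cqe` (`user_data`, `res`, `flags`); the request table and the choice to use `user_data` as a table index are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct demo_cqe {                 /* mirrors the fields of struct io_uring_cqe */
    uint64_t user_data;           /* copied from the submission entry */
    int32_t  res;                 /* result; negative values are -errno */
    uint32_t flags;
};

struct demo_request {             /* hypothetical per-request bookkeeping */
    bool    done;
    bool    failed;
    int32_t result;
};

/* Record one completion; user_data is assumed to be an index into `table`. */
void demo_complete(struct demo_request *table, uint32_t nr,
                   const struct demo_cqe *cqe)
{
    assert(cqe->user_data < nr);

    struct demo_request *req = &table[cqe->user_data];
    req->done   = true;
    req->failed = cqe->res < 0;
    req->result = cqe->res;
}
```

Because completions can arrive in any order, `user_data` is the only link between a completion and the submission that caused it.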

Code statistics

[Figure: cloc statistics for io_uring]

Data layout

Data structure definitions

The data structures are as follows

// SQ/CQ
struct io_uring
{
    u32 head;
    u32 tail;
};

/*
 * This data is shared(from kernel) with the application through the mmap at offsets
 * IORING_OFF_SQ_RING and IORING_OFF_CQ_RING.
 *
 * The offsets to the member fields are published through struct
 * io_sqring_offsets when calling io_uring_setup.
 */
struct io_rings {
	/*
	 * Head and tail offsets into the ring; the offsets need to be
	 * masked to get valid indices.
	 *
	 * The kernel controls head of the sq ring and the tail of the cq ring,
	 * and the application controls tail of the sq ring and the head of the
	 * cq ring.
	 */
	struct io_uring		sq, cq;
	/*
	 * Bitmasks to apply to head and tail offsets (constant, equals
	 * ring_entries - 1, i.e., sq_ring_mask = sq_ring_entries - 1, cq_ring_mask = cq_ring_entries - 1)
	 */
	u32			sq_ring_mask, cq_ring_mask;
	/* Ring sizes (constant, power of 2) */
	u32			sq_ring_entries, cq_ring_entries;
	/*
	 * Number of invalid entries dropped by the kernel due to
	 * invalid index stored in array
	 *
	 * Written by the kernel, shouldn't be modified by the
	 * application (i.e. get number of "new events" by comparing to
	 * cached value).
	 *
	 * After a new SQ head value was read by the application this
	 * counter includes all submissions that were dropped reaching
	 * the new SQ head (and possibly more).
	 */
	u32			sq_dropped;
	/*
	 * Runtime SQ flags
	 *
	 * Written by the kernel, shouldn't be modified by the
	 * application.
	 *
	 * The application needs a full memory barrier before checking
	 * for IORING_SQ_NEED_WAKEUP after updating the sq tail.
	 */
	atomic_t		sq_flags;
	/*
	 * Runtime CQ flags
	 *
	 * Written by the application, shouldn't be modified by the
	 * kernel.
	 */
	u32			cq_flags;
	/*
	 * Number of completion events lost because the queue was full;
	 * this should be avoided by the application by making sure
	 * there are not more requests pending than there is space in
	 * the completion queue.
	 *
	 * Written by the kernel, shouldn't be modified by the
	 * application (i.e. get number of "new events" by comparing to
	 * cached value).
	 *
	 * As completion events come in out of order this counter is not
	 * ordered with any other data.
	 */
	u32			cq_overflow;
	/*
	 * Ring buffer of completion events.
	 *
	 * The kernel writes completion events fresh every time they are
	 * produced, so the application is allowed to modify pending
	 * entries.
	 */
	struct io_uring_cqe	cqes[] ____cacheline_aligned_in_smp;
};
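The comments about masking can be illustrated with two small helpers. They are a sketch with hypothetical names: head and tail are free-running 32-bit counters, the mask turns a counter into a slot index, and unsigned subtraction gives the number of queued entries even after a counter has wrapped.

```c
#include <assert.h>
#include <stdint.h>

/* Slot index for a free-running counter; `entries` must be a power of two,
 * so `entries - 1` is the ring mask. */
uint32_t demo_ring_index(uint32_t counter, uint32_t entries)
{
    assert(entries != 0 && (entries & (entries - 1)) == 0);
    return counter & (entries - 1);   /* same result as counter % entries */
}

/* Number of queued entries; stays correct when tail has wrapped past
 * UINT32_MAX because the subtraction is done modulo 2^32. */
uint32_t demo_ring_used(uint32_t head, uint32_t tail)
{
    return (uint32_t)(tail - head);
}
```

For example, with 8 entries a counter value of 9 maps to slot 1, and head = UINT32_MAX - 1 with tail = 2 still reports 4 queued entries.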

struct io_ring_ctx {
	/* const or read-mostly hot data */
	struct {
		unsigned int		flags;
		unsigned int		drain_next: 1;
		unsigned int		restricted: 1;
		unsigned int		off_timeout_used: 1;
		unsigned int		drain_active: 1;
		unsigned int		has_evfd: 1;
		/* all CQEs should be posted only by the submitter task */
		unsigned int		task_complete: 1;
		unsigned int		lockless_cq: 1;
		unsigned int		syscall_iopoll: 1;
		unsigned int		poll_activated: 1;
		unsigned int		drain_disabled: 1;
		unsigned int		compat: 1;

		struct task_struct	*submitter_task;
		struct io_rings		*rings;
		struct percpu_ref	refs;

		enum task_work_notify_mode	notify_method;
	} ____cacheline_aligned_in_smp;

	/* submission data */
	struct {
		struct mutex		uring_lock;

		/*
		 * Ring buffer of indices into array of io_uring_sqe, which is
		 * mmapped by the application using the IORING_OFF_SQES offset.
		 *
		 * This indirection could e.g. be used to assign fixed
		 * io_uring_sqe entries to operations and only submit them to
		 * the queue when needed.
		 *
		 * The kernel modifies neither the indices array nor the entries
		 * array.
		 */
		u32			*sq_array; // indices array
		struct io_uring_sqe	*sq_sqes;
		unsigned		cached_sq_head;
		unsigned		sq_entries;

		/*
		 * Fixed resources fast path, should be accessed only under
		 * uring_lock, and updated through io_uring_register(2)
		 */
		struct io_rsrc_node	*rsrc_node;
		atomic_t		cancel_seq;
		struct io_file_table	file_table;
		unsigned		nr_user_files;
		unsigned		nr_user_bufs;
		struct io_mapped_ubuf	**user_bufs;

		struct io_submit_state	submit_state;

		struct io_buffer_list	*io_bl;
		struct xarray		io_bl_xa;

		struct io_hash_table	cancel_table_locked;
		struct io_alloc_cache	apoll_cache;
		struct io_alloc_cache	netmsg_cache;

		/*
		 * ->iopoll_list is protected by the ctx->uring_lock for
		 * io_uring instances that don't use IORING_SETUP_SQPOLL.
		 * For SQPOLL, only the single threaded io_sq_thread() will
		 * manipulate the list, hence no extra locking is needed there.
		 */
		struct io_wq_work_list	iopoll_list;
		bool			poll_multi_queue;
	} ____cacheline_aligned_in_smp;

	struct {
		/*
		 * We cache a range of free CQEs we can use, once exhausted it
		 * should go through a slower range setup, see __io_get_cqe()
		 */
		struct io_uring_cqe	*cqe_cached;
		struct io_uring_cqe	*cqe_sentinel;

		unsigned		cached_cq_tail;
		unsigned		cq_entries;
		struct io_ev_fd	__rcu	*io_ev_fd;
		unsigned		cq_extra;
	} ____cacheline_aligned_in_smp;

	/*
	 * task_work and async notification delivery cacheline. Expected to
	 * regularly bounce b/w CPUs.
	 */
	struct {
		struct llist_head	work_llist;
		unsigned long		check_cq;
		atomic_t		cq_wait_nr;
		atomic_t		cq_timeouts;
		struct wait_queue_head	cq_wait;
	} ____cacheline_aligned_in_smp;

	/* timeouts */
	struct {
		spinlock_t		timeout_lock;
		struct list_head	timeout_list;
		struct list_head	ltimeout_list;
		unsigned		cq_last_tm_flush;
	} ____cacheline_aligned_in_smp;

	struct io_uring_cqe	completion_cqes[16];

	spinlock_t		completion_lock;

	/* IRQ completion list, under ->completion_lock */
	struct io_wq_work_list	locked_free_list;
	unsigned int		locked_free_nr;

	struct list_head	io_buffers_comp;
	struct list_head	cq_overflow_list;
	struct io_hash_table	cancel_table;

	const struct cred	*sq_creds;	/* cred used for __io_sq_thread() */
	struct io_sq_data	*sq_data;	/* if using sq thread polling */

	struct wait_queue_head	sqo_sq_wait;
	struct list_head	sqd_list;

	unsigned int		file_alloc_start;
	unsigned int		file_alloc_end;

	struct xarray		personalities;
	u32			pers_next;

	struct list_head	io_buffers_cache;

	/* Keep this last, we don't need it for the fast path */
	struct wait_queue_head		poll_wq;
	struct io_restriction		restrictions;

	/* slow path rsrc auxilary data, used by update/register */
	struct io_mapped_ubuf		*dummy_ubuf;
	struct io_rsrc_data		*file_data;
	struct io_rsrc_data		*buf_data;

	/* protected by ->uring_lock */
	struct list_head		rsrc_ref_list;
	struct io_alloc_cache		rsrc_node_cache;
	struct wait_queue_head		rsrc_quiesce_wq;
	unsigned			rsrc_quiesce;

	struct list_head		io_buffers_pages;

	#if defined(CONFIG_UNIX)
		struct socket		*ring_sock;
	#endif
	/* hashed buffered write serialization */
	struct io_wq_hash		*hash_map;

	/* Only used for accounting purposes */
	struct user_struct		*user;
	struct mm_struct		*mm_account;

	/* ctx exit and cancelation */
	struct llist_head		fallback_llist;
	struct delayed_work		fallback_work;
	struct work_struct		exit_work;
	struct list_head		tctx_list;
	struct completion		ref_comp;

	/* io-wq management, e.g. thread count */
	u32				iowq_limits[2];
	bool				iowq_limits_set;

	struct callback_head		poll_wq_task_work;
	struct list_head		defer_list;
	unsigned			sq_thread_idle;
	/* protected by ->completion_lock */
	unsigned			evfd_last_cq_tail;

	/*
	 * If IORING_SETUP_NO_MMAP is used, then the below holds
	 * the gup'ed pages for the two rings, and the sqes.
	 */
	unsigned short			n_ring_pages;
	unsigned short			n_sqe_pages;
	struct page			**ring_pages;
	struct page			**sqe_pages;
};

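Visualizing the data structures

Data layout of io_rings

The other fields of struct io_rings
struct io_uring_cqe cqes[]
u32 *sq_array // an array of u32 holding indices into sq_sqes

As the figure below shows, the io_uring_cqe array and sq_array are adjacent

The io_uring_sqe array (the real SQEs) is not contiguous with io_rings, as shown below

[Figure: io_uring_sqe]

The user only receives io_uring_params, which contains io_sqring_offsets and io_cqring_offsets. With these two structures the user can access sq_array and cqes inside io_rings, as shown below

[Figure: complete structure of io_rings]

As shown above, the user obtains io_rings through mmap and also has the io_sqring_offsets and io_cqring_offsets from io_uring_params, so it can reach every member of io_rings

[Figure: io_uring as seen from user space]

Why are sq_off/cq_off needed? Once io_rings has been obtained, couldn't its members be accessed directly? Perhaps it is because

after mmap, what the user holds for io_rings is just an address, not a structure: struct io_rings is an internal kernel type that is not exported to user space, so the offsets are the only description of its layout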
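A sketch of what "an address plus offsets" means in practice. `demo_sq_offsets` is a reduced stand-in for `struct io_sqring_offsets` (only four of its fields), and `demo_map_sq` is a hypothetical helper; the point is that user space computes each pointer as base + offset instead of relying on a struct definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-in for struct io_sqring_offsets: byte offsets from the
 * start of the mapping. */
struct demo_sq_offsets {
    uint32_t head, tail, ring_mask, array;
};

/* User-space view of the SQ ring, built purely from base + offsets. */
struct demo_sq_view {
    uint32_t *head, *tail, *ring_mask, *array;
};

struct demo_sq_view demo_map_sq(void *base, const struct demo_sq_offsets *off)
{
    assert(base != NULL && off != NULL);

    unsigned char *p = base;
    struct demo_sq_view v;

    v.head      = (uint32_t *)(p + off->head);
    v.tail      = (uint32_t *)(p + off->tail);
    v.ring_mask = (uint32_t *)(p + off->ring_mask);
    v.array     = (uint32_t *)(p + off->array);
    return v;
}
```

In a real program `base` would be the pointer returned by mmap on the ring fd and the offsets would come from the `sq_off` member of io_uring_params; with this scheme the kernel stays free to rearrange its internal layout.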

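TODO: draw the relationship between the two struct io_uring instances and sqes, cqes and sq_array

sq_array is not very useful in practice, so the kernel has an option to disable it (IORING_SETUP_NO_SQARRAY)

Basically, any place that holds io_ring_ctx and io_uring_params can locate any of the data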

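Key flows

With the data structures defined, how does the logic actually drive them? In use there are roughly three phases: preparation, submission and reaping.

The io_uring system calls are: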

#include <linux/io_uring.h>

int io_uring_setup(u32 entries, struct io_uring_params *p);

int io_uring_enter(unsigned int fd, unsigned int to_submit,
                   unsigned int min_complete, unsigned int flags,
                   sigset_t *sig);

int io_uring_register(unsigned int fd, unsigned int opcode,
                      void *arg, unsigned int nr_args);

The key flows are analysed below.

io_uring_setup

io_uring completes its preparation phase through io_uring_setup.

int io_uring_setup(u32 entries, struct io_uring_params *p);

/*
 * Passed in for io_uring_setup(2). Copied back with updated info on success
 */
struct io_uring_params {
	__u32 sq_entries;
	__u32 cq_entries;
	__u32 flags;
	__u32 sq_thread_cpu;
	__u32 sq_thread_idle;
	__u32 features;
	__u32 wq_fd;
	__u32 resv[3];
	struct io_sqring_offsets sq_off;
	struct io_cqring_offsets cq_off;
};

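The more important flags are:

IORING_SETUP_IOPOLL
- Makes the kernel reap requests from the block layer by polling. Completions are detected by busy-waiting rather than through asynchronous interrupt (IRQ) notification, which means the application has to keep calling io_uring_enter to poll the device and check whether I/O has completed. Compared with IRQs this uses more CPU but gives lower I/O latency. This mode requires the file to be opened with the O_DIRECT flag. I have not fully understood this part.
- My guess:
  - With IOPOLL enabled, requests are reaped from the block layer by polling
  - If SQPOLL is enabled on top of IOPOLL, the user does not need to block; the SQ thread does the polling
IORING_SETUP_SQPOLL
- The kernel starts an extra kernel thread, called the SQ thread. It can be pinned to a specific core (configured through sq_thread_cpu). The thread polls the SQ continuously and is only suspended after it has found no requests for a while (configured through sq_thread_idle). The SQ thread handles not only I/O submission but also I/O completion events
IORING_SETUP_SINGLE_ISSUER
- Only one thread submits requests
IORING_SETUP_DEFER_TASKRUN
- With asynchronous work the following can happen: after asynchronous task A is submitted, it is added to the task work queue. While the CPU is running some very important task B, task A may be scheduled out of the task work queue and push B aside, which increases B's latency, that is, its execution time. Setting IORING_SETUP_DEFER_TASKRUN in io_uring_setup defers such asynchronous work, for example task A, until the user calls io_uring_enter with IORING_ENTER_GETEVENTS. This keeps the asynchronous work from interrupting other running tasks.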
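Some of these flags constrain each other. The sketch below encodes two rules as I understand them from the v6.6 setup code: DEFER_TASKRUN requires SINGLE_ISSUER, and DEFER_TASKRUN cannot be combined with SQPOLL. The `DEMO_` bit values are illustrative only; real programs should use the IORING_SETUP_* constants from `<linux/io_uring.h>`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag bits, not the kernel's values. */
enum {
    DEMO_SETUP_IOPOLL        = 1u << 0,
    DEMO_SETUP_SQPOLL        = 1u << 1,
    DEMO_SETUP_SINGLE_ISSUER = 1u << 2,
    DEMO_SETUP_DEFER_TASKRUN = 1u << 3,
};
#define DEMO_SETUP_ALL 0xfu

static_assert(DEMO_SETUP_ALL == (DEMO_SETUP_IOPOLL | DEMO_SETUP_SQPOLL |
                                 DEMO_SETUP_SINGLE_ISSUER |
                                 DEMO_SETUP_DEFER_TASKRUN),
              "mask must cover every demo flag");

/* Returns true when the flag combination passes the two rules above. */
bool demo_setup_flags_ok(uint32_t flags)
{
    if (flags & ~DEMO_SETUP_ALL)
        return false;                         /* unknown bits */
    if ((flags & DEMO_SETUP_DEFER_TASKRUN) &&
        !(flags & DEMO_SETUP_SINGLE_ISSUER))
        return false;                         /* needs a single submitter */
    if ((flags & DEMO_SETUP_DEFER_TASKRUN) &&
        (flags & DEMO_SETUP_SQPOLL))
        return false;                         /* SQ thread submits instead */
    return true;
}
```

The second rule follows from the description above: deferred task work runs when the submitting task calls io_uring_enter, which does not fit a mode where a kernel thread does the submitting.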

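Code analysis

The io_uring_setup system call initializes the relevant data structures, sets up the corresponding buffers, and passes the result back through its io_uring_params argument, telling user space key information such as where the ring memory is and where the head and tail pointers live.

The memory to initialize falls into three regions: SQ, CQ and SQEs. The kernel initializes the SQ and the CQ, both of which are rings. In addition, submission goes through a level of indirection: the kernel provides a Submission Queue Entries (SQEs) array. Keeping the SQEs in a separate array makes it easy to submit requests that are not contiguous in memory through the ring buffer. Each SQ entry holds an index into the SQEs array rather than the request itself, and the actual requests live only in the SQEs array (CQ entries, by contrast, are complete io_uring_cqe structures). A batch of requests that are not contiguous in SQEs can therefore be submitted at once.

The logic of io_uring_setup falls into three parts:

Create the context structure io_ring_ctx, which manages the whole session.
Map the SQ and CQ memory regions according to the offsets in io_uring_params->sq_off/cq_off
Error checks, permission checks, resource quota checks and similar validation.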

/*
 * Sets up an aio uring context, and returns the fd. Applications asks for a
 * ring size, we return the actual sq/cq ring sizes (among other things) in the
 * params structure passed in.
 */
static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
{
	struct io_uring_params p;

	// copy params from user space into p and check that each member is valid
	...

	return io_uring_create(entries, &p, params);
}


// io_uring_create is the function that actually performs the setup
static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
				  struct io_uring_params __user *params)
{
	struct io_ring_ctx *ctx;
	struct io_uring_task *tctx;
	struct file *file;
	int ret;

	// validate p->flags
	...

	// set p->sq_entries and p->cq_entries, both must be powers of two
	...

	// create the io_ring_ctx and allocate its memory
	ctx = io_ring_ctx_alloc(p);

	// set the flags of the io_ring_ctx
	...
        

	/*
	 * When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, user
	 * space applications don't need to do io completion events
	 * polling again, they can rely on io_sq_thread to do polling
	 * work, which can reduce cpu usage and uring_lock contention.
	 */
    
    
	
	// allocate memory for io_rings, cqes and sq_array (three adjacent regions) and for the sqes
	// allocate the memory if the application has not provided it, otherwise just map it
	// size = sizeof(io_rings) + p->cq_entries * sizeof(io_uring_cqe) + p->sq_entries * sizeof(u32) + p->sq_entries * sizeof(io_uring_sqe)
	// allocate io_rings and the SQEs
	ret = io_allocate_scq_urings(ctx, p);
	

	// handle the polling modes, including creating the SQPOLL kernel thread
	ret = io_sq_offload_create(ctx, p);

    

	// create the file that corresponds to the io_ring_ctx; the user later reaches the io_ring_ctx through this file
	file = io_uring_get_file(ctx);
}
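The sizing steps can be sketched in plain C. `demo_roundup_pow2` mimics the rounding of the requested entry count to a power of two, and `demo_rings_size` follows the size formula quoted in the comments above for the three adjacent regions. Both names are hypothetical, and the sketch ignores the cache-line alignment and overflow checks a real implementation needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest power of two that is >= n, for 1 <= n <= 2^31. */
uint32_t demo_roundup_pow2(uint32_t n)
{
    assert(n >= 1 && n <= (1u << 31));

    uint32_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Bytes for the header, the CQE array and the SQ index array.
 * `header_size` and `cqe_size` stand in for sizeof(struct io_rings)
 * and sizeof(struct io_uring_cqe). */
size_t demo_rings_size(size_t header_size, size_t cqe_size,
                       uint32_t sq_entries, uint32_t cq_entries)
{
    return header_size
         + (size_t)cq_entries * cqe_size
         + (size_t)sq_entries * sizeof(uint32_t);
}
```

For a request of 100 entries this gives sq_entries = 128. Unless the application asks for a specific CQ size, the CQ is sized at twice the SQ, so cq_entries = 256 in that case.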



// The operations available on the cq/sq are listed below; io_uring_get_file stores this structure in file->f_op
static const struct file_operations io_uring_fops = {
	.release	= io_uring_release,
	.mmap		= io_uring_mmap,
#ifndef CONFIG_MMU
	.get_unmapped_area = io_uring_nommu_get_unmapped_area,
	.mmap_capabilities = io_uring_nommu_mmap_capabilities,
#else
	.get_unmapped_area = io_uring_mmu_get_unmapped_area,
#endif
	.poll		= io_uring_poll,
#ifdef CONFIG_PROC_FS
	.show_fdinfo	= io_uring_show_fdinfo,
#endif
};


// The core of io_uring_mmap is shown below
// ctx is stored in the file; the offset selects which member of ctx is exposed: ctx->rings (which contains the cqes) or the sqes
static void *io_uring_validate_mmap_request(struct file *file,
					    loff_t pgoff, size_t sz)
{
	struct io_ring_ctx *ctx = file->private_data;
	loff_t offset = pgoff << PAGE_SHIFT;
	struct page *page;
	void *ptr;

	/* Don't allow mmap if the ring was setup without it */
	if (ctx->flags & IORING_SETUP_NO_MMAP)
		return ERR_PTR(-EINVAL);

	switch (offset & IORING_OFF_MMAP_MASK) {
	case IORING_OFF_SQ_RING:
	case IORING_OFF_CQ_RING:
		ptr = ctx->rings;
		break;
	case IORING_OFF_SQES:
		ptr = ctx->sq_sqes;
		break;
	case IORING_OFF_PBUF_RING: {
		unsigned int bgid;

		bgid = (offset & ~IORING_OFF_MMAP_MASK) >> IORING_OFF_PBUF_SHIFT;
		mutex_lock(&ctx->uring_lock);
		ptr = io_pbuf_get_address(ctx, bgid);
		mutex_unlock(&ctx->uring_lock);
		if (!ptr)
			return ERR_PTR(-EINVAL);
		break;
		}
	default:
		return ERR_PTR(-EINVAL);
	}

	page = virt_to_head_page(ptr);
	if (sz > page_size(page))
		return ERR_PTR(-EINVAL);

	return ptr;
}

/*
 * Magic offsets (in bytes) for the application to mmap the data it needs. The 'ULL' suffix means unsigned long long. The SQ_RING and CQ_RING windows are each 0x8000000 bytes = 128MB, and the SQES window is 0x70000000 bytes = 1792MB.
 */
#define IORING_OFF_SQ_RING			0ULL
#define IORING_OFF_CQ_RING			0x8000000ULL
#define IORING_OFF_SQES				0x10000000ULL
#define IORING_OFF_PBUF_RING		0x80000000ULL
#define IORING_OFF_PBUF_SHIFT		16
#define IORING_OFF_MMAP_MASK		0xf8000000ULL

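As the figure below shows, the main work of io_uring_setup is done by the following four functions

[Figure: io_uring_setup]

io_ring_ctx_alloc: mainly allocates space and initializes list heads, mutexes, spinlocks and similar structures
io_allocate_scq_urings: initializes the whole struct io_rings *rings, including the SQ/CQ head and tail pointers, the SQEs and the CQEs
- The SQ and CQ head and tail pointers and the CQEs live in struct io_rings *rings
- The SQEs live in struct io_ring_ctx *ctx
io_sq_offload_create: configures how io_uring runs according to the flags the user passed to io_uring_setup
io_uring_get_file: exposes struct io_ring_ctx *ctx to user space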

io_sq_offload_create

__cold int io_sq_offload_create(struct io_ring_ctx *ctx,
				struct io_uring_params *p)
{
	/* Retain compatibility with failing for an invalid attach attempt */
	// check whether the SQ thread is shared with another io_uring
    ...

	if (ctx->flags & IORING_SETUP_SQPOLL) {
		struct task_struct *tsk; // SQ thread
		struct io_sq_data *sqd;

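		// sqd holds the state of the SQ thread
		sqd = io_get_sq_data(p, &attached);

		// fill in the SQ-thread-related fields of ctx
		...
		// check whether the SQ thread must be pinned to a specific CPU
		...

		// create the SQ thread
		tsk = create_io_thread(io_sq_thread, sqd, NUMA_NO_NODE);

		// as usual, start the new thread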
		wake_up_new_task(tsk);
		if (ret)
			goto err;
	}
}

io_uring_enter

Code analysis

// handle submit and completion events
SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
		u32, min_complete, u32, flags, const void __user *, argp,
		size_t, argsz)
{
	struct io_ring_ctx *ctx;
	struct file *file;
	long ret;


	// look up the file from fd, then get the io_ring_ctx from the file
	...
	ctx = file->private_data;
    

	/*
	 * For SQ polling, the thread will do all submissions and completions.
	 * Just return the requested submit count, and wake the thread if
	 * we were asked to.
	 */
	if (ctx->flags & IORING_SETUP_SQPOLL) {
		if (flags & IORING_ENTER_SQ_WAKEUP)
			wake_up(&ctx->sq_data->wait);
		if (flags & IORING_ENTER_SQ_WAIT)
			io_sqpoll_wait_sq(ctx); // current waits until the SQ ring is no longer full

		ret = to_submit;
	} else if (to_submit) {
		ret = io_uring_add_tctx_node(ctx);

		mutex_lock(&ctx->uring_lock);
		ret = io_submit_sqes(ctx, to_submit); // each sqe is eventually submitted through io_queue_sqe

		if (flags & IORING_ENTER_GETEVENTS) { // handle completion events
			if (ctx->syscall_iopoll)
				goto iopoll_locked;
			/*
			 * Ignore errors, we'll soon call io_cqring_wait() and
			 * it should handle ownership problems if any.
			 */
			if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) // the deferred work is not run when it is queued; it is run in a batch on a later io_uring_enter call
				(void)io_run_local_work_locked(ctx, min_complete); // run here, rather than in the GETEVENTS branch further down, because uring_lock is already held; task work is kept local to the io_ring_ctx rather than to the submitting task
		}
		mutex_unlock(&ctx->uring_lock);
	}

	// handle completion events
	if (flags & IORING_ENTER_GETEVENTS) {
		int ret2;

		// the SQ thread is not enabled, but iopoll is
		if (ctx->syscall_iopoll) {
			/*
			 * We disallow the app entering submit/complete with
			 * polling, but we still need to lock the ring to
			 * prevent racing with polled issue that got punted to
			 * a workqueue.
			 */
			mutex_lock(&ctx->uring_lock);
iopoll_locked:
			ret2 = io_validate_ext_arg(flags, argp, argsz);
			if (likely(!ret2)) {
				min_complete = min(min_complete,
						   ctx->cq_entries);
				ret2 = io_iopoll_check(ctx, min_complete);
			}
			mutex_unlock(&ctx->uring_lock);
		} else { // wait on the CQ ring
			const sigset_t __user *sig;
			struct __kernel_timespec __user *ts;

			ret2 = io_get_ext_arg(flags, argp, &argsz, &ts, &sig);
			if (likely(!ret2)) {
				min_complete = min(min_complete,
						   ctx->cq_entries);
				ret2 = io_cqring_wait(ctx, min_complete, sig,
						      argsz, ts);
			}
		}
	}
	...
}

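flags

To understand these flags, read the io_uring_enter function.

The more important flags

Configuring polling

io_uring can roughly be used in four modes: default, IOPOLL, SQPOLL, and IOPOLL+SQPOLL. If completion of I/O requests should be detected by polling, consider enabling IOPOLL; if lower latency and fewer system calls are needed, consider enabling SQPOLL.

Only IORING_SETUP_IOPOLL enabled: requests are submitted and reaped through the io_uring_enter system call
Only IORING_SETUP_SQPOLL enabled: submitting needs no system call, while reaping still goes through io_uring_enter when the application waits for completions. The kernel thread goes to sleep after a period of inactivity and can be woken through io_uring_enter
Both IORING_SETUP_IOPOLL and IORING_SETUP_SQPOLL enabled: the kernel thread polls both the io_uring queue and the device driver queue. In this case the user program no longer needs to call io_uring_enter to trigger device polling; it only has to poll the completion queue from user space.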

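I/O submission

io_uring provides a submission offload mode in which submitting needs no system call at all. Once the program has filled in an SQE in user space and inserted it by advancing the SQ tail, an awake SQ thread picks the submission up immediately, so the program does not have to call io_uring_enter. As noted above, if the SQ thread is asleep (it signals this by setting IORING_SQ_NEED_WAKEUP in the SQ ring flags), it has to be woken by calling io_uring_enter with IORING_ENTER_SQ_WAKEUP.

Once initialization is done, the application can use the queues to add I/O requests, that is, fill in SQEs. After the requests are in the SQ, the application still needs a way to tell the kernel that requests have been produced and are waiting to be consumed; that is what submitting I/O requests means.

Submitting I/O means finding a free SQE, filling it in according to the request, and placing the SQE's index in the SQ. The SQ is a typical ring buffer with head and tail members; head == tail means the queue is empty. After an SQE is filled in, the SQ tail is advanced to insert the request into the ring buffer. When all requests have been added to the SQ, io_uring_enter() is used to submit them.

io_uring provides the io_uring_enter system call to notify the kernel of new I/O requests and to wait for the kernel to complete them. For workloads chasing the highest I/O performance, io_uring also supports polling. There are two kinds: polling on the submission side, enabled with IORING_SETUP_SQPOLL, and polling on the reaping side, enabled with IORING_SETUP_IOPOLL.

Polling for I/O submission

To improve performance, the kernel can poll for submitted I/O requests; the mechanism is enabled by setting the IORING_SETUP_SQPOLL flag during initialization.

With IORING_SETUP_SQPOLL set, the kernel starts an extra kernel thread (once io_uring_setup has run), which we call the SQ thread. It can be pinned to a specific core (configured through sq_thread_cpu). The thread polls the SQ continuously and is only suspended after it has found no requests for a while (configured through sq_thread_idle).

Once the program has filled in an SQE in user space and inserted it by advancing the SQ tail, an awake SQ thread picks the submission up immediately, which saves the program the io_uring_enter system call. If the SQ thread is asleep, it has to be woken by calling io_uring_enter with the IORING_ENTER_SQ_WAKEUP flag. User space can learn the state of the SQ thread from the flags variable of the SQ ring.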

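When submitting I/O, there may be no free SQ entry for a new request. Because IORING_SETUP_SQPOLL is enabled, the application cannot tell whether SQ entries have been consumed, so it does not know when space frees up and can only keep retrying. To handle this case, set the IORING_ENTER_SQ_WAIT flag when calling io_uring_enter: the call then returns only once at least one SQ entry is available.

int io_uring_enter(unsigned int fd, unsigned int to_submit, unsigned int min_complete, unsigned int flags, sigset_t *sig);

If flags contains IORING_ENTER_GETEVENTS, the call waits for min_complete I/Os to complete before returning.

If min_complete is 0, the call returns immediately with whatever I/Os have already completed. If min_complete is non-zero and the ring was set up with IORING_SETUP_IOPOLL, the call busy-waits for min_complete completion events; otherwise it sleeps until an I/O interrupt (or a signal) arrives.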

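I/O reaping

Polling for I/O reaping

With the IORING_ENTER_GETEVENTS flag set, the kernel blocks until at least min_complete events have completed before returning.

patch

The checks on sq_entries and cq_entries in io_uring_create are currently redundant; they could be factored into one function, modelled on the following function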

static int io_validate_ext_arg(unsigned flags, const void __user *argp, size_t argsz)
{
	if (flags & IORING_ENTER_EXT_ARG) {
		struct io_uring_getevents_arg arg;

		if (argsz != sizeof(arg))
			return -EINVAL;
		if (copy_from_user(&arg, argp, sizeof(arg)))
			return -EFAULT;
	}
	return 0;
}

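Only direct I/O is supported

Per-request metadata overhead is high

Not asynchronous enough

Unfriendly API

Each I/O needs at least two system calls (submit and wait-for-completion). Two calls, or two kinds of call?

System calls are expensive

Because one I/O needs at least two system calls

io_poll

Concepts

The kernel can tell the user that an event has completed in two ways

The kernel notifies the user through an interrupt
io_poll, where the user polls for completion; this suits
- Fast devices: going from hard disks to SSDs, for example, events are handled quickly, so polling costs little CPU
- Low I/O request traffic

Current shortcomings

Improvements in io_uring

fastpoll

Advantages of io_uring

Easy to use
- Only three system calls
- Kernel developers maintain a matching user-space library, liburing
General purpose, supporting
- Traditional I/O (buffered I/O and direct I/O)
- epoll-style programming, suited to network I/O
Feature-rich
High performance, with low per-request overhead
- System calls can be avoided

io_uring uses ring buffers

The shared ring buffer design brings the following benefits:

No memory copies between the application and the kernel when submitting and completing requests
With the advanced SQPOLL feature, the application does not need to make system calls
Lock-free operation: synchronization relies on memory ordering, and moving a few head and tail pointers is enough for fast interaction.

Two memory regions are mmap'ed: one (the SQ ring) passes data from user space to the kernel and the other (the CQ ring) passes data from the kernel to user space; one side only reads and the other only writes. The kernel controls the head of the SQ ring and the tail of the CQ ring; the application controls the tail of the SQ ring and the head of the CQ ring.

Why can both the user and the kernel operate on the head and tail of the two rings?

The head and tail of the SQ ring and the CQ ring are kept in io_uring structures stored in kernel memory, and the user can access these two io_uring structures

Important features

IORING_FEAT_FAST_POLL

The user submits the request only once and is not interrupted; when the event becomes ready, the kernel performs the operation on the user's behalf. Is that right???

Today io_uring is not just for asynchronous I/O but an asynchronous programming framework: for any operation that blocks in user space, it is worth asking whether io_uring can make it asynchronous. io_uring already supports dozens of system calls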

Linux kernel version read for these notes

e38afaea62051bf540a624c01e45418d9332b6fe

v6.6


http://example.com/io-uring-实现/

Published on March 26, 2024
