io_uring

Motivation

Having learned that asynchronous I/O is what gives ScyllaDB its blazing performance, I wanted to understand its asynchronous I/O framework, Seastar. As it turns out, Seastar does not use io_uring; but had io_uring existed before Seastar was built, it would most likely have been adopted.

Advantages of asynchronous I/O

Asynchronous I/O does not tie up the CPU while the I/O is in flight, so compute-heavy workloads can make fuller use of the processor: after submitting an asynchronous I/O request, the process can immediately move on to other work instead of blocking. In nginx, for example, file reads can be submitted to the kernel asynchronously; the kernel hands the operation to the I/O device to carry out independently, and the nginx worker keeps the CPU busy with other requests. Moreover, when many read requests pile up in the device queue, the kernel's elevator (I/O scheduling) algorithm can reorder them, reducing the cost of random disk-sector access.

Disadvantages of asynchronous I/O

When is asynchronous I/O a poor fit? If the I/O operation is light, for example reading a small file that completes almost immediately, synchronous I/O is usually the better choice, because the process does not have to be interrupted later by an I/O-completion event.

I/O models

Current I/O models can be classified as shown in the figure below.

all_kins_of_io

  • libaio: the native AIO implemented by the Linux kernel
  • POSIX AIO: the AIO implementation provided by glibc

Drawbacks of AIO

Linux AIO still has quite a few drawbacks. Taking nginx as the example again: nginx only supports AIO for reading files. Ordinary file writes usually return as soon as the data lands in the page cache (buffered writes), which is very fast, and switching them to AIO would noticeably slow writes down. The reasons are listed below (a libaio sketch follows the list):

  • Direct I/O only. With AIO you can only use O_DIRECT, so the request cannot be served from the file-system page cache, and there are alignment constraints (the I/O goes straight to disk, so the amount written must be a multiple of the file-system block size and the buffer must be aligned to the memory page size). This directly limits where AIO can be used.
  • It can still block; the semantics are incomplete. Even if the application explicitly asks for asynchronous I/O, it may in practice still end up blocked. One example is reaping completion events: io_getevents(2) calls read_events to fetch completed AIO events; inside read_events, wait_event_interruptible_hrtimeout waits on aio_read_events, and if the condition does not hold (the events have not completed yet) it calls __wait_event_hrtimeout and goes to sleep (user space can at least cap the maximum wait time).
  • High copy overhead. Each I/O submission copies 64 + 8 bytes and each completion copies 32 bytes, 104 bytes of copying in total. Whether that is acceptable depends on the I/O size: for large I/Os the cost is negligible, but in workloads with many small I/Os the copying adds up.
  • Unfriendly API. Each I/O needs at least two system calls to complete (submit and wait-for-completion), and completion events must be handled very carefully to avoid losing any.
  • High system-call overhead. Precisely because of the previous point, io_submit/io_getevents incur substantial system-call overhead.
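
To make these constraints concrete, here is a minimal libaio sketch (the file path /tmp/data and the 4 KiB block size are made up for the example): note the O_DIRECT flag, the page-aligned buffer, and the io_submit/io_getevents system-call pair per batch.

// Minimal libaio sketch: O_DIRECT + aligned buffer, one submit + one reap.
// Build with: gcc aio_read.c -laio
#define _GNU_SOURCE            /* for O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    int fd = open("/tmp/data", O_RDONLY | O_DIRECT);   /* hypothetical file */
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    /* O_DIRECT: buffer address and size must be aligned (here 4096 bytes). */
    if (posix_memalign(&buf, 4096, 4096)) return 1;

    io_context_t ctx;
    memset(&ctx, 0, sizeof(ctx));
    if (io_setup(1, &ctx) < 0) return 1;               /* set up the AIO context */

    struct iocb cb;
    struct iocb *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, 4096, 0);              /* describe one read */

    if (io_submit(ctx, 1, cbs) != 1) return 1;         /* syscall #1: submit */

    struct io_event ev;
    if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)       /* syscall #2: reap (may sleep) */
        return 1;
    printf("read %ld bytes\n", (long)ev.res);

    io_destroy(ctx);
    free(buf);
    return 0;
}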

The io_uring interface

User-space interface:

The io_uring implementation exposes only three system calls to user space:

  • io_uring_setup: initializes a new io_uring context; the kernel communicates with the application through a memory region shared with user space.

  • io_uring_enter: submits requests and reaps completions.

  • io_uring_register: registers buffers shared between user space and the kernel.

The first two system calls are already enough to use the io_uring interface.
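
As a quick illustration of the submit/reap flow built on top of io_uring_setup and io_uring_enter, here is a minimal sketch using the liburing wrapper; the file path and buffer size are made up for the example.

// Minimal liburing sketch: one read request, submitted and reaped.
// Build with: gcc uring_read.c -luring
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0)           /* io_uring_setup + mmap */
        return 1;

    int fd = open("/tmp/data", O_RDONLY);               /* hypothetical file */
    if (fd < 0) return 1;

    char buf[4096];
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring); /* grab a free SQE slot */
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);   /* fill in the request */

    io_uring_submit(&ring);                             /* io_uring_enter: submit */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                     /* io_uring_enter: wait for completion */
    printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);                      /* advance the CQ head */

    io_uring_queue_exit(&ring);
    return 0;
}

Note that liburing's struct io_uring here is the library's user-space handle, not the kernel-side struct io_uring (the head/tail pair shown later in this post).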

Implementation approach

Solving the "high system-call overhead" problem

For this problem, ask whether every I/O really needs its own system call. If the work of many system calls can be folded into a bounded number of them, the per-I/O system-call cost drops to (amortized) constant time.
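
For instance, continuing the liburing sketch above (and assuming the ring was initialized with at least 8 entries), an application can queue many SQEs and hand them all to the kernel with a single io_uring_enter:

/* Continuing the liburing sketch: queue several reads, submit with one syscall. */
#include <liburing.h>

static void submit_batch(struct io_uring *ring, int fd, char bufs[8][4096])
{
    for (int i = 0; i < 8; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring); /* no syscall: writes into the shared SQ */
        io_uring_prep_read(sqe, fd, bufs[i], 4096, (__u64)i * 4096);
    }
    io_uring_submit(ring);  /* one io_uring_enter submits all eight requests */
}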

Solving the "high copy overhead" problem

The copies on submission and completion exist because communication between the application and the kernel requires copying data across the boundary. To avoid this, io_uring uses shared memory as the communication channel between the application and the kernel.

The best way to achieve zero copy between user space and the kernel is a memory-mapped region shared by both sides: user space writes into it and notifies the kernel to consume the data, or the kernel fills it in and user space consumes it. io_uring uses a pair of shared ring buffers for communication between the application and the kernel.

The shared ring-buffer design brings several benefits:

  • Saves memory copies between the application and the kernel when submitting and completing requests
  • With the SQPOLL advanced feature, the application does not even need to issue system calls; a kernel thread handles submissions automatically
  • Lock-free operation: synchronization is achieved with memory ordering, and the two sides interact quickly simply by advancing a few head and tail pointers.

One ring carries data from the application to the kernel, the other from the kernel to the application; on each ring, one side only writes and the other side only reads.

  • On the submission queue (SQ), the application is the producer of I/O submissions and the kernel is the consumer: the application places requests into the SQEs array and advances the SQ tail.
  • On the completion queue (CQ), the kernel is the producer of completion events and the application is the consumer: the kernel marks completed I/O requests, places them in the CQ, and advances the CQ tail.

The kernel controls the SQ ring's head and the CQ ring's tail; the application controls the SQ ring's tail and the CQ ring's head. For simplicity, the SQEs array is omitted from the figure below; a submission sketch follows the figure.

io_uring_ring_buffer
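
To show what "advancing the tail with memory ordering" looks like in practice, here is a sketch of submitting one entry against the raw shared rings. It assumes the pointers sqes, sq_array, sq_tail and sq_ring_mask were obtained by mmap()ing the regions returned by io_uring_setup (shown in a later sketch), and that fd, buf and ring_fd are set up by the caller; SQ-full checking is omitted.

/* Raw submission path (sketch): fill an SQE, publish its index, bump the tail. */
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static void submit_one_read(int ring_fd, int fd, void *buf,
                            struct io_uring_sqe *sqes, unsigned *sq_array,
                            unsigned *sq_tail, unsigned *sq_ring_mask)
{
    unsigned tail = *sq_tail;              /* only the application writes the SQ tail */
    unsigned idx  = tail & *sq_ring_mask;  /* mask the free-running counter into an index */

    struct io_uring_sqe *sqe = &sqes[idx];
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_READ;          /* describe the request */
    sqe->fd     = fd;
    sqe->addr   = (unsigned long)buf;
    sqe->len    = 4096;
    sqe->off    = 0;

    sq_array[idx] = idx;                   /* the SQ ring holds indices into the SQEs array */

    /* Release store: the kernel must observe the filled SQE and index
     * before it observes the new tail value. */
    __atomic_store_n(sq_tail, tail + 1, __ATOMIC_RELEASE);

    /* Tell the kernel there is work to consume (with SQPOLL this call can
     * be skipped while the SQ kernel thread is awake). */
    syscall(__NR_io_uring_enter, ring_fd, 1, 0, 0, NULL, 0);
}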

Data structures

io_uring_sq_cq_sqes

Between the SQ and the CQ sits an array called the SQEs array. Each slot in the SQ ring stores an index into the SQEs array rather than the request itself. Because the kernel may complete requests in any order, the free slots in the SQEs array become scattered, so new requests may have to be inserted at non-contiguous positions; the indirection array lets the ring buffer submit requests that are not contiguous in memory.

Furthermore, since the memory regions described above are allocated by the kernel, the user program cannot reach them directly. At initialization time, io_uring_setup returns an fd for these regions, and the application accesses them through that fd. The parameters returned alongside describe where the three regions live within the shared memory, so that the user-space program can locate them.
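
The following sketch shows that mapping step with the raw system calls: io_uring_setup fills a struct io_uring_params, whose sq_off/cq_off members give the offsets of the ring fields, and mmap with the fixed offsets IORING_OFF_SQ_RING, IORING_OFF_CQ_RING and IORING_OFF_SQES maps the three regions (error handling omitted).

/* Sketch: map the SQ ring, CQ ring and SQEs array set up by io_uring_setup. */
#include <linux/io_uring.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static void map_rings(void)
{
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));
    int ring_fd = syscall(__NR_io_uring_setup, 8, &p);   /* kernel allocates the rings */

    /* SQ ring mapping: head/tail/mask/flags plus the index array (sq_array). */
    size_t sq_sz = p.sq_off.array + p.sq_entries * sizeof(__u32);
    char *sq_ptr = mmap(NULL, sq_sz, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_POPULATE, ring_fd, IORING_OFF_SQ_RING);
    unsigned *sq_tail      = (unsigned *)(sq_ptr + p.sq_off.tail);
    unsigned *sq_ring_mask = (unsigned *)(sq_ptr + p.sq_off.ring_mask);
    unsigned *sq_array     = (unsigned *)(sq_ptr + p.sq_off.array);

    /* CQ ring mapping: head/tail/mask/overflow plus the CQE array. */
    size_t cq_sz = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);
    char *cq_ptr = mmap(NULL, cq_sz, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_POPULATE, ring_fd, IORING_OFF_CQ_RING);
    unsigned *cq_head         = (unsigned *)(cq_ptr + p.cq_off.head);
    struct io_uring_cqe *cqes = (struct io_uring_cqe *)(cq_ptr + p.cq_off.cqes);

    /* SQEs array: the actual submission entries, mapped separately. */
    struct io_uring_sqe *sqes =
        mmap(NULL, p.sq_entries * sizeof(struct io_uring_sqe),
             PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
             ring_fd, IORING_OFF_SQES);

    (void)sq_tail; (void)sq_ring_mask; (void)sq_array;   /* silence unused warnings; */
    (void)cq_head; (void)cqes; (void)sqes;               /* used by the submission sketch above */
}

When the kernel reports IORING_FEAT_SINGLE_MMAP in p.features, the SQ and CQ rings can also be covered by a single mapping; mapping them separately as above still works.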

Data structure definitions

The data structures are defined as follows (from the kernel sources):

// SQ/CQ ring head/tail pair, embedded in struct io_rings below
struct io_uring {
    u32 head;
    u32 tail;
};

/*
* This data is shared(from kernel) with the application through the mmap at offsets
* IORING_OFF_SQ_RING and IORING_OFF_CQ_RING.
*
* The offsets to the member fields are published through struct
* io_sqring_offsets when calling io_uring_setup.
*/
struct io_rings {
/*
* Head and tail offsets into the ring; the offsets need to be
* masked to get valid indices.
*
* The kernel controls head of the sq ring and the tail of the cq ring,
* and the application controls tail of the sq ring and the head of the
* cq ring.
*/
struct io_uring sq, cq;
/*
* Bitmasks to apply to head and tail offsets (constant, equals
* ring_entries - 1, i.e., sq_ring_mask = sq_ring_entries - 1, cq_ring_mask = cq_ring_entries - 1)
*/
u32 sq_ring_mask, cq_ring_mask;
/* Ring sizes (constant, power of 2) */
u32 sq_ring_entries, cq_ring_entries;
/*
* Number of invalid entries dropped by the kernel due to
* invalid index stored in array
*
* Written by the kernel, shouldn't be modified by the
* application (i.e. get number of "new events" by comparing to
* cached value).
*
* After a new SQ head value was read by the application this
* counter includes all submissions that were dropped reaching
* the new SQ head (and possibly more).
*/
u32 sq_dropped;
/*
* Runtime SQ flags
*
* Written by the kernel, shouldn't be modified by the
* application.
*
* The application needs a full memory barrier before checking
* for IORING_SQ_NEED_WAKEUP after updating the sq tail.
*/
atomic_t sq_flags;
/*
* Runtime CQ flags
*
* Written by the application, shouldn't be modified by the
* kernel.
*/
u32 cq_flags;
/*
* Number of completion events lost because the queue was full;
* this should be avoided by the application by making sure
* there are not more requests pending than there is space in
* the completion queue.
*
* Written by the kernel, shouldn't be modified by the
* application (i.e. get number of "new events" by comparing to
* cached value).
*
* As completion events come in out of order this counter is not
* ordered with any other data.
*/
u32 cq_overflow;
/*
* Ring buffer of completion events.
*
* The kernel writes completion events fresh every time they are
* produced, so the application is allowed to modify pending
* entries.
*/
struct io_uring_cqe cqes[] ____cacheline_aligned_in_smp;
};

struct io_ring_ctx {
/* const or read-mostly hot data */
struct {
unsigned int flags;
unsigned int drain_next: 1;
unsigned int restricted: 1;
unsigned int off_timeout_used: 1;
unsigned int drain_active: 1;
unsigned int has_evfd: 1;
/* all CQEs should be posted only by the submitter task */
unsigned int task_complete: 1;
unsigned int lockless_cq: 1;
unsigned int syscall_iopoll: 1;
unsigned int poll_activated: 1;
unsigned int drain_disabled: 1;
unsigned int compat: 1;

struct task_struct *submitter_task;
struct io_rings *rings;
struct percpu_ref refs;

enum task_work_notify_mode notify_method;
} ____cacheline_aligned_in_smp;

/* submission data */
struct {
struct mutex uring_lock;

/*
* Ring buffer of indices into array of io_uring_sqe, which is
* mmapped by the application using the IORING_OFF_SQES offset.
*
* This indirection could e.g. be used to assign fixed
* io_uring_sqe entries to operations and only submit them to
* the queue when needed.
*
* The kernel modifies neither the indices array nor the entries
* array.
*/
u32 *sq_array; // indices array
struct io_uring_sqe *sq_sqes;
unsigned cached_sq_head;
unsigned sq_entries;

/*
* Fixed resources fast path, should be accessed only under
* uring_lock, and updated through io_uring_register(2)
*/
struct io_rsrc_node *rsrc_node;
atomic_t cancel_seq;
struct io_file_table file_table;
unsigned nr_user_files;
unsigned nr_user_bufs;
struct io_mapped_ubuf **user_bufs;

struct io_submit_state submit_state;

struct io_buffer_list *io_bl;
struct xarray io_bl_xa;

struct io_hash_table cancel_table_locked;
struct io_alloc_cache apoll_cache;
struct io_alloc_cache netmsg_cache;

/*
* ->iopoll_list is protected by the ctx->uring_lock for
* io_uring instances that don't use IORING_SETUP_SQPOLL.
* For SQPOLL, only the single threaded io_sq_thread() will
* manipulate the list, hence no extra locking is needed there.
*/
struct io_wq_work_list iopoll_list;
bool poll_multi_queue;
} ____cacheline_aligned_in_smp;

struct {
/*
* We cache a range of free CQEs we can use, once exhausted it
* should go through a slower range setup, see __io_get_cqe()
*/
struct io_uring_cqe *cqe_cached;
struct io_uring_cqe *cqe_sentinel;

unsigned cached_cq_tail;
unsigned cq_entries;
struct io_ev_fd __rcu *io_ev_fd;
unsigned cq_extra;
} ____cacheline_aligned_in_smp;

/*
* task_work and async notification delivery cacheline. Expected to
* regularly bounce b/w CPUs.
*/
struct {
struct llist_head work_llist;
unsigned long check_cq;
atomic_t cq_wait_nr;
atomic_t cq_timeouts;
struct wait_queue_head cq_wait;
} ____cacheline_aligned_in_smp;

/* timeouts */
struct {
spinlock_t timeout_lock;
struct list_head timeout_list;
struct list_head ltimeout_list;
unsigned cq_last_tm_flush;
} ____cacheline_aligned_in_smp;

struct io_uring_cqe completion_cqes[16];

spinlock_t completion_lock;

/* IRQ completion list, under ->completion_lock */
struct io_wq_work_list locked_free_list;
unsigned int locked_free_nr;

struct list_head io_buffers_comp;
struct list_head cq_overflow_list;
struct io_hash_table cancel_table;

const struct cred *sq_creds; /* cred used for __io_sq_thread() */
struct io_sq_data *sq_data; /* if using sq thread polling */

struct wait_queue_head sqo_sq_wait;
struct list_head sqd_list;

unsigned int file_alloc_start;
unsigned int file_alloc_end;

struct xarray personalities;
u32 pers_next;

struct list_head io_buffers_cache;

/* Keep this last, we don't need it for the fast path */
struct wait_queue_head poll_wq;
struct io_restriction restrictions;

/* slow path rsrc auxilary data, used by update/register */
struct io_mapped_ubuf *dummy_ubuf;
struct io_rsrc_data *file_data;
struct io_rsrc_data *buf_data;

/* protected by ->uring_lock */
struct list_head rsrc_ref_list;
struct io_alloc_cache rsrc_node_cache;
struct wait_queue_head rsrc_quiesce_wq;
unsigned rsrc_quiesce;

struct list_head io_buffers_pages;

#if defined(CONFIG_UNIX)
struct socket *ring_sock;
#endif
/* hashed buffered write serialization */
struct io_wq_hash *hash_map;

/* Only used for accounting purposes */
struct user_struct *user;
struct mm_struct *mm_account;

/* ctx exit and cancelation */
struct llist_head fallback_llist;
struct delayed_work fallback_work;
struct work_struct exit_work;
struct list_head tctx_list;
struct completion ref_comp;

/* io-wq management, e.g. thread count */
u32 iowq_limits[2];
bool iowq_limits_set;

struct callback_head poll_wq_task_work;
struct list_head defer_list;
unsigned sq_thread_idle;
/* protected by ->completion_lock */
unsigned evfd_last_cq_tail;

/*
* If IORING_SETUP_NO_MMAP is used, then the below holds
* the gup'ed pages for the two rings, and the sqes.
*/
unsigned short n_ring_pages;
unsigned short n_sqe_pages;
struct page **ring_pages;
struct page **sqe_pages;
};

Visualizing the data structures

Data layout of io_rings

  • the remaining fields of struct io_rings
  • struct io_uring_cqe cqes[]
  • u32 *sq_array // an array of u32, the index array pointing into sq_sqes

As the figure below shows, the cqes array (io_uring_cqe) and sq_array are adjacent in memory.

io_rings

The io_uring_sqe array (the actual SQEs) is not contiguous with io_rings, as shown below.

io_uring_sqe

User space only receives io_uring_params, which carries io_sqring_offsets and io_cqring_offsets; with these two structures, user space can reach sq_array and cqes inside io_rings, as shown below.

io_rings_complete_structure

As shown above, user space obtains io_rings via mmap, and since it can also read io_sqring_offsets and io_cqring_offsets from io_uring_params, it can reach every member of io_rings.

io_uring_with_user

Application access

In short, anywhere in the code, holding io_ring_ctx (on the kernel side) and io_uring_params (on the user side) is enough to locate where any of this data lives.

