io_uring

Motivation

Having learned that asynchronous I/O is what gives ScyllaDB its blazing performance, I wanted to understand its asynchronous I/O framework, Seastar. As it turns out, Seastar does not use io_uring; but had io_uring existed before Seastar was built, it would most likely have been adopted.

Advantages of asynchronous I/O

Asynchronous I/O does not tie up the CPU while the I/O is in flight, so compute-heavy workloads can make fuller use of the processor: after submitting an asynchronous I/O request, the process can immediately move on to other work instead of blocking. In nginx, for example, file reads can be submitted to the kernel asynchronously; the kernel hands the operation to the I/O device to carry out independently, and the nginx worker keeps the CPU busy with other requests. Moreover, when many read requests pile up in the device queue, the kernel's elevator (I/O scheduling) algorithm can reorder them, reducing the cost of random disk-sector access.

Disadvantages of asynchronous I/O

When is asynchronous I/O a poor fit? If the I/O operation is light, for example reading a small file that completes almost immediately, synchronous I/O is usually the better choice, because the process does not have to be interrupted later by an I/O-completion event.

I/O models

Current I/O models can be classified as shown in the figure below.

all_kins_of_io

  • libaio: the native AIO implemented by the Linux kernel
  • POSIX AIO: the AIO implementation provided by glibc

Drawbacks of AIO

Linux AIO still has quite a few drawbacks. Taking nginx as the example again: nginx only supports AIO for reading files. Ordinary file writes usually return as soon as the data lands in the page cache (buffered writes), which is very fast, and switching them to AIO would noticeably slow writes down. The reasons are listed below (a libaio sketch follows the list):

  • Direct I/O only. With AIO you can only use O_DIRECT, so the request cannot be served from the file-system page cache, and there are alignment constraints (the I/O goes straight to disk, so the amount written must be a multiple of the file-system block size and the buffer must be aligned to the memory page size). This directly limits where AIO can be used.
  • It can still block; the semantics are incomplete. Even if the application explicitly asks for asynchronous I/O, it may in practice still end up blocked. One example is reaping completion events: io_getevents(2) calls read_events to fetch completed AIO events; inside read_events, wait_event_interruptible_hrtimeout waits on aio_read_events, and if the condition does not hold (the events have not completed yet) it calls __wait_event_hrtimeout and goes to sleep (user space can at least cap the maximum wait time).
  • High copy overhead. Each I/O submission copies 64 + 8 bytes and each completion copies 32 bytes, 104 bytes of copying in total. Whether that is acceptable depends on the I/O size: for large I/Os the cost is negligible, but in workloads with many small I/Os the copying adds up.
  • Unfriendly API. Each I/O needs at least two system calls to complete (submit and wait-for-completion), and completion events must be handled very carefully to avoid losing any.
  • High system-call overhead. Precisely because of the previous point, io_submit/io_getevents incur substantial system-call overhead.
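
To make these constraints concrete, here is a minimal libaio sketch (the file path /tmp/data and the 4 KiB block size are made up for the example): note the O_DIRECT flag, the page-aligned buffer, and the io_submit/io_getevents system-call pair per batch.

// Minimal libaio sketch: O_DIRECT + aligned buffer, one submit + one reap.
// Build with: gcc aio_read.c -laio
#define _GNU_SOURCE            /* for O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    int fd = open("/tmp/data", O_RDONLY | O_DIRECT);   /* hypothetical file */
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    /* O_DIRECT: buffer address and size must be aligned (here 4096 bytes). */
    if (posix_memalign(&buf, 4096, 4096)) return 1;

    io_context_t ctx;
    memset(&ctx, 0, sizeof(ctx));
    if (io_setup(1, &ctx) < 0) return 1;               /* set up the AIO context */

    struct iocb cb;
    struct iocb *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, 4096, 0);              /* describe one read */

    if (io_submit(ctx, 1, cbs) != 1) return 1;         /* syscall #1: submit */

    struct io_event ev;
    if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)       /* syscall #2: reap (may sleep) */
        return 1;
    printf("read %ld bytes\n", (long)ev.res);

    io_destroy(ctx);
    free(buf);
    return 0;
}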

The io_uring interface

User-space interface:

The io_uring implementation exposes only three system calls to user space:

  • io_uring_setup: initializes a new io_uring context; the kernel communicates with the application through a memory region shared with user space.

  • io_uring_enter: submits requests and reaps completions.

  • io_uring_register: registers buffers shared between user space and the kernel.

The first two system calls are already enough to use the io_uring interface.
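
As a quick illustration of the submit/reap flow built on top of io_uring_setup and io_uring_enter, here is a minimal sketch using the liburing wrapper; the file path and buffer size are made up for the example.

// Minimal liburing sketch: one read request, submitted and reaped.
// Build with: gcc uring_read.c -luring
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0)           /* io_uring_setup + mmap */
        return 1;

    int fd = open("/tmp/data", O_RDONLY);               /* hypothetical file */
    if (fd < 0) return 1;

    char buf[4096];
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring); /* grab a free SQE slot */
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);   /* fill in the request */

    io_uring_submit(&ring);                             /* io_uring_enter: submit */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                     /* io_uring_enter: wait for completion */
    printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);                      /* advance the CQ head */

    io_uring_queue_exit(&ring);
    return 0;
}

Note that liburing's struct io_uring here is the library's user-space handle, not the kernel-side struct io_uring (the head/tail pair shown later in this post).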

Implementation approach

Solving the "high system-call overhead" problem

For this problem, ask whether every I/O really needs its own system call. If the work of many system calls can be folded into a bounded number of them, the per-I/O system-call cost drops to (amortized) constant time.
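
For instance, continuing the liburing sketch above (and assuming the ring was initialized with at least 8 entries), an application can queue many SQEs and hand them all to the kernel with a single io_uring_enter:

/* Continuing the liburing sketch: queue several reads, submit with one syscall. */
#include <liburing.h>

static void submit_batch(struct io_uring *ring, int fd, char bufs[8][4096])
{
    for (int i = 0; i < 8; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring); /* no syscall: writes into the shared SQ */
        io_uring_prep_read(sqe, fd, bufs[i], 4096, (__u64)i * 4096);
    }
    io_uring_submit(ring);  /* one io_uring_enter submits all eight requests */
}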

Solving the "high copy overhead" problem

The copies on submission and completion exist because communication between the application and the kernel requires copying data across the boundary. To avoid this, io_uring uses shared memory as the communication channel between the application and the kernel.

The best way to achieve zero copy between user space and the kernel is a memory-mapped region shared by both sides: user space writes into it and notifies the kernel to consume the data, or the kernel fills it in and user space consumes it. io_uring uses a pair of shared ring buffers for communication between the application and the kernel.

The shared ring-buffer design brings several benefits:

  • Saves memory copies between the application and the kernel when submitting and completing requests
  • With the SQPOLL advanced feature, the application does not even need to issue system calls; a kernel thread handles submissions automatically
  • Lock-free operation: synchronization is achieved with memory ordering, and the two sides interact quickly simply by advancing a few head and tail pointers.

One ring carries data from the application to the kernel, the other from the kernel to the application; on each ring, one side only writes and the other side only reads.

  • On the submission queue (SQ), the application is the producer of I/O submissions and the kernel is the consumer: the application places requests into the SQEs array and advances the SQ tail.
  • On the completion queue (CQ), the kernel is the producer of completion events and the application is the consumer: the kernel marks completed I/O requests, places them in the CQ, and advances the CQ tail.

The kernel controls the SQ ring's head and the CQ ring's tail; the application controls the SQ ring's tail and the CQ ring's head. For simplicity, the SQEs array is omitted from the figure below; a submission sketch follows the figure.

io_uring_ring_buffer
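
To show what "advancing the tail with memory ordering" looks like in practice, here is a sketch of submitting one entry against the raw shared rings. It assumes the pointers sqes, sq_array, sq_tail and sq_ring_mask were obtained by mmap()ing the regions returned by io_uring_setup (shown in a later sketch), and that fd, buf and ring_fd are set up by the caller; SQ-full checking is omitted.

/* Raw submission path (sketch): fill an SQE, publish its index, bump the tail. */
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static void submit_one_read(int ring_fd, int fd, void *buf,
                            struct io_uring_sqe *sqes, unsigned *sq_array,
                            unsigned *sq_tail, unsigned *sq_ring_mask)
{
    unsigned tail = *sq_tail;              /* only the application writes the SQ tail */
    unsigned idx  = tail & *sq_ring_mask;  /* mask the free-running counter into an index */

    struct io_uring_sqe *sqe = &sqes[idx];
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_READ;          /* describe the request */
    sqe->fd     = fd;
    sqe->addr   = (unsigned long)buf;
    sqe->len    = 4096;
    sqe->off    = 0;

    sq_array[idx] = idx;                   /* the SQ ring holds indices into the SQEs array */

    /* Release store: the kernel must observe the filled SQE and index
     * before it observes the new tail value. */
    __atomic_store_n(sq_tail, tail + 1, __ATOMIC_RELEASE);

    /* Tell the kernel there is work to consume (with SQPOLL this call can
     * be skipped while the SQ kernel thread is awake). */
    syscall(__NR_io_uring_enter, ring_fd, 1, 0, 0, NULL, 0);
}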

Data structures

io_uring_sq_cq_sqes

Between the SQ and the CQ sits an array called the SQEs array. Each slot in the SQ ring stores an index into the SQEs array rather than the request itself. Because the kernel may complete requests in any order, the free slots in the SQEs array become scattered, so new requests may have to be inserted at non-contiguous positions; the indirection array lets the ring buffer submit requests that are not contiguous in memory.

Furthermore, since the memory regions described above are allocated by the kernel, the user program cannot reach them directly. At initialization time, io_uring_setup returns an fd for these regions, and the application accesses them through that fd. The parameters returned alongside describe where the three regions live within the shared memory, so that the user-space program can locate them.
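
The following sketch shows that mapping step with the raw system calls: io_uring_setup fills a struct io_uring_params, whose sq_off/cq_off members give the offsets of the ring fields, and mmap with the fixed offsets IORING_OFF_SQ_RING, IORING_OFF_CQ_RING and IORING_OFF_SQES maps the three regions (error handling omitted).

/* Sketch: map the SQ ring, CQ ring and SQEs array set up by io_uring_setup. */
#include <linux/io_uring.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static void map_rings(void)
{
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));
    int ring_fd = syscall(__NR_io_uring_setup, 8, &p);   /* kernel allocates the rings */

    /* SQ ring mapping: head/tail/mask/flags plus the index array (sq_array). */
    size_t sq_sz = p.sq_off.array + p.sq_entries * sizeof(__u32);
    char *sq_ptr = mmap(NULL, sq_sz, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_POPULATE, ring_fd, IORING_OFF_SQ_RING);
    unsigned *sq_tail      = (unsigned *)(sq_ptr + p.sq_off.tail);
    unsigned *sq_ring_mask = (unsigned *)(sq_ptr + p.sq_off.ring_mask);
    unsigned *sq_array     = (unsigned *)(sq_ptr + p.sq_off.array);

    /* CQ ring mapping: head/tail/mask/overflow plus the CQE array. */
    size_t cq_sz = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);
    char *cq_ptr = mmap(NULL, cq_sz, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_POPULATE, ring_fd, IORING_OFF_CQ_RING);
    unsigned *cq_head         = (unsigned *)(cq_ptr + p.cq_off.head);
    struct io_uring_cqe *cqes = (struct io_uring_cqe *)(cq_ptr + p.cq_off.cqes);

    /* SQEs array: the actual submission entries, mapped separately. */
    struct io_uring_sqe *sqes =
        mmap(NULL, p.sq_entries * sizeof(struct io_uring_sqe),
             PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
             ring_fd, IORING_OFF_SQES);

    (void)sq_tail; (void)sq_ring_mask; (void)sq_array;   /* silence unused warnings; */
    (void)cq_head; (void)cqes; (void)sqes;               /* used by the submission sketch above */
}

When the kernel reports IORING_FEAT_SINGLE_MMAP in p.features, the SQ and CQ rings can also be covered by a single mapping; mapping them separately as above still works.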

Data structure definitions

The data structures are defined as follows (from the kernel sources):

// SQ/CQ ring head/tail pair, embedded in struct io_rings below
struct io_uring {
    u32 head;
    u32 tail;
};

/*
* This data is shared(from kernel) with the application through the mmap at offsets
* IORING_OFF_SQ_RING and IORING_OFF_CQ_RING.
*
* The offsets to the member fields are published through struct
* io_sqring_offsets when calling io_uring_setup.
*/
struct io_rings {
/*
* Head and tail offsets into the ring; the offsets need to be
* masked to get valid indices.
*
* The kernel controls head of the sq ring and the tail of the cq ring,
* and the application controls tail of the sq ring and the head of the
* cq ring.
*/
struct io_uring sq, cq;
/*
* Bitmasks to apply to head and tail offsets (constant, equals
* ring_entries - 1, i.e., sq_ring_mask = sq_ring_entries - 1, cq_ring_mask = cq_ring_entries - 1)
*/
u32 sq_ring_mask, cq_ring_mask;
/* Ring sizes (constant, power of 2) */
u32 sq_ring_entries, cq_ring_entries;
/*
* Number of invalid entries dropped by the kernel due to
* invalid index stored in array
*
* Written by the kernel, shouldn't be modified by the
* application (i.e. get number of "new events" by comparing to
* cached value).
*
* After a new SQ head value was read by the application this
* counter includes all submissions that were dropped reaching
* the new SQ head (and possibly more).
*/
u32 sq_dropped;
/*
* Runtime SQ flags
*
* Written by the kernel, shouldn't be modified by the
* application.
*
* The application needs a full memory barrier before checking
* for IORING_SQ_NEED_WAKEUP after updating the sq tail.
*/
atomic_t sq_flags;
/*
* Runtime CQ flags
*
* Written by the application, shouldn't be modified by the
* kernel.
*/
u32 cq_flags;
/*
* Number of completion events lost because the queue was full;
* this should be avoided by the application by making sure
* there are not more requests pending than there is space in
* the completion queue.
*
* Written by the kernel, shouldn't be modified by the
* application (i.e. get number of "new events" by comparing to
* cached value).
*
* As completion events come in out of order this counter is not
* ordered with any other data.
*/
u32 cq_overflow;
/*
* Ring buffer of completion events.
*
* The kernel writes completion events fresh every time they are
* produced, so the application is allowed to modify pending
* entries.
*/
struct io_uring_cqe cqes[] ____cacheline_aligned_in_smp;
};

struct io_ring_ctx {
/* const or read-mostly hot data */
struct {
unsigned int flags;
unsigned int drain_next: 1;
unsigned int restricted: 1;
unsigned int off_timeout_used: 1;
unsigned int drain_active: 1;
unsigned int has_evfd: 1;
/* all CQEs should be posted only by the submitter task */
unsigned int task_complete: 1;
unsigned int lockless_cq: 1;
unsigned int syscall_iopoll: 1;
unsigned int poll_activated: 1;
unsigned int drain_disabled: 1;
unsigned int compat: 1;

struct task_struct *submitter_task;
struct io_rings *rings;
struct percpu_ref refs;

enum task_work_notify_mode notify_method;
} ____cacheline_aligned_in_smp;

/* submission data */
struct {
struct mutex uring_lock;

/*
* Ring buffer of indices into array of io_uring_sqe, which is
* mmapped by the application using the IORING_OFF_SQES offset.
*
* This indirection could e.g. be used to assign fixed
* io_uring_sqe entries to operations and only submit them to
* the queue when needed.
*
* The kernel modifies neither the indices array nor the entries
* array.
*/
u32 *sq_array; // indices array
struct io_uring_sqe *sq_sqes;
unsigned cached_sq_head;
unsigned sq_entries;

/*
* Fixed resources fast path, should be accessed only under
* uring_lock, and updated through io_uring_register(2)
*/
struct io_rsrc_node *rsrc_node;
atomic_t cancel_seq;
struct io_file_table file_table;
unsigned nr_user_files;
unsigned nr_user_bufs;
struct io_mapped_ubuf **user_bufs;

struct io_submit_state submit_state;

struct io_buffer_list *io_bl;
struct xarray io_bl_xa;

struct io_hash_table cancel_table_locked;
struct io_alloc_cache apoll_cache;
struct io_alloc_cache netmsg_cache;

/*
* ->iopoll_list is protected by the ctx->uring_lock for
* io_uring instances that don't use IORING_SETUP_SQPOLL.
* For SQPOLL, only the single threaded io_sq_thread() will
* manipulate the list, hence no extra locking is needed there.
*/
struct io_wq_work_list iopoll_list;
bool poll_multi_queue;
} ____cacheline_aligned_in_smp;

struct {
/*
* We cache a range of free CQEs we can use, once exhausted it
* should go through a slower range setup, see __io_get_cqe()
*/
struct io_uring_cqe *cqe_cached;
struct io_uring_cqe *cqe_sentinel;

unsigned cached_cq_tail;
unsigned cq_entries;
struct io_ev_fd __rcu *io_ev_fd;
unsigned cq_extra;
} ____cacheline_aligned_in_smp;

/*
* task_work and async notification delivery cacheline. Expected to
* regularly bounce b/w CPUs.
*/
struct {
struct llist_head work_llist;
unsigned long check_cq;
atomic_t cq_wait_nr;
atomic_t cq_timeouts;
struct wait_queue_head cq_wait;
} ____cacheline_aligned_in_smp;

/* timeouts */
struct {
spinlock_t timeout_lock;
struct list_head timeout_list;
struct list_head ltimeout_list;
unsigned cq_last_tm_flush;
} ____cacheline_aligned_in_smp;

struct io_uring_cqe completion_cqes[16];

spinlock_t completion_lock;

/* IRQ completion list, under ->completion_lock */
struct io_wq_work_list locked_free_list;
unsigned int locked_free_nr;

struct list_head io_buffers_comp;
struct list_head cq_overflow_list;
struct io_hash_table cancel_table;

const struct cred *sq_creds; /* cred used for __io_sq_thread() */
struct io_sq_data *sq_data; /* if using sq thread polling */

struct wait_queue_head sqo_sq_wait;
struct list_head sqd_list;

unsigned int file_alloc_start;
unsigned int file_alloc_end;

struct xarray personalities;
u32 pers_next;

struct list_head io_buffers_cache;

/* Keep this last, we don't need it for the fast path */
struct wait_queue_head poll_wq;
struct io_restriction restrictions;

/* slow path rsrc auxilary data, used by update/register */
struct io_mapped_ubuf *dummy_ubuf;
struct io_rsrc_data *file_data;
struct io_rsrc_data *buf_data;

/* protected by ->uring_lock */
struct list_head rsrc_ref_list;
struct io_alloc_cache rsrc_node_cache;
struct wait_queue_head rsrc_quiesce_wq;
unsigned rsrc_quiesce;

struct list_head io_buffers_pages;

#if defined(CONFIG_UNIX)
struct socket *ring_sock;
#endif
/* hashed buffered write serialization */
struct io_wq_hash *hash_map;

/* Only used for accounting purposes */
struct user_struct *user;
struct mm_struct *mm_account;

/* ctx exit and cancelation */
struct llist_head fallback_llist;
struct delayed_work fallback_work;
struct work_struct exit_work;
struct list_head tctx_list;
struct completion ref_comp;

/* io-wq management, e.g. thread count */
u32 iowq_limits[2];
bool iowq_limits_set;

struct callback_head poll_wq_task_work;
struct list_head defer_list;
unsigned sq_thread_idle;
/* protected by ->completion_lock */
unsigned evfd_last_cq_tail;

/*
* If IORING_SETUP_NO_MMAP is used, then the below holds
* the gup'ed pages for the two rings, and the sqes.
*/
unsigned short n_ring_pages;
unsigned short n_sqe_pages;
struct page **ring_pages;
struct page **sqe_pages;
};

Visualizing the data structures

Data layout of io_rings

  • the remaining fields of struct io_rings
  • struct io_uring_cqe cqes[]
  • u32 *sq_array // an array of u32, the index array pointing into sq_sqes

As the figure below shows, the cqes array (io_uring_cqe) and sq_array are adjacent in memory.

io_rings

The io_uring_sqe array (the actual SQEs) is not contiguous with io_rings, as shown below.

io_uring_sqe

User space only receives io_uring_params, which carries io_sqring_offsets and io_cqring_offsets; with these two structures, user space can reach sq_array and cqes inside io_rings, as shown below.

io_rings_complete_structure

As shown above, user space obtains io_rings via mmap, and since it can also read io_sqring_offsets and io_cqring_offsets from io_uring_params, it can reach every member of io_rings.

io_uring_with_user

Application access

In short, anywhere in the code, holding io_ring_ctx (on the kernel side) and io_uring_params (on the user side) is enough to locate where any of this data lives.

