|  | 1 | .. _userfaultfd: | 
|  | 2 |  | 
|  | 3 | =========== | 
|  | 4 | Userfaultfd | 
|  | 5 | =========== | 
|  | 6 |  | 
|  | 7 | Objective | 
|  | 8 | ========= | 
|  | 9 |  | 
|  | 10 | Userfaults allow the implementation of on-demand paging from userland | 
|  | 11 | and more generally they allow userland to take control of various | 
|  | 12 | memory page faults, something otherwise only the kernel code could do. | 
|  | 13 |  | 
|  | 14 | For example, userfaults allow a proper and more optimal implementation | 
|  | 15 | of the PROT_NONE+SIGSEGV trick. | 
|  | 16 |  | 
|  | 17 | Design | 
|  | 18 | ====== | 
|  | 19 |  | 
|  | 20 | Userfaults are delivered and resolved through the userfaultfd syscall. | 
|  | 21 |  | 
|  | 22 | The userfaultfd (aside from registering and unregistering virtual | 
|  | 23 | memory ranges) provides two primary functionalities: | 
|  | 24 |  | 
|  | 25 | 1) read/POLLIN protocol to notify a userland thread of the faults | 
|  | 26 | happening | 
|  | 27 |  | 
|  | 28 | 2) various UFFDIO_* ioctls that can manage the virtual memory regions | 
|  | 29 | registered in the userfaultfd that allows userland to efficiently | 
|  | 30 | resolve the userfaults it receives via 1) or to manage the virtual | 
|  | 31 | memory in the background | 
|  | 32 |  | 
|  | 33 | The real advantage of userfaults compared with regular virtual memory | 
|  | 34 | management via mremap/mprotect is that userfaults, in all their | 
|  | 35 | operations, never involve heavyweight structures like vmas (in fact the | 
|  | 36 | userfaultfd runtime load never takes the mmap_sem for writing). | 
|  | 37 |  | 
|  | 38 | Vmas are not suitable for page-granular (or hugepage-granular) fault | 
|  | 39 | tracking when dealing with virtual address spaces that could span | 
|  | 40 | terabytes. Too many vmas would be needed for that. | 
|  | 41 |  | 
|  | 42 | Once opened by invoking the syscall, the userfaultfd can also be | 
|  | 43 | passed over unix domain sockets to a manager process, so the same | 
|  | 44 | manager process can handle the userfaults of a multitude of | 
|  | 45 | different processes without them being aware of what is going on | 
|  | 46 | (unless, of course, they later try to use the userfaultfd | 
|  | 47 | themselves on the same region the manager is already tracking, which | 
|  | 48 | is a corner case that would currently return -EBUSY). | 
|  | 49 |  | 
|  | 50 | API | 
|  | 51 | === | 
|  | 52 |  | 
|  | 53 | When first opened, the userfaultfd must be enabled by invoking the | 
|  | 54 | UFFDIO_API ioctl with uffdio_api.api set to UFFD_API (or a later API | 
|  | 55 | version), which specifies the read/POLLIN protocol userland intends | 
|  | 56 | to speak on the UFFD, and with the uffdio_api.features userland | 
|  | 57 | requires. If successful (i.e. if the requested uffdio_api.api is | 
|  | 58 | also spoken by the running kernel and the requested features are | 
|  | 59 | going to be enabled), the UFFDIO_API ioctl returns two 64-bit | 
|  | 60 | bitmasks: uffdio_api.features holds all the available features of | 
|  | 61 | the read(2) protocol and uffdio_api.ioctls holds the generic ioctls | 
|  | 62 | available. | 
|  | 63 |  | 
|  | 64 | The uffdio_api.features bitmask returned by the UFFDIO_API ioctl | 
|  | 65 | defines what memory types are supported by the userfaultfd and what | 
|  | 66 | events, except page fault notifications, may be generated. | 
|  | 67 |  | 
|  | 68 | If the kernel supports registering userfaultfd ranges on hugetlbfs | 
|  | 69 | virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in | 
|  | 70 | uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be | 
|  | 71 | set if the kernel supports registering userfaultfd ranges on shared | 
|  | 72 | memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero | 
|  | 73 | MAP_SHARED, memfd_create, etc). | 
|  | 74 |  | 
|  | 75 | A userland application that wants to use userfaultfd with hugetlbfs | 
|  | 76 | or shared memory needs to set the corresponding flag in | 
|  | 77 | uffdio_api.features to enable those features. | 
|  | 78 |  | 
|  | 79 | If userland wants to receive notifications for events other than | 
|  | 80 | page faults, it has to verify that uffdio_api.features has the appropriate | 
|  | 81 | UFFD_FEATURE_EVENT_* bits set. These events are described in more | 
|  | 82 | detail below in the "Non-cooperative userfaultfd" section. | 
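The handshake described above could look roughly like the following C sketch. The helper names uffd_has_features() and uffd_open() are hypothetical and not part of any kernel or libc API. The sketch also assumes that the userfaultfd syscall is invoked through syscall(2) with SYS_userfaultfd, and that the caller wants a non-blocking descriptor.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: are all "wanted" bits present in "available"? */
static int uffd_has_features(uint64_t available, uint64_t wanted)
{
    return (available & wanted) == wanted;
}

/*
 * Hypothetical helper: open a userfaultfd and run the UFFDIO_API
 * handshake, requesting the feature bits in "wanted".
 * Returns the descriptor, or -1 with errno set. On success, *api holds
 * the features and ioctls bitmasks filled in by the kernel.
 */
static int uffd_open(uint64_t wanted, struct uffdio_api *api)
{
    int fd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    int saved_errno;

    if (fd < 0)
        return -1;

    memset(api, 0, sizeof(*api));
    api->api = UFFD_API;       /* protocol userland intends to speak */
    api->features = wanted;    /* features userland requires */
    if (ioctl(fd, UFFDIO_API, api) < 0) {
        saved_errno = errno;   /* keep the ioctl error across close() */
        close(fd);
        errno = saved_errno;
        return -1;
    }
    return fd;
}
```

A caller interested in shared memory might pass UFFD_FEATURE_MISSING_SHMEM as "wanted" and, on success, check api.features with uffd_has_features() before relying on any additional feature.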
|  | 83 |  | 
|  | 84 | Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should | 
|  | 85 | be invoked (if present in the returned uffdio_api.ioctls bitmask) to | 
|  | 86 | register a memory range in the userfaultfd by setting the | 
|  | 87 | uffdio_register structure accordingly. The uffdio_register.mode | 
|  | 88 | bitmask will specify to the kernel which kind of faults to track for | 
|  | 89 | the range (UFFDIO_REGISTER_MODE_MISSING would track missing | 
|  | 90 | pages). The UFFDIO_REGISTER ioctl will return the | 
|  | 91 | uffdio_register.ioctls bitmask of ioctls that are suitable to resolve | 
|  | 92 | userfaults on the range registered. Not all ioctls will necessarily be | 
|  | 93 | supported for all memory types depending on the underlying virtual | 
|  | 94 | memory backend (anonymous memory vs tmpfs vs real filebacked | 
|  | 95 | mappings). | 
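As a minimal sketch of the registration step, the following C code registers a range for missing-page tracking and then checks whether UFFDIO_COPY is advertised for it. The helper names are hypothetical. The sketch assumes that the range is page aligned, and that bit number _UFFDIO_COPY of uffdio_register.ioctls corresponds to the UFFDIO_COPY ioctl, as the UAPI header defines its range ioctl masks.

```c
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical helper: does an ioctls bitmask advertise ioctl number nr? */
static int uffd_ioctl_supported(uint64_t ioctls, unsigned int nr)
{
    return (int)((ioctls >> nr) & 1);
}

/*
 * Hypothetical helper: register [addr, addr + len) for missing-page
 * tracking (addr and len are assumed to be page aligned).
 * Returns 1 if the range is registered and UFFDIO_COPY can be used on
 * it, 0 if it is registered but UFFDIO_COPY is not advertised, and -1
 * with errno set if the registration failed.
 */
static int uffd_register_missing(int uffd, void *addr, size_t len)
{
    struct uffdio_register reg;

    memset(&reg, 0, sizeof(reg));
    reg.range.start = (uintptr_t)addr;
    reg.range.len = len;
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
        return -1;

    /* The kernel filled reg.ioctls with the ioctls valid for this range. */
    return uffd_ioctl_supported(reg.ioctls, _UFFDIO_COPY);
}
```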
|  | 96 |  | 
|  | 97 | Userland can use the uffdio_register.ioctls to manage the virtual | 
|  | 98 | address space in the background (to add or potentially also remove | 
|  | 99 | memory from the userfaultfd registered range). This means a userfault | 
|  | 100 | could trigger just before userland maps the user-faulted page in the | 
|  | 101 | background. | 
|  | 102 |  | 
|  | 103 | The primary ioctl to resolve userfaults is UFFDIO_COPY. It | 
|  | 104 | atomically copies a page into the userfault registered range and wakes | 
|  | 105 | up the blocked userfaults (unless uffdio_copy.mode & | 
|  | 106 | UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctls work similarly to | 
|  | 107 | UFFDIO_COPY. They are atomic in the sense that nothing can see a | 
|  | 108 | half-copied page, since the access keeps userfaulting until the copy has | 
|  | 109 | finished. | 
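Resolving a single fault with UFFDIO_COPY might be sketched as follows. The helpers page_align_down() and uffd_copy_page() are hypothetical names. The sketch assumes that the page size is a power of two and that src_page points to a full page of data in the caller's address space.

```c
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Round addr down to the start of its page (page_size: power of two). */
static uint64_t page_align_down(uint64_t addr, uint64_t page_size)
{
    return addr & ~(page_size - 1);
}

/*
 * Hypothetical helper: copy one page from src_page into the page that
 * contains fault_addr. If wake is zero, UFFDIO_COPY_MODE_DONTWAKE is set
 * and the blocked userfaults are not woken by this call.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int uffd_copy_page(int uffd, uint64_t fault_addr,
                          const void *src_page, uint64_t page_size, int wake)
{
    struct uffdio_copy copy;

    memset(&copy, 0, sizeof(copy));
    copy.dst = page_align_down(fault_addr, page_size);
    copy.src = (uintptr_t)src_page;
    copy.len = page_size;
    copy.mode = wake ? 0 : UFFDIO_COPY_MODE_DONTWAKE;
    if (ioctl(uffd, UFFDIO_COPY, &copy) < 0)
        return -1;
    return 0;
}
```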
|  | 110 |  | 
|  | 111 | QEMU/KVM | 
|  | 112 | ======== | 
|  | 113 |  | 
|  | 114 | QEMU/KVM uses the userfaultfd syscall to implement postcopy live | 
|  | 115 | migration. Postcopy live migration is one form of memory | 
|  | 116 | externalization consisting of a virtual machine running with part or | 
|  | 117 | all of its memory residing on a different node in the cloud. The | 
|  | 118 | userfaultfd abstraction is generic enough that not a single line of | 
|  | 119 | KVM kernel code had to be modified in order to add postcopy live | 
|  | 120 | migration to QEMU. | 
|  | 121 |  | 
|  | 122 | Guest async page faults, FOLL_NOWAIT and all other GUP features work | 
|  | 123 | just fine in combination with userfaults. Userfaults trigger async | 
|  | 124 | page faults in the guest scheduler so those guest processes that | 
|  | 125 | aren't waiting for userfaults (i.e. network bound) can keep running in | 
|  | 126 | the guest vcpus. | 
|  | 127 |  | 
|  | 128 | It is generally beneficial to run one pass of precopy live migration | 
|  | 129 | just before starting postcopy live migration, in order to avoid | 
|  | 130 | generating userfaults for readonly guest regions. | 
|  | 131 |  | 
|  | 132 | The implementation of postcopy live migration currently uses one | 
|  | 133 | single bidirectional socket but in the future two different sockets | 
|  | 134 | will be used (to reduce the latency of the userfaults to the minimum | 
|  | 135 | possible without having to decrease /proc/sys/net/ipv4/tcp_wmem). | 
|  | 136 |  | 
|  | 137 | The QEMU in the source node writes all pages that it knows are missing | 
|  | 138 | in the destination node into the socket, and the migration thread of | 
|  | 139 | the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE | 
|  | 140 | ioctls on the userfaultfd in order to map the received pages into the | 
|  | 141 | guest (UFFDIO_ZEROPAGE is used if the source page was a zero page). | 
|  | 142 |  | 
|  | 143 | A different postcopy thread in the destination node listens with | 
|  | 144 | poll() to the userfaultfd in parallel. When a POLLIN event is | 
|  | 145 | generated after a userfault triggers, the postcopy thread calls read() on | 
|  | 146 | the userfaultfd and receives the fault address (or -EAGAIN in case the | 
|  | 147 | userfault was already resolved and woken by a UFFDIO_COPY|ZEROPAGE run | 
|  | 148 | by the parallel QEMU migration thread). | 
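The poll()/read() step of such a fault-handling thread could be sketched in C as below. This is not QEMU's code. The function name uffd_wait_fault() is hypothetical, and the sketch assumes that the userfaultfd was opened with O_NONBLOCK, so that read() fails with EAGAIN instead of blocking when the fault has already been resolved.

```c
#include <errno.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait up to timeout_ms for a userfault.
 * Returns 1 and stores the faulting address in *addr if a page fault
 * message was read, 0 if there is nothing to resolve (timeout, EAGAIN
 * because the fault was already resolved, or a non-pagefault event),
 * and -1 on error.
 */
static int uffd_wait_fault(int uffd, int timeout_ms, uint64_t *addr)
{
    struct pollfd pfd = { .fd = uffd, .events = POLLIN };
    struct uffd_msg msg;
    ssize_t n;
    int ready = poll(&pfd, 1, timeout_ms);

    if (ready < 0)
        return -1;
    if (ready == 0)
        return 0;                       /* timed out */
    if (!(pfd.revents & POLLIN))
        return -1;                      /* POLLERR, POLLHUP or POLLNVAL */

    n = read(uffd, &msg, sizeof(msg));
    if (n < 0)
        return errno == EAGAIN ? 0 : -1;
    if (n != (ssize_t)sizeof(msg))
        return -1;
    if (msg.event != UFFD_EVENT_PAGEFAULT)
        return 0;                       /* other events handled elsewhere */

    *addr = msg.arg.pagefault.address;
    return 1;
}
```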
|  | 149 |  | 
|  | 150 | After the QEMU postcopy thread (running in the destination node) gets | 
|  | 151 | the userfault address it writes the information about the missing page | 
|  | 152 | into the socket. The QEMU source node receives the information and | 
|  | 153 | roughly "seeks" to that page address and continues sending all | 
|  | 154 | remaining missing pages from that new page offset. Soon after that | 
|  | 155 | (just the time to flush the tcp_wmem queue through the network) the | 
|  | 156 | migration thread in the QEMU running in the destination node will | 
|  | 157 | receive the page that triggered the userfault and it'll map it as | 
|  | 158 | usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it | 
|  | 159 | was spontaneously sent by the source or if it was an urgent page | 
|  | 160 | requested through a userfault). | 
|  | 161 |  | 
|  | 162 | By the time the userfaults start, the QEMU in the destination node | 
|  | 163 | doesn't need to keep any per-page state bitmap relative to the live | 
|  | 164 | migration around and a single per-page bitmap has to be maintained in | 
|  | 165 | the QEMU running in the source node to know which pages are still | 
|  | 166 | missing in the destination node. The bitmap in the source node is | 
|  | 167 | checked to find which missing pages to send in round robin and we seek | 
|  | 168 | over it when receiving incoming userfaults. After sending each page, | 
|  | 169 | the bitmap is updated accordingly. The bitmap is also useful to avoid | 
|  | 170 | sending the same page twice (in case the userfault is read by the | 
|  | 171 | postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration | 
|  | 172 | thread). | 
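The source-side bookkeeping described above can be illustrated with a small self-contained C sketch. This is not QEMU's implementation. The structure missing_map and all function names are hypothetical, and the sketch assumes one bit per page with a set bit meaning "still missing in the destination".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-page bitmap: a set bit means the page is still missing. */
struct missing_map {
    uint64_t *words;   /* one bit per page */
    size_t npages;     /* number of pages tracked */
    size_t cursor;     /* next page to consider in round robin */
};

static int page_missing(const struct missing_map *m, size_t page)
{
    return (int)((m->words[page / 64] >> (page % 64)) & 1);
}

static void mark_sent(struct missing_map *m, size_t page)
{
    m->words[page / 64] &= ~((uint64_t)1 << (page % 64));
}

/* On an incoming userfault, "seek" the round robin to the faulting page. */
static void seek_to(struct missing_map *m, size_t page)
{
    if (page < m->npages)
        m->cursor = page;
}

/*
 * Pick the next missing page at or after the cursor, wrapping around,
 * and clear its bit so it is never sent twice.
 * Returns the page index, or -1 when no page is missing anymore.
 */
static long next_page_to_send(struct missing_map *m)
{
    for (size_t i = 0; i < m->npages; i++) {
        size_t page = (m->cursor + i) % m->npages;

        if (page_missing(m, page)) {
            mark_sent(m, page);
            m->cursor = (page + 1) % m->npages;
            return (long)page;
        }
    }
    return -1;
}
```

Because the bit is cleared when the page is sent, a userfault that arrives for a page already on the wire only moves the cursor and does not cause a second transmission of that page.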
|  | 173 |  | 
|  | 174 | Non-cooperative userfaultfd | 
|  | 175 | =========================== | 
|  | 176 |  | 
|  | 177 | When the userfaultfd is monitored by an external manager, the manager | 
|  | 178 | must be able to track changes in the process virtual memory | 
|  | 179 | layout. Userfaultfd can notify the manager about such changes using | 
|  | 180 | the same read(2) protocol as for the page fault notifications. The | 
|  | 181 | manager has to explicitly enable these events by setting appropriate | 
|  | 182 | bits in uffdio_api.features passed to UFFDIO_API ioctl: | 
|  | 183 |  | 
|  | 184 | UFFD_FEATURE_EVENT_FORK | 
|  | 185 | enable userfaultfd hooks for fork(). When this feature is | 
|  | 186 | enabled, the userfaultfd context of the parent process is | 
|  | 187 | duplicated into the newly created process. The manager | 
|  | 188 | receives UFFD_EVENT_FORK with the file descriptor of the new | 
|  | 189 | userfaultfd context in the uffd_msg.fork. | 
|  | 190 |  | 
|  | 191 | UFFD_FEATURE_EVENT_REMAP | 
|  | 192 | enable notifications about mremap() calls. When the | 
|  | 193 | non-cooperative process moves a virtual memory area to a | 
|  | 194 | different location, the manager will receive | 
|  | 195 | UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and | 
|  | 196 | new addresses of the area and its original length. | 
|  | 197 |  | 
|  | 198 | UFFD_FEATURE_EVENT_REMOVE | 
|  | 199 | enable notifications about madvise(MADV_REMOVE) and | 
|  | 200 | madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will | 
|  | 201 | be generated upon these calls to madvise. The uffd_msg.remove | 
|  | 202 | will contain start and end addresses of the removed area. | 
|  | 203 |  | 
|  | 204 | UFFD_FEATURE_EVENT_UNMAP | 
|  | 205 | enable notifications about memory unmapping. The manager will | 
|  | 206 | get UFFD_EVENT_UNMAP with uffd_msg.remove containing start and | 
|  | 207 | end addresses of the unmapped area. | 
|  | 208 |  | 
|  | 209 | Although UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP | 
|  | 210 | are similar, they differ considerably in the action expected from the | 
|  | 211 | userfaultfd manager. In the former case, the virtual memory is | 
|  | 212 | removed but the area is not: the area remains monitored by the | 
|  | 213 | userfaultfd, and if a page fault occurs in that area it will be | 
|  | 214 | delivered to the manager. The proper resolution for such a page fault is | 
|  | 215 | to zeromap the faulting address. However, in the latter case, when an | 
|  | 216 | area is unmapped, either explicitly (with the munmap() system call) or | 
|  | 217 | implicitly (e.g. during mremap()), the area is removed and in turn the | 
|  | 218 | userfaultfd context for that area disappears too, and the manager will | 
|  | 219 | not get further userland page faults from the removed area. Still, the | 
|  | 220 | notification is required in order to prevent the manager from using | 
|  | 221 | UFFDIO_COPY on the unmapped area. | 
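One way a manager could turn the messages above into bookkeeping decisions is sketched below in C. The enum, the structure and the function uffd_classify() are hypothetical. The field names follow struct uffd_msg in the UAPI header, where the event payloads live in the "arg" union.

```c
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical actions a manager might take for each message type. */
enum uffd_action {
    ACT_IGNORE,
    ACT_RESOLVE_FAULT,   /* page fault: resolve the address in start */
    ACT_TRACK_CHILD,     /* fork: start monitoring child_uffd */
    ACT_MOVE_RANGE,      /* mremap: area now lives at [start, end) */
    ACT_KEEP_RANGE,      /* remove: still monitored, zeromap on fault */
    ACT_DROP_RANGE       /* unmap: stop using UFFDIO_COPY on the area */
};

struct uffd_decision {
    enum uffd_action action;
    uint64_t start;
    uint64_t end;
    int child_uffd;
};

static struct uffd_decision uffd_classify(const struct uffd_msg *msg)
{
    struct uffd_decision d = { ACT_IGNORE, 0, 0, -1 };

    switch (msg->event) {
    case UFFD_EVENT_PAGEFAULT:
        d.action = ACT_RESOLVE_FAULT;
        d.start = msg->arg.pagefault.address;
        break;
    case UFFD_EVENT_FORK:
        d.action = ACT_TRACK_CHILD;
        d.child_uffd = (int)msg->arg.fork.ufd;
        break;
    case UFFD_EVENT_REMAP:
        /* The old address is available in msg->arg.remap.from. */
        d.action = ACT_MOVE_RANGE;
        d.start = msg->arg.remap.to;
        d.end = msg->arg.remap.to + msg->arg.remap.len;
        break;
    case UFFD_EVENT_REMOVE:
        d.action = ACT_KEEP_RANGE;
        d.start = msg->arg.remove.start;
        d.end = msg->arg.remove.end;
        break;
    case UFFD_EVENT_UNMAP:
        d.action = ACT_DROP_RANGE;
        d.start = msg->arg.remove.start;
        d.end = msg->arg.remove.end;
        break;
    default:
        break;
    }
    return d;
}
```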
|  | 222 |  | 
|  | 223 | Unlike userland page faults, which have to be synchronous and require | 
|  | 224 | an explicit or implicit wakeup, all the events are delivered | 
|  | 225 | asynchronously and the non-cooperative process resumes execution as | 
|  | 226 | soon as the manager executes read(). The userfaultfd manager should | 
|  | 227 | carefully synchronize calls to UFFDIO_COPY with the event | 
|  | 228 | processing. To aid the synchronization, the UFFDIO_COPY ioctl will | 
|  | 229 | return -ENOSPC when the monitored process exits at the time of | 
|  | 230 | UFFDIO_COPY, and -ENOENT when the non-cooperative process has changed | 
|  | 231 | its virtual memory layout simultaneously with an outstanding UFFDIO_COPY | 
|  | 232 | operation. | 
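A manager might map the UFFDIO_COPY error codes mentioned above to follow-up actions as in this C sketch. The enum and the function classify_copy_result() are hypothetical. The ESRCH case is an assumption that goes beyond the text: to my understanding, the ioctl_userfaultfd(2) manual page lists ESRCH in place of ENOSPC on newer kernels, so the sketch accepts both.

```c
#include <errno.h>

/* Hypothetical outcomes of a UFFDIO_COPY call made by the manager. */
enum copy_outcome {
    COPY_OK,
    COPY_LAYOUT_CHANGED,   /* process pending events, then decide again */
    COPY_PROCESS_GONE,     /* monitored process exited: stop serving it */
    COPY_FAILED            /* any other error */
};

/*
 * ret is the return value of ioctl(uffd, UFFDIO_COPY, ...) and err is
 * the value of errno sampled right after the call.
 */
static enum copy_outcome classify_copy_result(int ret, int err)
{
    if (ret == 0)
        return COPY_OK;

    switch (err) {
    case ENOENT:
        return COPY_LAYOUT_CHANGED;
    case ENOSPC:   /* error code named in the text above */
    case ESRCH:    /* assumption: reported instead on newer kernels */
        return COPY_PROCESS_GONE;
    default:
        return COPY_FAILED;
    }
}
```

On COPY_LAYOUT_CHANGED a single-threaded manager would typically read the pending events from the userfaultfd first, since one of them describes the change that made the copy fail.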
|  | 233 |  | 
|  | 234 | The current asynchronous model of event delivery is optimal for | 
|  | 235 | single-threaded non-cooperative userfaultfd manager implementations. A | 
|  | 236 | synchronous event delivery model can be added later as a new | 
|  | 237 | userfaultfd feature to facilitate multithreading enhancements of the | 
|  | 238 | non-cooperative manager, for example to allow UFFDIO_COPY ioctls to | 
|  | 239 | run in parallel to the event reception. Single-threaded | 
|  | 240 | implementations should continue to use the current async event | 
|  | 241 | delivery model instead. |