|  | .. _userfaultfd: | 
|  |  | 
|  | =========== | 
|  | Userfaultfd | 
|  | =========== | 
|  |  | 
|  | Objective | 
|  | ========= | 
|  |  | 
|  | Userfaults allow the implementation of on-demand paging from userland | 
|  | and more generally they allow userland to take control of various | 
|  | memory page faults, something otherwise only the kernel code could do. | 
|  |  | 
|  | For example, userfaults allow a proper and more efficient implementation | 
|  | of the PROT_NONE+SIGSEGV trick. | 
|  |  | 
|  | Design | 
|  | ====== | 
|  |  | 
|  | Userfaults are delivered and resolved through the userfaultfd syscall. | 
|  |  | 
|  | The userfaultfd (aside from registering and unregistering virtual | 
|  | memory ranges) provides two primary functionalities: | 
|  |  | 
|  | 1) a read/POLLIN protocol to notify a userland thread of the faults | 
|  | happening | 
|  |  | 
|  | 2) various UFFDIO_* ioctls that manage the virtual memory regions | 
|  | registered in the userfaultfd, allowing userland to efficiently | 
|  | resolve the userfaults it receives via 1) or to manage the virtual | 
|  | memory in the background | 
|  |  | 
|  | The real advantage of userfaults compared to regular virtual memory | 
|  | management with mremap/mprotect is that userfaults never involve | 
|  | heavyweight structures like vmas in any of their operations (in fact the | 
|  | userfaultfd runtime load never takes the mmap_sem for writing). | 
|  |  | 
|  | Vmas are not suitable for page- (or hugepage-) granular fault tracking | 
|  | when dealing with virtual address spaces that could span | 
|  | terabytes. Too many vmas would be needed for that. | 
|  |  | 
|  | Once opened by invoking the syscall, the userfaultfd can also be | 
|  | passed over unix domain sockets to a manager process, so the same | 
|  | manager process can handle the userfaults of a multitude of | 
|  | different processes without them being aware of what is going on | 
|  | (unless, of course, they later try to use the userfaultfd | 
|  | themselves on the same region the manager is already tracking, which | 
|  | is a corner case that would currently return -EBUSY). | 
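
As an illustration of the descriptor passing described above, the following is a minimal sketch of sending and receiving a file descriptor over a unix domain socket with SCM_RIGHTS. The helper names send_fd() and recv_fd() are hypothetical and not part of any userfaultfd API; a userfaultfd descriptor would be handed to the manager the same way as any other descriptor.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send one file descriptor over a unix domain
 * socket using SCM_RIGHTS. Returns 0 on success, -1 on error. */
static int send_fd(int sock, int fd)
{
	char byte = 0;	/* stream sockets need one byte of real data */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces suitable alignment */
	} ctl;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&msg, 0, sizeof(msg));
	memset(&ctl, 0, sizeof(ctl));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one file descriptor sent by send_fd().
 * Returns the new descriptor, or -1 on error. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} ctl;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The monitored process would call send_fd() with its userfaultfd, and the manager would call recv_fd() on its end of the socket and then poll the received descriptor.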
|  |  | 
|  | API | 
|  | === | 
|  |  | 
|  | When first opened, the userfaultfd must be enabled by invoking the | 
|  | UFFDIO_API ioctl with uffdio_api.api set to UFFD_API (or | 
|  | a later API version), which specifies the read/POLLIN protocol | 
|  | userland intends to speak on the UFFD, and with uffdio_api.features | 
|  | set to the features userland requires. If successful (i.e. if the | 
|  | requested uffdio_api.api is also spoken by the running kernel and the | 
|  | requested features are going to be enabled), the UFFDIO_API ioctl | 
|  | returns two 64-bit bitmasks in uffdio_api.features and | 
|  | uffdio_api.ioctls: respectively, all the available features of the | 
|  | read(2) protocol and the generic ioctls available. | 
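
The handshake above might look roughly like the following sketch. The helper name uffd_open() is hypothetical, and the O_CLOEXEC | O_NONBLOCK flags are an assumption made for this example rather than a requirement stated here. Depending on kernel configuration and privileges, the syscall itself may fail, so the caller has to check the return value.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: open a userfaultfd and run the UFFDIO_API
 * handshake. Returns the descriptor, or -1 with errno set. On success
 * *api holds the features and ioctls bitmasks reported by the kernel. */
static int uffd_open(uint64_t wanted_features, struct uffdio_api *api)
{
	int fd, saved;

	assert(api != NULL);
	fd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (fd < 0)
		return -1;

	memset(api, 0, sizeof(*api));
	api->api = UFFD_API;			/* protocol version we speak */
	api->features = wanted_features;	/* features we require */
	if (ioctl(fd, UFFDIO_API, api) < 0) {
		saved = errno;
		close(fd);
		errno = saved;
		return -1;
	}
	return fd;
}
```

A caller could pass 0 as wanted_features to request only the baseline protocol and then inspect api.features and api.ioctls to see what the running kernel offers.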
|  |  | 
|  | The uffdio_api.features bitmask returned by the UFFDIO_API ioctl | 
|  | defines which memory types are supported by the userfaultfd and which | 
|  | events, other than page fault notifications, may be generated. | 
|  |  | 
|  | If the kernel supports registering userfaultfd ranges on hugetlbfs | 
|  | virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in | 
|  | uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be | 
|  | set if the kernel supports registering userfaultfd ranges on shared | 
|  | memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero | 
|  | MAP_SHARED, memfd_create, etc). | 
|  |  | 
|  | A userland application that wants to use userfaultfd with hugetlbfs | 
|  | or shared memory needs to set the corresponding flag in | 
|  | uffdio_api.features to enable those features. | 
|  |  | 
|  | If userland wants to receive notifications for events other than | 
|  | page faults, it has to verify that uffdio_api.features has the appropriate | 
|  | UFFD_FEATURE_EVENT_* bits set. These events are described in more | 
|  | detail below in the "Non-cooperative userfaultfd" section. | 
|  |  | 
|  | Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should | 
|  | be invoked (if present in the returned uffdio_api.ioctls bitmask) to | 
|  | register a memory range in the userfaultfd by setting the | 
|  | uffdio_register structure accordingly. The uffdio_register.mode | 
|  | bitmask will specify to the kernel which kind of faults to track for | 
|  | the range (UFFDIO_REGISTER_MODE_MISSING would track missing | 
|  | pages). The UFFDIO_REGISTER ioctl will return the | 
|  | uffdio_register.ioctls bitmask of ioctls that are suitable to resolve | 
|  | userfaults on the range registered. Not all ioctls will necessarily be | 
|  | supported for all memory types; this depends on the underlying virtual | 
|  | memory backend (anonymous memory vs tmpfs vs real file-backed | 
|  | mappings). | 
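
A registration along the lines described above could be sketched as follows. The helper names uffd_register_missing() and range_supports() are hypothetical. The bit positions _UFFDIO_COPY and _UFFDIO_ZEROPAGE come from linux/userfaultfd.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: register [addr, addr + len) for missing-page
 * tracking. On success stores the per-range ioctls bitmask in *ioctls
 * and returns 0; returns -1 with errno set on failure. */
static int uffd_register_missing(int uffd, void *addr, size_t len,
				 uint64_t *ioctls)
{
	struct uffdio_register reg;

	assert(ioctls != NULL);
	memset(&reg, 0, sizeof(reg));
	reg.range.start = (uintptr_t)addr;	/* must be page aligned */
	reg.range.len = len;			/* multiple of the page size */
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
	if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
		return -1;

	*ioctls = reg.ioctls;
	return 0;
}

/* Hypothetical helper: test whether ioctl number nr (for example
 * _UFFDIO_COPY or _UFFDIO_ZEROPAGE) is usable on the registered range. */
static int range_supports(uint64_t ioctls, unsigned int nr)
{
	return (ioctls & ((uint64_t)1 << nr)) != 0;
}
```

Checking range_supports(ioctls, _UFFDIO_ZEROPAGE) before relying on UFFDIO_ZEROPAGE is one way to cope with backends that do not support every ioctl.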
|  |  | 
|  | Userland can use the uffdio_register.ioctls to manage the virtual | 
|  | address space in the background (to add or potentially also remove | 
|  | memory from the userfaultfd registered range). This means a userfault | 
|  | can trigger just before userland maps the faulting page in the | 
|  | background. | 
|  |  | 
|  | The primary ioctl to resolve userfaults is UFFDIO_COPY. It | 
|  | atomically copies a page into the userfault registered range and wakes | 
|  | up the blocked userfaults (unless uffdio_copy.mode & | 
|  | UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctls work similarly to | 
|  | UFFDIO_COPY. They are atomic in the sense that nothing can see a | 
|  | half-copied page, since the page keeps userfaulting until the copy has | 
|  | finished. | 
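
Resolving a fault with UFFDIO_COPY might be sketched as below. The helper names make_copy_request() and uffd_copy() are hypothetical, and rounding the fault address down to a page boundary is an assumption of this example about how a handler would typically build the request.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: build a UFFDIO_COPY request that fills the page
 * containing fault_addr from the buffer at src. page_size must be a
 * power of two. With wake == 0 the blocked faults are not woken. */
static struct uffdio_copy make_copy_request(uint64_t fault_addr,
					    const void *src,
					    uint64_t page_size, int wake)
{
	struct uffdio_copy req;

	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	memset(&req, 0, sizeof(req));
	req.dst = fault_addr & ~(page_size - 1);	/* page-align */
	req.src = (uintptr_t)src;
	req.len = page_size;
	req.mode = wake ? 0 : UFFDIO_COPY_MODE_DONTWAKE;
	return req;
}

/* Hypothetical helper: issue the request. Returns 0 on success or -1
 * with errno set. */
static int uffd_copy(int uffd, struct uffdio_copy *req)
{
	return ioctl(uffd, UFFDIO_COPY, req) < 0 ? -1 : 0;
}
```

A handler would typically call make_copy_request() with the address it read from the userfaultfd and then pass the result to uffd_copy().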
|  |  | 
|  | QEMU/KVM | 
|  | ======== | 
|  |  | 
|  | QEMU/KVM is using the userfaultfd syscall to implement postcopy live | 
|  | migration. Postcopy live migration is one form of memory | 
|  | externalization consisting of a virtual machine running with part or | 
|  | all of its memory residing on a different node in the cloud. The | 
|  | userfaultfd abstraction is generic enough that not a single line of | 
|  | KVM kernel code had to be modified in order to add postcopy live | 
|  | migration to QEMU. | 
|  |  | 
|  | Guest async page faults, FOLL_NOWAIT and all other GUP features work | 
|  | just fine in combination with userfaults. Userfaults trigger async | 
|  | page faults in the guest scheduler so those guest processes that | 
|  | aren't waiting for userfaults (e.g. network-bound ones) can keep running in | 
|  | the guest vcpus. | 
|  |  | 
|  | It is generally beneficial to run one pass of precopy live migration | 
|  | just before starting postcopy live migration, in order to avoid | 
|  | generating userfaults for read-only guest regions. | 
|  |  | 
|  | The implementation of postcopy live migration currently uses one | 
|  | single bidirectional socket but in the future two different sockets | 
|  | will be used (to reduce the latency of the userfaults to the minimum | 
|  | possible without having to decrease /proc/sys/net/ipv4/tcp_wmem). | 
|  |  | 
|  | The QEMU in the source node writes all pages that it knows are missing | 
|  | in the destination node into the socket, and the migration thread of | 
|  | the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE | 
|  | ioctls on the userfaultfd in order to map the received pages into the | 
|  | guest (UFFDIO_ZEROPAGE is used if the source page was a zero page). | 
|  |  | 
|  | A different postcopy thread in the destination node listens with | 
|  | poll() to the userfaultfd in parallel. When a POLLIN event is | 
|  | generated after a userfault triggers, the postcopy thread calls read() on | 
|  | the userfaultfd and receives the fault address (or -EAGAIN in case the | 
|  | userfault was already resolved and woken by a UFFDIO_COPY|ZEROPAGE run | 
|  | by the parallel QEMU migration thread). | 
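
The poll()/read() step of such a listener thread could be sketched as follows. This is not QEMU's code; the helper name uffd_wait_fault() and its return convention are hypothetical, and the sketch only handles page fault messages.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: wait up to timeout_ms for a userfault.
 * Returns 1 and stores the fault address in *addr when a page fault
 * message was read, 0 when nothing is pending (timeout, or EAGAIN
 * because the fault was already resolved), and -1 on error. */
static int uffd_wait_fault(int uffd, int timeout_ms, uint64_t *addr)
{
	struct pollfd pfd = { .fd = uffd, .events = POLLIN };
	struct uffd_msg msg;
	ssize_t n;
	int ready;

	assert(addr != NULL);
	ready = poll(&pfd, 1, timeout_ms);
	if (ready <= 0)
		return ready;	/* 0: timeout, -1: poll() failed */

	n = read(uffd, &msg, sizeof(msg));
	if (n < 0)
		return errno == EAGAIN ? 0 : -1;
	if (n != (ssize_t)sizeof(msg) || msg.event != UFFD_EVENT_PAGEFAULT)
		return -1;	/* this sketch ignores other events */

	*addr = msg.arg.pagefault.address;
	return 1;
}
```

The EAGAIN branch relies on the descriptor having been opened in non-blocking mode, which is an assumption of this sketch.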
|  |  | 
|  | After the QEMU postcopy thread (running in the destination node) gets | 
|  | the userfault address it writes the information about the missing page | 
|  | into the socket. The QEMU source node receives the information and | 
|  | roughly "seeks" to that page address and continues sending all | 
|  | remaining missing pages from that new page offset. Soon after that | 
|  | (just the time to flush the tcp_wmem queue through the network) the | 
|  | migration thread in the QEMU running in the destination node will | 
|  | receive the page that triggered the userfault and it'll map it as | 
|  | usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it | 
|  | was spontaneously sent by the source or if it was an urgent page | 
|  | requested through a userfault). | 
|  |  | 
|  | By the time the userfaults start, the QEMU in the destination node | 
|  | doesn't need to keep any per-page state bitmap relative to the live | 
|  | migration around; a single per-page bitmap has to be maintained in | 
|  | the QEMU running in the source node to know which pages are still | 
|  | missing in the destination node. The bitmap in the source node is | 
|  | checked to find which missing pages to send in round robin, and the | 
|  | source seeks over it when receiving incoming userfaults. After sending | 
|  | each page, the bitmap is updated accordingly. The bitmap is also useful | 
|  | to avoid sending the same page twice (in case the userfault is read by the | 
|  | postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration | 
|  | thread). | 
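
The source-side bookkeeping can be illustrated with a toy model. This is a sketch only and not QEMU's implementation: the structure, the function names and the 16-page guest size are all hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define NPAGES 16	/* hypothetical guest size, in pages */

struct missing_map {
	unsigned char bits[(NPAGES + CHAR_BIT - 1) / CHAR_BIT];
	size_t cursor;	/* where the round-robin scan resumes */
};

/* Mark every page as still missing on the destination. */
static void map_init(struct missing_map *m)
{
	memset(m->bits, 0xff, sizeof(m->bits));
	m->cursor = 0;
}

static int is_missing(const struct missing_map *m, size_t page)
{
	return (m->bits[page / CHAR_BIT] >> (page % CHAR_BIT)) & 1;
}

static void mark_sent(struct missing_map *m, size_t page)
{
	m->bits[page / CHAR_BIT] &= (unsigned char)~(1u << (page % CHAR_BIT));
}

/* An incoming userfault moves the scan position to the faulting page. */
static void seek_to(struct missing_map *m, size_t page)
{
	assert(page < NPAGES);
	m->cursor = page;
}

/* Pick the next missing page at or after the cursor (wrapping around),
 * mark it as sent and return its index; return -1 when none is left. */
static long send_next(struct missing_map *m)
{
	size_t i, page;

	for (i = 0; i < NPAGES; i++) {
		page = (m->cursor + i) % NPAGES;
		if (is_missing(m, page)) {
			mark_sent(m, page);
			m->cursor = (page + 1) % NPAGES;
			return (long)page;
		}
	}
	return -1;
}
```

Because a page is cleared from the bitmap as soon as it is sent, a userfault for a page that is already on the wire simply makes the scan continue with the next missing page instead of resending it.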
|  |  | 
|  | Non-cooperative userfaultfd | 
|  | =========================== | 
|  |  | 
|  | When the userfaultfd is monitored by an external manager, the manager | 
|  | must be able to track changes in the process virtual memory | 
|  | layout. Userfaultfd can notify the manager about such changes using | 
|  | the same read(2) protocol as for the page fault notifications. The | 
|  | manager has to explicitly enable these events by setting the appropriate | 
|  | bits in uffdio_api.features passed to the UFFDIO_API ioctl: | 
|  |  | 
|  | UFFD_FEATURE_EVENT_FORK | 
|  |     enable userfaultfd hooks for fork(). When this feature is | 
|  |     enabled, the userfaultfd context of the parent process is | 
|  |     duplicated into the newly created process. The manager | 
|  |     receives UFFD_EVENT_FORK with the file descriptor of the new | 
|  |     userfaultfd context in uffd_msg.fork. | 
|  |  | 
|  | UFFD_FEATURE_EVENT_REMAP | 
|  |     enable notifications about mremap() calls. When the | 
|  |     non-cooperative process moves a virtual memory area to a | 
|  |     different location, the manager will receive | 
|  |     UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and | 
|  |     new addresses of the area and its original length. | 
|  |  | 
|  | UFFD_FEATURE_EVENT_REMOVE | 
|  |     enable notifications about madvise(MADV_REMOVE) and | 
|  |     madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will | 
|  |     be generated upon these calls to madvise. The uffd_msg.remove | 
|  |     will contain the start and end addresses of the removed area. | 
|  |  | 
|  | UFFD_FEATURE_EVENT_UNMAP | 
|  |     enable notifications about memory unmapping. The manager will | 
|  |     get UFFD_EVENT_UNMAP with uffd_msg.remove containing the start and | 
|  |     end addresses of the unmapped area. | 
|  |  | 
|  | Although UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP | 
|  | are similar, they differ considerably in the action expected from the | 
|  | userfaultfd manager. In the former case, the virtual memory is | 
|  | removed but the area is not: the area remains monitored by the | 
|  | userfaultfd, and if a page fault occurs in that area it will be | 
|  | delivered to the manager. The proper resolution for such a page fault is | 
|  | to zeromap the faulting address. In the latter case, however, when an | 
|  | area is unmapped, either explicitly (with the munmap() system call) or | 
|  | implicitly (e.g. during mremap()), the area is removed, the | 
|  | userfaultfd context for that area disappears too, and the manager will | 
|  | not get further userland page faults from the removed area. Still, the | 
|  | notification is required in order to prevent the manager from using | 
|  | UFFDIO_COPY on the unmapped area. | 
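
A manager's dispatch over the event types described above might be sketched as follows. The enum manager_action and the function classify_event() are hypothetical names chosen for this example; only the UFFD_EVENT_* constants and the uffd_msg fields come from linux/userfaultfd.h.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/userfaultfd.h>

/* Hypothetical set of reactions a manager could take. */
enum manager_action {
	ACT_RESOLVE_FAULT,	/* page fault: resolve it, e.g. UFFDIO_COPY */
	ACT_TRACK_CHILD,	/* fork: start monitoring msg->arg.fork.ufd */
	ACT_MOVE_RANGE,		/* mremap: see msg->arg.remap.from/to/len */
	ACT_ZEROMAP_LATER,	/* remove: range stays monitored */
	ACT_FORGET_RANGE,	/* unmap: stop using UFFDIO_COPY there */
	ACT_UNKNOWN
};

/* Map a message to an action. For REMOVE and UNMAP the affected range
 * is returned in *start and *end; otherwise both are set to 0. */
static enum manager_action classify_event(const struct uffd_msg *msg,
					  uint64_t *start, uint64_t *end)
{
	assert(msg && start && end);
	*start = *end = 0;

	switch (msg->event) {
	case UFFD_EVENT_PAGEFAULT:
		return ACT_RESOLVE_FAULT;
	case UFFD_EVENT_FORK:
		return ACT_TRACK_CHILD;
	case UFFD_EVENT_REMAP:
		return ACT_MOVE_RANGE;
	case UFFD_EVENT_REMOVE:
		*start = msg->arg.remove.start;
		*end = msg->arg.remove.end;
		return ACT_ZEROMAP_LATER;
	case UFFD_EVENT_UNMAP:
		*start = msg->arg.remove.start;
		*end = msg->arg.remove.end;
		return ACT_FORGET_RANGE;
	default:
		return ACT_UNKNOWN;
	}
}
```

Keeping REMOVE and UNMAP as separate actions mirrors the distinction made in the text: after REMOVE, later faults in the range are still delivered and can be zeromapped, whereas after UNMAP the manager should drop the range from its bookkeeping.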
|  |  | 
|  | Unlike userland page faults, which have to be synchronous and require | 
|  | an explicit or implicit wakeup, all the events are delivered | 
|  | asynchronously and the non-cooperative process resumes execution as | 
|  | soon as the manager executes read(). The userfaultfd manager should | 
|  | carefully synchronize calls to UFFDIO_COPY with the event | 
|  | processing. To aid the synchronization, the UFFDIO_COPY ioctl will | 
|  | return -ENOSPC when the monitored process exits at the time of | 
|  | UFFDIO_COPY, and -ENOENT when the non-cooperative process has changed | 
|  | its virtual memory layout simultaneously with an outstanding UFFDIO_COPY | 
|  | operation. | 
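
One way a manager could act on these two error codes is sketched below. The enum and function names are hypothetical, and the mapping covers only the two errno values named in the text above; anything else is treated as a generic failure.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical outcomes of a UFFDIO_COPY attempt. */
enum copy_outcome {
	COPY_DONE,		/* page installed */
	COPY_PROCESS_EXITED,	/* ENOSPC: stop serving this process */
	COPY_LAYOUT_CHANGED,	/* ENOENT: drain pending events, then retry */
	COPY_OTHER_ERROR
};

/* Classify the result of ioctl(uffd, UFFDIO_COPY, ...): ret is the
 * ioctl return value and err the errno value observed on failure. */
static enum copy_outcome classify_copy_result(int ret, int err)
{
	if (ret == 0)
		return COPY_DONE;

	assert(err > 0);
	switch (err) {
	case ENOSPC:
		return COPY_PROCESS_EXITED;
	case ENOENT:
		return COPY_LAYOUT_CHANGED;
	default:
		return COPY_OTHER_ERROR;
	}
}
```

On COPY_LAYOUT_CHANGED a manager would typically read and process the queued events first, since one of them probably describes the change that invalidated the copy.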
|  |  | 
|  | The current asynchronous model of event delivery is optimal for | 
|  | single-threaded non-cooperative userfaultfd manager implementations. A | 
|  | synchronous event delivery model can be added later as a new | 
|  | userfaultfd feature to facilitate multithreading enhancements of the | 
|  | non-cooperative manager, for example to allow UFFDIO_COPY ioctls to | 
|  | run in parallel to the event reception. Single-threaded | 
|  | implementations should continue to use the current async event | 
|  | delivery model instead. |