Heterogeneous Memory Management (HMM)

Transparently allow any component of a program to use any memory region of said
program with a device without using a device specific memory allocator. This is
becoming a requirement to simplify the use of advanced heterogeneous computing
where GPUs, DSPs or FPGAs are used to perform various computations.

This document is divided as follows: the first section exposes the problems
related to the use of a device specific allocator. The second section exposes
the hardware limitations that are inherent to many platforms. The third section
gives an overview of the HMM design. The fourth section explains how CPU page-
table mirroring works and what HMM's purpose is in this context. The fifth
section deals with how device memory is represented inside the kernel. Finally
the last section presents a new migration helper that allows leveraging the
device DMA engine.


1) Problems of using a device specific memory allocator
2) System bus, device memory characteristics
3) Shared address space and migration
4) Address space mirroring implementation and API
5) Represent and manage device memory from core kernel point of view
6) Migrate to and from device memory
7) Memory cgroup (memcg) and rss accounting


-------------------------------------------------------------------------------

1) Problems of using a device specific memory allocator

Devices with a large amount of on-board memory (several gigabytes), like GPUs,
have historically managed their memory through a dedicated driver specific API.
This creates a disconnect between memory allocated and managed by the device
driver and regular application memory (private anonymous, shared memory or
regular file backed memory). From here on I will refer to this aspect as split
address space. I use shared address space to refer to the opposite situation,
i.e. one in which any memory region can be used by the device transparently.

The address space is split because the device can only access memory allocated
through the device specific API. This implies that not all memory objects in a
program are equal from the device point of view, which complicates large
programs that rely on a wide set of libraries.

Concretely, this means that code that wants to leverage a device like a GPU
needs to copy objects between generically allocated memory (malloc, mmap
private, mmap share) and memory allocated through the device driver API (this
still ends up with an mmap, but of the device file).

For flat data sets (array, grid, image, ...) this isn't too hard to achieve,
but complex data sets (list, tree, ...) are hard to get right. Duplicating a
complex data set requires re-mapping all the pointer relations between each of
its elements. This is error prone and programs get harder to debug because of
the duplicated data set.

Split address space also means that libraries cannot transparently use data
they are getting from the core program or from another library, and thus each
library might have to duplicate its input data set using the device specific
memory allocator. Large projects suffer from this and waste resources because
of the various memory copies.

Duplicating each library API to accept as input or output memory allocated by
each device specific allocator is not a viable option. It would lead to a
combinatorial explosion in the number of library entry points.

Finally, with the advance of high level language constructs (in C++ but in
other languages too) it is now possible for the compiler to leverage a GPU or
other devices without even the programmer's knowledge. Some compiler identified
patterns are only doable with a shared address space. It is also more
reasonable to use a shared address space for all the other patterns.


-------------------------------------------------------------------------------

2) System bus, device memory characteristics

System buses cripple shared address spaces due to a few limitations. Most
system buses only allow basic memory accesses from the device to main memory;
even cache coherency is often optional. Access to device memory from a CPU is
even more limited; more often than not, it is not cache coherent.

If we only consider the PCIE bus, then a device can access main memory (often
through an IOMMU) and be cache coherent with the CPUs. However, it only allows
a limited set of atomic operations from the device on main memory. This is
worse in the other direction: the CPUs can only access a limited range of the
device memory and cannot perform atomic operations on it. Thus device memory
cannot be considered the same as regular memory from the kernel point of view.

Another crippling factor is the limited bandwidth (~32 GBytes/s with PCIE 4.0
and 16 lanes). This is 33 times less than the fastest GPU memory (1 TBytes/s).
The final limitation is latency: access to main memory from the device has an
order of magnitude higher latency than when the device accesses its own memory.

Some platforms are developing new system buses or additions/modifications to
PCIE to address some of these limitations (OpenCAPI, CCIX). They mainly allow
two-way cache coherency between CPU and device and allow all atomic operations
the architecture supports. Sadly, not all platforms are following this trend
and some major architectures are left without hardware solutions to these
problems.

So for a shared address space to make sense, not only must we allow devices to
access any memory, but we must also permit any memory to be migrated to device
memory while the device is using it (blocking CPU access while it happens).


-------------------------------------------------------------------------------

3) Shared address space and migration

HMM intends to provide two main features. The first one is to share the address
space by duplicating the CPU page table in the device page table, so that the
same address points to the same physical memory for any valid main memory
address in the process address space.

To achieve this, HMM offers a set of helpers to populate the device page table
while keeping track of CPU page table updates. Device page table updates are
not as easy as CPU page table updates. To update the device page table, you
must allocate a buffer (or use a pool of pre-allocated buffers) and write GPU
specific commands in it to perform the update (unmap, cache invalidations and
flush, ...). This cannot be done through common code for all devices. This is
why HMM provides helpers to factor out everything that can be shared, while
leaving the gory details to the device driver.

The second mechanism HMM provides is a new kind of ZONE_DEVICE memory that
allows allocating a struct page for each page of the device memory. Those pages
are special because the CPU cannot map them. However, they allow migrating
main memory to device memory using the existing migration mechanism, and from
the CPU point of view everything looks as if the page was swapped out to disk.
Using a struct page gives the easiest and cleanest integration with existing mm
mechanisms. Here again, HMM only provides helpers, first to hotplug new
ZONE_DEVICE memory for the device memory and second to perform migration. The
policy decision of what and when to migrate is left to the device driver.

Note that any CPU access to a device page triggers a page fault and a migration
back to main memory. That is, when a page backing a given address A is migrated
from a main memory page to a device page, then any CPU access to address A
triggers a page fault and initiates a migration back to main memory.


With these two features, HMM not only allows a device to mirror a process
address space and keep both the CPU and device page tables synchronized, but
also allows leveraging device memory by migrating the parts of a data set that
are actively used by a device.


-------------------------------------------------------------------------------

4) Address space mirroring implementation and API

Address space mirroring's main objective is to allow duplication of a range of
the CPU page table into a device page table; HMM helps keep both synchronized.
A device driver that wants to mirror a process address space must start with
the registration of an hmm_mirror struct:

 int hmm_mirror_register(struct hmm_mirror *mirror,
                         struct mm_struct *mm);
 int hmm_mirror_register_locked(struct hmm_mirror *mirror,
                                struct mm_struct *mm);

The locked variant is to be used when the driver is already holding mmap_sem
of the mm in write mode. The mirror struct has a set of callbacks that are
used to propagate CPU page table updates:

 struct hmm_mirror_ops {
     /* sync_cpu_device_pagetables() - synchronize page tables
      *
      * @mirror: pointer to struct hmm_mirror
      * @action: type of update that occurred to the CPU page table
      * @start: virtual start address of the range to update
      * @end: virtual end address of the range to update
      *
      * This callback ultimately originates from mmu_notifiers when the CPU
      * page table is updated. The device driver must update its page table
      * in response to this callback. The action argument tells what
      * operation to perform.
      *
      * The device driver must not return from this callback until the device
      * page tables are completely updated (TLBs flushed, etc); this is a
      * synchronous call.
      */
      void (*update)(struct hmm_mirror *mirror,
                     enum hmm_update action,
                     unsigned long start,
                     unsigned long end);
 };

The device driver must update the range according to the action (turn the
range read only, fully unmap it, ...). Once the driver callback returns, the
device must be done with the update.
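
As an illustration only, here is a minimal sketch of what registration and the
update() callback might look like in a driver. The my_mirror container, its
update_lock and the my_device_*() helpers are hypothetical driver specific
names, and the assignment of mirror.ops before registration is an assumption
about the hmm_mirror structure layout; only hmm_mirror, hmm_mirror_ops,
enum hmm_update and hmm_mirror_register() are the HMM API described above:

 /* Hypothetical driver state wrapping the HMM mirror. */
 struct my_mirror {
     struct hmm_mirror mirror;
     struct mutex update_lock;   /* also taken when populating ranges */
     /* ... device specific state ... */
 };

 static void my_mirror_update(struct hmm_mirror *mirror,
                              enum hmm_update action,
                              unsigned long start,
                              unsigned long end)
 {
     struct my_mirror *self = container_of(mirror, struct my_mirror, mirror);

     mutex_lock(&self->update_lock);
     /*
      * For simplicity this sketch treats every action as a full invalidation
      * of the range; a real driver would look at the action argument.
      */
     my_device_unmap_range(self, start, end);
     /* The callback is synchronous: wait for the device TLB flush. */
     my_device_flush_and_wait(self);
     mutex_unlock(&self->update_lock);
 }

 static const struct hmm_mirror_ops my_mirror_ops = {
     .update = my_mirror_update,
 };

 static int my_mirror_init(struct my_mirror *self, struct mm_struct *mm)
 {
     self->mirror.ops = &my_mirror_ops;
     return hmm_mirror_register(&self->mirror, mm);
 }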


When the device driver wants to populate a range of virtual addresses, it can
use either:
 int hmm_vma_get_pfns(struct vm_area_struct *vma,
                      struct hmm_range *range,
                      unsigned long start,
                      unsigned long end,
                      hmm_pfn_t *pfns);
 int hmm_vma_fault(struct vm_area_struct *vma,
                   struct hmm_range *range,
                   unsigned long start,
                   unsigned long end,
                   hmm_pfn_t *pfns,
                   bool write,
                   bool block);

The first one (hmm_vma_get_pfns()) will only fetch present CPU page table
entries and will not trigger a page fault on missing or non-present entries.
The second one does trigger a page fault on missing or read-only entries if
the write parameter is true. Page faults use the generic mm page fault code
path, just like a CPU page fault.

Both functions copy CPU page table entries into their pfns array argument.
Each entry in that array corresponds to an address in the virtual range. HMM
provides a set of flags to help the driver identify special CPU page table
entries.
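
As a rough illustration of consuming that array, here is a sketch of a driver
walking the pfns returned for a range. The my_fill_device_pagetable() and
my_device_map_page() names are hypothetical, and the HMM_PFN_VALID,
HMM_PFN_WRITE and HMM_PFN_ERROR flags, as well as the hmm_pfn_t_to_page()
helper, are assumptions about the flag names; check include/linux/hmm.h for
the exact ones:

 static int my_fill_device_pagetable(struct my_mirror *self,
                                     unsigned long start,
                                     unsigned long end,
                                     hmm_pfn_t *pfns)
 {
     unsigned long addr;
     unsigned long i;

     for (addr = start, i = 0; addr < end; addr += PAGE_SIZE, i++) {
         struct page *page;

         if (pfns[i] & HMM_PFN_ERROR)
             return -EFAULT;
         if (!(pfns[i] & HMM_PFN_VALID))
             continue;       /* hole, nothing to map for this address */

         page = hmm_pfn_t_to_page(pfns[i]);
         /* Map the page in the device page table, honoring write permission. */
         my_device_map_page(self, addr, page, pfns[i] & HMM_PFN_WRITE);
     }
     return 0;
 }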

Locking around the update() callback is the most important aspect the driver
must respect in order to keep things properly synchronized. The usage pattern
is:

 int driver_populate_range(...)
 {
      struct hmm_range range;
      ...
 again:
      ret = hmm_vma_get_pfns(vma, &range, start, end, pfns);
      if (ret)
          return ret;
      take_lock(driver->update);
      if (!hmm_vma_range_done(vma, &range)) {
          release_lock(driver->update);
          goto again;
      }

      // Use pfns array content to update device page table

      release_lock(driver->update);
      return 0;
 }

The driver->update lock is the same lock that the driver takes inside its
update() callback. That lock must be held before calling hmm_vma_range_done()
to avoid any race with a concurrent CPU page table update.

HMM implements all of this on top of the mmu_notifier API because we wanted a
simpler API and also to be able to perform optimizations later on, like doing
concurrent device updates in multi-device scenarios.

HMM also serves as an impedance-matching layer between how CPU page table
updates are done (by the CPU writing to the page table and flushing TLBs) and
how devices update their own page table. A device update is a multi-step
process: first, appropriate commands are written to a buffer, then this buffer
is scheduled for execution on the device. It is only once the device has
executed the commands in the buffer that the update is done. Creating and
scheduling the update command buffers can happen concurrently for multiple
devices. Waiting for each device to report commands as executed is serialized
(there is no point in doing this concurrently).


-------------------------------------------------------------------------------

5) Represent and manage device memory from core kernel point of view

Several different designs were tried to support device memory. The first one
used a device specific data structure to keep information about migrated
memory, and HMM hooked itself in various places of mm code to handle any
access to addresses that were backed by device memory. It turned out that this
ended up replicating most of the fields of struct page and also needed many
kernel code paths to be updated to understand this new kind of memory.

The thing is, most kernel code paths never try to access the memory behind a
page but only care about struct page contents. Because of this, HMM switched
to directly using struct page for device memory, which left most kernel code
paths unaware of the difference. We only need to make sure that no one ever
tries to map those pages from the CPU side.

HMM provides a set of helpers to register and hotplug device memory as a new
region needing struct pages. This is offered through a very simple API:

 struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
                                   struct device *device,
                                   unsigned long size);
 void hmm_devmem_remove(struct hmm_devmem *devmem);

The hmm_devmem_ops is where most of the important things are:

 struct hmm_devmem_ops {
     void (*free)(struct hmm_devmem *devmem, struct page *page);
     int (*fault)(struct hmm_devmem *devmem,
                  struct vm_area_struct *vma,
                  unsigned long addr,
                  struct page *page,
                  unsigned flags,
                  pmd_t *pmdp);
 };

The first callback (free()) happens when the last reference on a device page
is dropped. This means the device page is now free and no longer used by
anyone. The second callback happens whenever the CPU tries to access a device
page, which it cannot do. This second callback must trigger a migration back
to system memory.
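
To make the flow concrete, here is a minimal sketch of how a driver might
hotplug its memory and wire up these callbacks. The my_device structure, the
my_devmem_*() and my_migrate_back_to_system() helpers, the return value
conventions of fault(), and the assumption that hmm_devmem_add() returns an
ERR_PTR() on failure are illustrative, not part of the API described above:

 static void my_devmem_free(struct hmm_devmem *devmem, struct page *page)
 {
     /* Return the corresponding device memory block to the driver allocator. */
 }

 static int my_devmem_fault(struct hmm_devmem *devmem,
                            struct vm_area_struct *vma,
                            unsigned long addr,
                            struct page *page,
                            unsigned flags,
                            pmd_t *pmdp)
 {
     /* Must migrate the page backing addr back to system memory. */
     return my_migrate_back_to_system(devmem, vma, addr, page, pmdp);
 }

 static const struct hmm_devmem_ops my_devmem_ops = {
     .free  = my_devmem_free,
     .fault = my_devmem_fault,
 };

 int my_device_memory_init(struct my_device *mydev)
 {
     /* hmm_devmem_add() is assumed to return an ERR_PTR() on failure. */
     mydev->devmem = hmm_devmem_add(&my_devmem_ops, mydev->dev,
                                    mydev->vram_size);
     if (IS_ERR(mydev->devmem))
         return PTR_ERR(mydev->devmem);
     return 0;
 }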


-------------------------------------------------------------------------------

6) Migrate to and from device memory

Because the CPU cannot access device memory, migration must use the device DMA
engine to perform copies from and to device memory. For this we need a new
migration helper:

 int migrate_vma(const struct migrate_vma_ops *ops,
                 struct vm_area_struct *vma,
                 unsigned long mentries,
                 unsigned long start,
                 unsigned long end,
                 unsigned long *src,
                 unsigned long *dst,
                 void *private);

Unlike other migration functions, it works on a range of virtual addresses.
There are two reasons for that. First, a device DMA copy has a high setup
overhead cost and thus batching multiple pages is needed, as otherwise the
migration overhead makes the whole exercise pointless. The second reason is
that the driver triggers such migrations based on the range of addresses the
device is actively accessing.

The migrate_vma_ops struct defines two callbacks. The first one
(alloc_and_copy()) controls destination memory allocation and the copy
operation. The second one is there to allow the device driver to perform
cleanup operations after migration.

 struct migrate_vma_ops {
     void (*alloc_and_copy)(struct vm_area_struct *vma,
                            const unsigned long *src,
                            unsigned long *dst,
                            unsigned long start,
                            unsigned long end,
                            void *private);
     void (*finalize_and_map)(struct vm_area_struct *vma,
                              const unsigned long *src,
                              const unsigned long *dst,
                              unsigned long start,
                              unsigned long end,
                              void *private);
 };

It is important to stress that this migration helper allows for holes in the
virtual address range. Some pages in the range might not be migrated for all
the usual reasons (page is pinned, page is locked, ...). This helper does not
fail but just skips over those pages.

The alloc_and_copy() callback might as well decide not to migrate all pages in
the range (for reasons under the callback's control). For those, the callback
just has to leave the corresponding dst entry empty.

Finally, the migration of the struct page might fail (for a file backed page)
for various reasons (failure to freeze the reference, or to update the page
cache, ...). If that happens, then finalize_and_map() can catch any pages that
were not migrated. Note those pages were still copied to a new page and thus
we wasted bandwidth, but this is considered a rare event and a price that we
are willing to pay to keep all the code simpler.
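
To tie the pieces together, here is a minimal sketch of a driver initiated
migration to device memory using the prototypes above. The my_device structure
and the my_*() helpers are hypothetical, and the MIGRATE_PFN_* flags,
migrate_pfn() and migrate_pfn_to_page() encoding of the src/dst arrays are
assumptions about the entry format in include/linux/migrate.h rather than
something this document defines; only migrate_vma() and struct migrate_vma_ops
are the API described above:

 static void my_alloc_and_copy(struct vm_area_struct *vma,
                               const unsigned long *src,
                               unsigned long *dst,
                               unsigned long start,
                               unsigned long end,
                               void *private)
 {
     struct my_device *mydev = private;
     unsigned long addr, i;

     for (addr = start, i = 0; addr < end; addr += PAGE_SIZE, i++) {
         struct page *dpage;

         /* Skip holes and pages the core decided not to migrate. */
         if (!(src[i] & MIGRATE_PFN_MIGRATE))
             continue;

         dpage = my_device_alloc_page(mydev);
         if (!dpage) {
             dst[i] = 0;     /* leave empty: this page stays where it is */
             continue;
         }

         /* Schedule a DMA copy from the source page into device memory. */
         my_device_dma_copy(mydev, migrate_pfn_to_page(src[i]), dpage);
         dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED |
                  MIGRATE_PFN_DEVICE;
     }
     /* Wait for all DMA copies to complete before returning. */
     my_device_dma_wait(mydev);
 }

 static void my_finalize_and_map(struct vm_area_struct *vma,
                                 const unsigned long *src,
                                 const unsigned long *dst,
                                 unsigned long start,
                                 unsigned long end,
                                 void *private)
 {
     /* Update the device page table for the pages that did migrate. */
 }

 static const struct migrate_vma_ops my_migrate_ops = {
     .alloc_and_copy   = my_alloc_and_copy,
     .finalize_and_map = my_finalize_and_map,
 };

 int my_migrate_range(struct my_device *mydev, struct vm_area_struct *vma,
                      unsigned long start, unsigned long end)
 {
     unsigned long npages = (end - start) >> PAGE_SHIFT;
     /* Small fixed-size arrays keep the sketch simple; a real driver would
      * size these dynamically. */
     unsigned long src[64], dst[64];

     if (npages > 64)
         return -EINVAL;
     return migrate_vma(&my_migrate_ops, vma, npages, start, end,
                        src, dst, mydev);
 }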


-------------------------------------------------------------------------------

7) Memory cgroup (memcg) and rss accounting

For now, device memory is accounted as any regular page in rss counters
(either anonymous if the device page is used for anonymous memory, file if the
device page is used for file backed pages, or shmem if the device page is used
for shared memory). This is a deliberate choice so that existing applications,
which might start using device memory without knowing about it, keep running
unimpacted.

A drawback is that the OOM killer might kill an application using a lot of
device memory and not a lot of regular system memory, and thus not free much
system memory. We want to gather more real world experience on how
applications and the system react under memory pressure in the presence of
device memory before deciding to account device memory differently.


The same decision was made for memory cgroups. Device memory pages are
accounted against the same memory cgroup a regular page would be accounted to.
This does simplify migration to and from device memory. This also means that
migration back from device memory to regular memory cannot fail because it
would go above the memory cgroup limit. We might revisit this choice later on
once we get more experience in how device memory is used and its impact on
memory resource control.


Note that device memory can never be pinned, neither by the device driver nor
through GUP, and thus such memory is always freed upon process exit, or, in
the case of shared memory or file backed memory, when the last reference is
dropped.