.. Blame: rjw, 1f88458, 2022-01-06 17:20:42 +0800

L1TF - L1 Terminal Fault
========================

L1 Terminal Fault is a hardware vulnerability which allows unprivileged
speculative access to data which is available in the Level 1 Data Cache
when the page table entry controlling the virtual address, which is used
for the access, has the Present bit cleared or other reserved bits set.

Affected processors
-------------------

This vulnerability affects a wide range of Intel processors. The
vulnerability is not present on:

- Processors from AMD, Centaur and other non-Intel vendors

- Older processor models, where the CPU family is < 6

- A range of Intel ATOM processors (Cedarview, Cloverview, Lincroft,
  Penwell, Pineview, Silvermont, Airmont, Merrifield)

- The Intel XEON PHI family

- Intel processors which have the ARCH_CAP_RDCL_NO bit set in the
  IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is not affected
  by the Meltdown vulnerability either. These CPUs should become
  available by the end of 2018.

Whether a processor is affected or not can be read out from the L1TF
vulnerability file in sysfs. See :ref:`l1tf_sys_info`.
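
As a rough illustration of the last list item, the sketch below decodes an
IA32_ARCH_CAPABILITIES value. It assumes that the MSR index is 0x10a and
that RDCL_NO is bit 0; neither number is given in this document, so check
them against the processor documentation. The function names are
hypothetical, and reading the MSR through ``/dev/cpu/N/msr`` assumes that
the ``msr`` driver is loaded and that the caller is root.

```python
import os
import struct

MSR_IA32_ARCH_CAPABILITIES = 0x10A  # assumed MSR index
ARCH_CAP_RDCL_NO = 1 << 0           # assumed bit position


def rdcl_no_set(arch_capabilities: int) -> bool:
    """Return True if the RDCL_NO bit is set in the given MSR value."""
    return bool(arch_capabilities & ARCH_CAP_RDCL_NO)


def read_arch_capabilities(cpu: int = 0) -> int:
    """Read the MSR via the msr driver; raises OSError if that fails."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr device interprets the file offset as the MSR index.
        raw = os.pread(fd, 8, MSR_IA32_ARCH_CAPABILITIES)
    finally:
        os.close(fd)
    (value,) = struct.unpack("<Q", raw)
    return value
```

For everyday use the sysfs vulnerability file is the simpler check, as it
needs no special privileges.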

Related CVEs
------------

The following CVE entries are related to the L1TF vulnerability:

=============  =================  ==============================
CVE-2018-3615  L1 Terminal Fault  SGX related aspects
CVE-2018-3620  L1 Terminal Fault  OS, SMM related aspects
CVE-2018-3646  L1 Terminal Fault  Virtualization related aspects
=============  =================  ==============================

Problem
-------

If an instruction accesses a virtual address for which the relevant page
table entry (PTE) has the Present bit cleared or other reserved bits set,
then speculative execution ignores the invalid PTE and loads the referenced
data if it is present in the Level 1 Data Cache, as if the page referenced
by the address bits in the PTE was still present and accessible.

While this is a purely speculative mechanism and the instruction will raise
a page fault when it is eventually retired, the mere act of loading the
data and making it available to other speculative instructions opens up the
opportunity for side channel attacks by unprivileged malicious code,
similar to the Meltdown attack.

While Meltdown breaks the user space to kernel space protection, L1TF
allows an attack on any physical memory address in the system, and the
attack works across all protection domains. It allows an attack on SGX and
also works from inside virtual machines because the speculation bypasses
the extended page table (EPT) protection mechanism.


Attack scenarios
----------------

1. Malicious user space
^^^^^^^^^^^^^^^^^^^^^^^

Operating Systems store arbitrary information in the address bits of a
PTE which is marked non-present. This allows a malicious user space
application to attack the physical memory to which these PTEs resolve.
In some cases user space can maliciously influence the information
encoded in the address bits of the PTE, thus making attacks more
deterministic and more practical.

The Linux kernel contains a mitigation for this attack vector, PTE
inversion, which is permanently enabled and has no performance
impact. The kernel ensures that the address bits of PTEs which are not
marked present never point to cacheable physical memory space.

A system with an up to date kernel is protected against attacks from
malicious user space applications.
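
The idea behind PTE inversion can be illustrated with a simplified model.
The sketch below is not the kernel implementation: the bit layout (Present
bit at bit 0, address bits 12 to 51) and the function names are assumptions
chosen for the illustration. It shows that inverting the address field of a
non-present entry turns a small, possibly attacker-influenced value into an
address near the top of the physical address space, while the original
value stays recoverable.

```python
PAGE_PRESENT = 1 << 0                           # assumed Present bit
ADDR_MASK = ((1 << 52) - 1) & ~((1 << 12) - 1)  # assumed address bits 12..51


def store_nonpresent(addr_bits: int, flags: int = 0) -> int:
    """Build a non-present entry whose address field is stored inverted."""
    if flags & PAGE_PRESENT:
        raise ValueError("entry must be non-present")
    if flags & ADDR_MASK:
        raise ValueError("flags must not overlap the address field")
    raw = (addr_bits & ADDR_MASK) | flags
    return raw ^ ADDR_MASK            # invert only the address field


def load_nonpresent(entry: int) -> int:
    """Recover the original address bits from an inverted entry."""
    return (entry ^ ADDR_MASK) & ADDR_MASK
```

In this model ``store_nonpresent(0x1234000)`` yields an entry whose
address field has its upper bits set, and ``load_nonpresent()`` returns
``0x1234000`` again.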

2. Malicious guest in a virtual machine
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The fact that L1TF breaks all domain protections allows malicious guest
OSes, which can control the PTEs directly, and malicious guest user
space applications, which run on an unprotected guest kernel lacking the
PTE inversion mitigation for L1TF, to attack physical host memory.

A special aspect of L1TF in the context of virtualization is symmetric
multi threading (SMT). The Intel implementation of SMT is called
HyperThreading. The fact that Hyperthreads on the affected processors
share the L1 Data Cache (L1D) is important for this. As the flaw only
allows attacks on data which is present in L1D, a malicious guest running
on one Hyperthread can attack the data which is brought into the L1D by
the context which runs on the sibling Hyperthread of the same physical
core. This context can be host OS, host user space or a different guest.

If the processor does not support Extended Page Tables, the attack is
only possible when the hypervisor does not sanitize the content of the
effective (shadow) page tables.

While solutions exist to mitigate these attack vectors fully, these
mitigations are not enabled by default in the Linux kernel because they
can affect performance significantly. The kernel provides several
mechanisms which can be utilized to address the problem depending on the
deployment scenario. The mitigations, their protection scope and impact
are described in the next sections.

The default mitigations and the rationale for choosing them are explained
at the end of this document. See :ref:`default_mitigations`.

.. _l1tf_sys_info:

L1TF system information
-----------------------

The Linux kernel provides a sysfs interface to enumerate the current L1TF
status of the system: whether the system is vulnerable, and which
mitigations are active. The relevant sysfs file is:

  /sys/devices/system/cpu/vulnerabilities/l1tf

The possible values in this file are:

===========================   ===============================
'Not affected'                The processor is not vulnerable
'Mitigation: PTE Inversion'   The host protection is active
===========================   ===============================

If KVM/VMX is enabled and the processor is vulnerable then the following
information is appended to the 'Mitigation: PTE Inversion' part:

- SMT status:

  =====================  ================
  'VMX: SMT vulnerable'  SMT is enabled
  'VMX: SMT disabled'    SMT is disabled
  =====================  ================

- L1D Flush mode:

  ================================  ====================================
  'L1D vulnerable'                  L1D flushing is disabled

  'L1D conditional cache flushes'   L1D flush is conditionally enabled

  'L1D cache flushes'               L1D flush is unconditionally enabled
  ================================  ====================================

The resulting grade of protection is discussed in the following sections.
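
A monitoring script could classify the content of this file along the
lines of the sketch below. The exact separators and the order of the
appended parts are not specified above, so the code only looks for the
documented substrings; the function names and the returned dictionary
layout are hypothetical.

```python
from pathlib import Path

L1TF_FILE = Path("/sys/devices/system/cpu/vulnerabilities/l1tf")


def parse_l1tf_status(text: str) -> dict:
    """Classify the content of the l1tf sysfs file by substring matching."""
    text = text.strip()
    status = {
        "affected": text != "Not affected",
        "pte_inversion": "PTE Inversion" in text,
        "smt": None,         # "enabled", "disabled" or None if not reported
        "l1d_flush": None,   # "conditional", "unconditional", "disabled"
    }
    if "SMT vulnerable" in text:
        status["smt"] = "enabled"
    elif "SMT disabled" in text:
        status["smt"] = "disabled"
    # Test the longer token first because it contains the shorter one.
    if "conditional cache flushes" in text:
        status["l1d_flush"] = "conditional"
    elif "cache flushes" in text:
        status["l1d_flush"] = "unconditional"
    elif "L1D vulnerable" in text:
        status["l1d_flush"] = "disabled"
    return status


def read_l1tf_status(path: Path = L1TF_FILE) -> dict:
    """Read and classify the sysfs file; raises OSError if it is missing."""
    return parse_l1tf_status(path.read_text())
```

Substring matching keeps the sketch tolerant of formatting differences
between kernel versions, at the price of not validating the string.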


Host mitigation mechanism
-------------------------

The kernel is unconditionally protected against L1TF attacks from malicious
user space running on the host.


Guest mitigation mechanisms
---------------------------

.. _l1d_flush:

1. L1D flush on VMENTER
^^^^^^^^^^^^^^^^^^^^^^^

To make sure that a guest cannot attack data which is present in the L1D
the hypervisor flushes the L1D before entering the guest.

Flushing the L1D evicts not only the data which should not be accessed
by a potentially malicious guest, it also flushes the guest
data. Flushing the L1D has a performance impact as the processor has to
bring the flushed guest data back into the L1D. Depending on the
frequency of VMEXIT/VMENTER and the type of computations in the guest,
performance degradation in the range of 1% to 50% has been observed. For
scenarios where guest VMEXIT/VMENTER are rare the performance impact is
minimal. Virtio and mechanisms like posted interrupts are designed to
confine the VMEXITs to a bare minimum, but specific configurations and
application scenarios might still suffer from a high VMEXIT rate.

The kernel provides two L1D flush modes:

- conditional ('cond')
- unconditional ('always')

The conditional mode avoids L1D flushing after VMEXITs which execute
only audited code paths before the corresponding VMENTER. These code
paths have been verified not to expose secrets or other interesting data
to an attacker, but they can leak information about the address space
layout of the hypervisor.

Unconditional mode flushes L1D on all VMENTER invocations and provides
maximum protection. It has a higher overhead than the conditional
mode. The overhead cannot be quantified precisely as it depends on the
workload scenario and the resulting number of VMEXITs.

The general recommendation is to enable L1D flush on VMENTER. The kernel
defaults to conditional mode on affected processors.

**Note** that L1D flush does not prevent the SMT problem because the
sibling thread will also bring back its data into the L1D which makes it
attackable again.

L1D flush can be controlled by the administrator via the kernel command
line and sysfs control files. See :ref:`mitigation_control_command_line`
and :ref:`mitigation_control_kvm`.

.. _guest_confinement:

2. Guest VCPU confinement to dedicated physical cores
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To address the SMT problem, it is possible to make a guest or a group of
guests affine to one or more physical cores. The proper mechanism for
that is to utilize exclusive cpusets to ensure that no other guest or
host tasks can run on these cores.

If only a single guest or related guests run on sibling SMT threads on
the same physical core then they can only attack their own memory and
restricted parts of the host memory.

Host memory is attackable when one of the sibling SMT threads runs in
host OS (hypervisor) context and the other in guest context. The amount
of valuable information from the host OS context depends on the context
which the host OS executes, i.e. interrupts, soft interrupts and kernel
threads. The amount of valuable data from these contexts cannot be
declared as non-interesting for an attacker without deep inspection of
the code.

**Note** that assigning guests to a fixed set of physical cores affects
the ability of the scheduler to do load balancing and might have
negative effects on CPU utilization depending on the hosting
scenario. Disabling SMT might be a viable alternative for particular
scenarios.

For further information about confining guests to a single or to a group
of cores consult the cpusets documentation:

  https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt
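
When building such cpusets, all SMT siblings of a core have to end up in
the same set. A minimal sketch for grouping logical CPUs by physical core
follows. It assumes that the sibling information is exposed in
``cpuN/topology/thread_siblings_list`` below ``/sys/devices/system/cpu``,
which is not described in this document, and the function names are
hypothetical.

```python
from pathlib import Path


def parse_cpulist(text: str) -> set:
    """Parse a kernel CPU list such as '0-3,8' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def physical_cores(sysfs_root: str = "/sys/devices/system/cpu") -> list:
    """Return one sorted list of sibling CPU numbers per physical core."""
    cores = set()
    pattern = "cpu[0-9]*/topology/thread_siblings_list"
    for path in Path(sysfs_root).glob(pattern):
        # Every sibling reports the same set, so the set removes duplicates.
        cores.add(frozenset(parse_cpulist(path.read_text())))
    return sorted(sorted(core) for core in cores)
```

Each returned list is a candidate unit for an exclusive cpuset: either all
of its CPUs are given to a guest, or none of them.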

.. _interrupt_isolation:

3. Interrupt affinity
^^^^^^^^^^^^^^^^^^^^^

Interrupts can be made affine to logical CPUs. This is not universally
true because there are types of interrupts which are truly per CPU
interrupts, e.g. the local timer interrupt. Aside from that, multi queue
devices affine their interrupts to single CPUs or groups of CPUs per
queue without allowing the administrator to control the affinities.

Moving the interrupts which can be affinity controlled away from CPUs
which run untrusted guests reduces the attack vector space.

Whether the interrupts which are affine to CPUs which run untrusted
guests provide interesting data for an attacker depends on the system
configuration and the scenarios which run on the system. While for some
of the interrupts it can be assumed that they won't expose interesting
information beyond exposing hints about the host OS memory layout, there
is no way to make general assumptions.

Interrupt affinity can be controlled by the administrator via the
/proc/irq/$NR/smp_affinity[_list] files. Limited documentation is
available at:

  https://www.kernel.org/doc/Documentation/IRQ-affinity.txt
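
A sketch of how the value for ``smp_affinity_list`` could be computed is
shown below. It assumes that the file accepts a comma separated list of
CPU numbers and ranges such as ``0-1,4-5``; all function names are
hypothetical. Writing the file requires root, and the write can fail for
interrupts whose affinity cannot be changed.

```python
def format_cpulist(cpus) -> str:
    """Format CPU numbers as a CPU list, e.g. {0, 1, 2, 3, 8} -> '0-3,8'."""
    ranges = []
    for cpu in sorted(set(cpus)):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu          # extend the current range
        else:
            ranges.append([cpu, cpu])    # start a new range
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def affinity_avoiding(all_cpus, guest_cpus) -> str:
    """Return the CPU list that keeps an interrupt off the guest CPUs."""
    allowed = set(all_cpus) - set(guest_cpus)
    if not allowed:
        raise ValueError("no CPU left for interrupt handling")
    return format_cpulist(allowed)


def set_irq_affinity(irq: int, cpulist: str) -> None:
    """Write the list to /proc/irq/<irq>/smp_affinity_list (needs root)."""
    with open(f"/proc/irq/{irq}/smp_affinity_list", "w") as f:
        f.write(cpulist)
```

For example, on an eight CPU host where CPUs 2, 3, 6 and 7 run untrusted
guests, ``affinity_avoiding(range(8), {2, 3, 6, 7})`` gives the list to
write for every interrupt that allows it.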

.. _smt_control:

4. SMT control
^^^^^^^^^^^^^^

To prevent the SMT issues of L1TF it might be necessary to disable SMT
completely. Disabling SMT can have a significant performance impact, but
the impact depends on the hosting scenario and the type of workloads.
The impact of disabling SMT also needs to be weighed against the impact
of other mitigation solutions like confining guests to dedicated cores.

The kernel provides a sysfs interface to retrieve the status of SMT and
to control it. It also provides a kernel command line interface to
control SMT.

The kernel command line interface consists of the following options:

=========== ==========================================================
nosmt       Affects the bring up of the secondary CPUs during boot. The
            kernel tries to bring all present CPUs online during the
            boot process. "nosmt" makes sure that from each physical
            core only one, the so-called primary (hyper) thread, is
            activated. Due to a design flaw of Intel processors related
            to Machine Check Exceptions the non-primary siblings have
            to be brought up at least partially and are then shut down
            again.  "nosmt" can be undone via the sysfs interface.

nosmt=force Has the same effect as "nosmt" but it does not allow the
            SMT disable to be undone via the sysfs interface.
=========== ==========================================================

The sysfs interface provides two files:

- /sys/devices/system/cpu/smt/control
- /sys/devices/system/cpu/smt/active

/sys/devices/system/cpu/smt/control:

  This file allows reading the SMT control state and provides the
  ability to disable or (re)enable SMT. The possible states are:

  ==============  ===================================================
  on              SMT is supported by the CPU and enabled. All
                  logical CPUs can be onlined and offlined without
                  restrictions.

  off             SMT is supported by the CPU and disabled. Only
                  the so-called primary SMT threads can be onlined
                  and offlined without restrictions. An attempt to
                  online a non-primary sibling is rejected.

  forceoff        Same as 'off' but the state cannot be controlled.
                  Attempts to write to the control file are rejected.

  notsupported    The processor does not support SMT. It's therefore
                  not affected by the SMT implications of L1TF.
                  Attempts to write to the control file are rejected.
  ==============  ===================================================

  The possible states which can be written into this file to control SMT
  state are:

  - on
  - off
  - forceoff

/sys/devices/system/cpu/smt/active:

  This file reports whether SMT is enabled and active, i.e. if on any
  physical core two or more sibling threads are online.

SMT control is also possible at boot time via the l1tf kernel command
line parameter in combination with L1D flush control. See
:ref:`mitigation_control_command_line`.
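
The rules from the state table can be turned into a small check before
touching the control file, as in the following sketch. The function names
are hypothetical, and the code only models the states listed above; a
kernel may report further states, which this sketch treats as not
writable.

```python
from pathlib import Path

SMT_DIR = Path("/sys/devices/system/cpu/smt")
WRITABLE_STATES = {"on", "off"}             # states that accept writes
VALID_REQUESTS = {"on", "off", "forceoff"}  # values that may be written


def smt_write_allowed(current: str, requested: str) -> bool:
    """Apply the rules from the state table to a planned write."""
    if requested not in VALID_REQUESTS:
        return False
    return current.strip() in WRITABLE_STATES


def disable_smt(smt_dir: Path = SMT_DIR) -> bool:
    """Write 'off' if the current state permits it; needs root."""
    control = Path(smt_dir) / "control"
    if not smt_write_allowed(control.read_text(), "off"):
        return False
    control.write_text("off")
    return True
```

Checking the state first avoids a rejected write when the system was
booted with "nosmt=force" or does not support SMT at all.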

5. Disabling EPT
^^^^^^^^^^^^^^^^

Disabling EPT for virtual machines provides full mitigation for L1TF even
with SMT enabled, because the effective page tables for guests are
managed and sanitized by the hypervisor. However, disabling EPT has a
significant performance impact, especially when the Meltdown mitigation
KPTI is enabled.

EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.

There is ongoing research and development for new mitigation mechanisms to
address the performance impact of disabling SMT or EPT.

.. _mitigation_control_command_line:

Mitigation control on the kernel command line
---------------------------------------------

The kernel command line allows the L1TF mitigations to be controlled at
boot time with the option "l1tf=". The valid arguments for this option are:

============  =============================================================
full          Provides all available mitigations for the L1TF
              vulnerability. Disables SMT and enables all mitigations in
              the hypervisors, i.e. unconditional L1D flushing.

              SMT control and L1D flush control via the sysfs interface
              is still possible after boot.  Hypervisors will issue a
              warning when the first VM is started in a potentially
              insecure configuration, i.e. SMT enabled or L1D flush
              disabled.

full,force    Same as 'full', but disables SMT and L1D flush runtime
              control. Implies the 'nosmt=force' command line option.
              (i.e. sysfs control of SMT is disabled.)

flush         Leaves SMT enabled and enables the default hypervisor
              mitigation, i.e. conditional L1D flushing.

              SMT control and L1D flush control via the sysfs interface
              is still possible after boot.  Hypervisors will issue a
              warning when the first VM is started in a potentially
              insecure configuration, i.e. SMT enabled or L1D flush
              disabled.

flush,nosmt   Disables SMT and enables the default hypervisor mitigation,
              i.e. conditional L1D flushing.

              SMT control and L1D flush control via the sysfs interface
              is still possible after boot.  Hypervisors will issue a
              warning when the first VM is started in a potentially
              insecure configuration, i.e. SMT enabled or L1D flush
              disabled.

flush,nowarn  Same as 'flush', but hypervisors will not warn when a VM is
              started in a potentially insecure configuration.

off           Disables hypervisor mitigations and doesn't emit any
              warnings.
              It also drops the swap size and available RAM limit
              restrictions on both hypervisor and bare metal.
============  =============================================================

The default is 'flush'. For details about L1D flushing see :ref:`l1d_flush`.
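
To audit a fleet of machines it can help to extract the effective "l1tf="
argument from a command line string, for instance the content of
``/proc/cmdline``. The sketch below encodes only the two properties that
the table states for every argument (whether SMT is disabled and which
L1D flush mode results). The names are hypothetical, treating the last
occurrence as the effective one is an assumption, and rejecting unknown
arguments is a choice made for this sketch rather than kernel behaviour.

```python
DEFAULT_L1TF = "flush"

# argument: (SMT disabled at boot, L1D flush mode)
L1TF_OPTIONS = {
    "full":         (True,  "always"),
    "full,force":   (True,  "always"),
    "flush":        (False, "cond"),
    "flush,nosmt":  (True,  "cond"),
    "flush,nowarn": (False, "cond"),
    "off":          (False, "never"),
}


def l1tf_option(cmdline: str) -> str:
    """Return the effective l1tf= argument from a kernel command line."""
    option = DEFAULT_L1TF
    for word in cmdline.split():
        if word.startswith("l1tf="):
            option = word[len("l1tf="):]  # assumption: last occurrence wins
    if option not in L1TF_OPTIONS:
        raise ValueError(f"unknown l1tf argument: {option!r}")
    return option
```

``L1TF_OPTIONS[l1tf_option(cmdline)]`` then yields the expected SMT and
flush settings, which can be compared with the sysfs status file.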


.. _mitigation_control_kvm:

Mitigation control for KVM - module parameter
-------------------------------------------------------------

The KVM hypervisor mitigation mechanism, flushing the L1D cache when
entering a guest, can be controlled with a module parameter.

The option/parameter is "kvm-intel.vmentry_l1d_flush=". It takes the
following arguments:

============  ==============================================================
always        L1D cache flush on every VMENTER.

cond          Flush L1D on VMENTER only when the code between VMEXIT and
              VMENTER can leak host memory which is considered
              interesting for an attacker. This still can leak host memory
              which allows e.g. to determine the host's address space layout.

never         Disables the mitigation.
============  ==============================================================

The parameter can be provided on the kernel command line, as a module
parameter when loading the modules, and modified at runtime via the sysfs
file:

  /sys/module/kvm_intel/parameters/vmentry_l1d_flush

The default is 'cond'. If 'l1tf=full,force' is given on the kernel command
line, then 'always' is enforced and the kvm-intel.vmentry_l1d_flush
module parameter is ignored and writes to the sysfs file are rejected.
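
The interaction between "l1tf=" and the module parameter can be modelled
roughly as follows. Only the 'full,force' rule and the 'cond' default are
stated explicitly above; that an explicitly given module parameter
otherwise takes precedence over the mode implied by "l1tf=" is an
assumption of this sketch, and the function name is hypothetical.

```python
from typing import Optional

FLUSH_MODES = ("always", "cond", "never")


def effective_flush_mode(l1tf: str = "flush",
                         module_param: Optional[str] = None) -> str:
    """Combine l1tf= and kvm-intel.vmentry_l1d_flush= into one mode."""
    if module_param is not None and module_param not in FLUSH_MODES:
        raise ValueError(f"invalid vmentry_l1d_flush value: {module_param!r}")
    if l1tf == "full,force":
        return "always"          # the module parameter is ignored
    if module_param is not None:
        return module_param      # assumption: explicit parameter wins
    if l1tf == "full":
        return "always"
    if l1tf == "off":
        return "never"
    return "cond"                # flush, flush,nosmt, flush,nowarn
```

The value read from the sysfs file at runtime remains the reference; the
model is only meant for reasoning about a planned configuration.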

.. _mitigation_selection:

Mitigation selection guide
--------------------------

1. No virtualization in use
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The system is protected by the kernel unconditionally and no further
action is required.

2. Virtualization with trusted guests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If the guest comes from a trusted source and the guest OS kernel is
guaranteed to have the L1TF mitigations in place, the system is fully
protected against L1TF and no further action is required.

To avoid the overhead of the default L1D flushing on VMENTER the
administrator can disable the flushing via the kernel command line and
sysfs control files. See :ref:`mitigation_control_command_line` and
:ref:`mitigation_control_kvm`.


3. Virtualization with untrusted guests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

3.1. SMT not supported or disabled
""""""""""""""""""""""""""""""""""

If SMT is not supported by the processor or disabled in the BIOS or by
the kernel, it's only required to enforce L1D flushing on VMENTER.

Conditional L1D flushing is the default behaviour and can be tuned. See
:ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.

3.2. EPT not supported or disabled
""""""""""""""""""""""""""""""""""

If EPT is not supported by the processor or disabled in the hypervisor,
the system is fully protected. SMT can stay enabled and L1D flushing on
VMENTER is not required.

EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.

3.3. SMT and EPT supported and active
"""""""""""""""""""""""""""""""""""""

If SMT and EPT are supported and active then various degrees of
mitigations can be employed:

- L1D flushing on VMENTER:

  L1D flushing on VMENTER is the minimal protection requirement, but it
  is only potent in combination with other mitigation methods.

  Conditional L1D flushing is the default behaviour and can be tuned. See
  :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.

- Guest confinement:

  Confinement of guests to a single or a group of physical cores which
  are not running any other processes can reduce the attack surface
  significantly, but interrupts, soft interrupts and kernel threads can
  still expose valuable data to a potential attacker. See
  :ref:`guest_confinement`.

- Interrupt isolation:

  Isolating the guest CPUs from interrupts can reduce the attack surface
  further, but still allows a malicious guest to explore a limited amount
  of host physical memory. This can at least be used to gain knowledge
  about the host address space layout. The interrupts which have a fixed
  affinity to the CPUs which run the untrusted guests can, depending on
  the scenario, still trigger soft interrupts and schedule kernel threads
  which might expose valuable information. See
  :ref:`interrupt_isolation`.

The above three mitigation methods combined can provide protection to a
certain degree, but the risk of the remaining attack surface has to be
carefully analyzed. For full protection the following methods are
available:

- Disabling SMT:

  Disabling SMT and enforcing the L1D flushing provides the maximum
  amount of protection. This mitigation does not depend on any of the
  above mitigation methods.

  SMT control and L1D flushing can be tuned by the command line
  parameters 'nosmt', 'l1tf', 'kvm-intel.vmentry_l1d_flush' and at run
  time with the matching sysfs control files. See :ref:`smt_control`,
  :ref:`mitigation_control_command_line` and
  :ref:`mitigation_control_kvm`.

- Disabling EPT:

  Disabling EPT provides the maximum amount of protection as well. It
  does not depend on any of the above mitigation methods. SMT can stay
  enabled and L1D flushing is not required, but the performance impact is
  significant.

  EPT can be disabled in the hypervisor via the 'kvm-intel.ept'
  parameter.
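
Cases 1 to 3.3 of this guide can be condensed into a small decision
function. The sketch below is only a reading aid: the function name, the
parameter names and the measure strings are hypothetical, and it covers
the options for full protection only, not the partial measures (guest
confinement, interrupt isolation) discussed above.

```python
def full_protection_options(virtualization: bool,
                            trusted_guests: bool = True,
                            smt_active: bool = False,
                            ept_active: bool = True) -> list:
    """Return alternative sets of measures; [] means no action is needed."""
    if not virtualization:
        return []                               # case 1
    if trusted_guests:
        return []                               # case 2
    if not ept_active:
        return []                               # case 3.2
    if not smt_active:
        return [["L1D flush on VMENTER"]]       # case 3.1
    return [                                    # case 3.3
        ["L1D flush on VMENTER", "disable SMT"],
        ["disable EPT"],
    ]
```

Each inner list is one self-contained alternative; the caller picks the
one whose performance cost is acceptable for the deployment.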

3.4. Nested virtual machines
""""""""""""""""""""""""""""

When nested virtualization is in use, three operating systems are involved:
the bare metal hypervisor, the nested hypervisor and the nested virtual
machine.  VMENTER operations from the nested hypervisor into the nested
guest will always be processed by the bare metal hypervisor. If KVM is the
bare metal hypervisor it will:

- Flush the L1D cache on every switch from the nested hypervisor to the
  nested virtual machine, so that the nested hypervisor's secrets are not
  exposed to the nested virtual machine;

- Flush the L1D cache on every switch from the nested virtual machine to
  the nested hypervisor; this is a complex operation, and flushing the L1D
  cache prevents the bare metal hypervisor's secrets from being exposed to
  the nested virtual machine;

- Instruct the nested hypervisor to not perform any L1D cache flush. This
  is an optimization to avoid double L1D flushing.


.. _default_mitigations:

Default mitigations
-------------------

The kernel default mitigations for vulnerable processors are:

- PTE inversion to protect against malicious user space. This is done
  unconditionally and cannot be controlled. The swap storage is limited
  to ~16TB.

- L1D conditional flushing on VMENTER when EPT is enabled for
  a guest.

The kernel does not by default enforce the disabling of SMT, which leaves
SMT systems vulnerable when running untrusted guests with EPT enabled.

The rationale for this choice is:

- Force disabling SMT can break existing setups, especially with
  unattended updates.

- If regular users run untrusted guests on their machine, then L1TF is
  just an add on to other malware which might be embedded in an untrusted
  guest, e.g. spam-bots or attacks on the local network.

  There is no technical way to prevent a user from running untrusted code
  on their machines blindly.

- It's technically extremely unlikely and from today's knowledge even
  impossible that L1TF can be exploited via the most popular attack
  mechanisms like JavaScript because these mechanisms have no way to
  control PTEs. If this were possible and no other mitigation were
  available, then the default might be different.

- The administrators of cloud and hosting setups have to carefully
  analyze the risk for their scenarios and make the appropriate
  mitigation choices, which might even vary across their deployed
  machines and also result in other changes of their overall setup.
  There is no way for the kernel to provide a sensible default for these
  kinds of scenarios.