| ================= | 
 | KVM VCPU Requests | 
 | ================= | 
 |  | 
 | Overview | 
 | ======== | 
 |  | 
 | KVM supports an internal API enabling threads to request a VCPU thread to | 
 | perform some activity.  For example, a thread may request a VCPU to flush | 
 | its TLB with a VCPU request.  The API consists of the following functions:: | 
 |  | 
 |   /* Check if any requests are pending for VCPU @vcpu. */ | 
 |   bool kvm_request_pending(struct kvm_vcpu *vcpu); | 
 |  | 
 |   /* Check if VCPU @vcpu has request @req pending. */ | 
 |   bool kvm_test_request(int req, struct kvm_vcpu *vcpu); | 
 |  | 
 |   /* Clear request @req for VCPU @vcpu. */ | 
 |   void kvm_clear_request(int req, struct kvm_vcpu *vcpu); | 
 |  | 
 |   /* | 
 |    * Check if VCPU @vcpu has request @req pending. When the request is | 
 |    * pending it will be cleared and a memory barrier, which pairs with | 
 |    * another in kvm_make_request(), will be issued. | 
 |    */ | 
 |   bool kvm_check_request(int req, struct kvm_vcpu *vcpu); | 
 |  | 
 |   /* | 
 |    * Make request @req of VCPU @vcpu. Issues a memory barrier, which pairs | 
 |    * with another in kvm_check_request(), prior to setting the request. | 
 |    */ | 
 |   void kvm_make_request(int req, struct kvm_vcpu *vcpu); | 
 |  | 
 |   /* Make request @req of all VCPUs of the VM with struct kvm @kvm. */ | 
 |   bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req); | 
 |  | 
 | Typically a requester wants the VCPU to perform the activity as soon | 
 | as possible after making the request.  This means most requests | 
 | (kvm_make_request() calls) are followed by a call to kvm_vcpu_kick(), | 
 | and kvm_make_all_cpus_request() has the kicking of all VCPUs built | 
 | into it. | 
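
The make-then-kick calling pattern can be sketched in user space.  The
following is a minimal model, not kernel code: ``struct demo_vcpu``, the
``demo_*`` functions and ``DEMO_REQ_TLB_FLUSH`` are hypothetical stand-ins
for ``struct kvm_vcpu`` and the API listed above.  C11 atomics (sequentially
consistent by default) stand in for the kernel's bitops and barriers, and
the kick is reduced to a counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical request number, for illustration only. */
#define DEMO_REQ_TLB_FLUSH 0u

/* Hypothetical stand-in for struct kvm_vcpu. */
struct demo_vcpu {
	atomic_ulong requests;	/* one bit per request number */
	unsigned int kicks;	/* counts kicks; a real kick may send an IPI */
};

/* Models kvm_request_pending(): is any request bit set? */
static bool demo_request_pending(struct demo_vcpu *vcpu)
{
	return atomic_load(&vcpu->requests) != 0;
}

/* Models kvm_make_request(): set the bit for request number @req. */
static void demo_make_request(unsigned int req, struct demo_vcpu *vcpu)
{
	assert(req < 8 * sizeof(unsigned long));
	atomic_fetch_or(&vcpu->requests, 1UL << req);
}

/* Models kvm_check_request(): test the bit and clear it if it was set. */
static bool demo_check_request(unsigned int req, struct demo_vcpu *vcpu)
{
	unsigned long bit = 1UL << req;

	if (!(atomic_load(&vcpu->requests) & bit))
		return false;
	atomic_fetch_and(&vcpu->requests, ~bit);
	return true;
}

/* Models kvm_vcpu_kick(); here it only records that a kick happened. */
static void demo_vcpu_kick(struct demo_vcpu *vcpu)
{
	vcpu->kicks++;
}

/* Requester side: make the request, then kick so it is handled soon. */
static void demo_request_tlb_flush(struct demo_vcpu *vcpu)
{
	demo_make_request(DEMO_REQ_TLB_FLUSH, vcpu);
	demo_vcpu_kick(vcpu);
}
```

The receiving side would call ``demo_check_request()`` from its run loop and
perform the activity when it returns true.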
 |  | 
 | VCPU Kicks | 
 | ---------- | 
 |  | 
 | The goal of a VCPU kick is to bring a VCPU thread out of guest mode in | 
 | order to perform some KVM maintenance.  To do so, an IPI is sent, forcing | 
 | a guest mode exit.  However, a VCPU thread may not be in guest mode at the | 
 | time of the kick.  Therefore, depending on the mode and state of the VCPU | 
 | thread, there are two other actions a kick may take.  All three actions | 
 | are listed below: | 
 |  | 
 | 1) Send an IPI.  This forces a guest mode exit. | 
 | 2) Wake a sleeping VCPU.  Sleeping VCPUs are VCPU threads outside guest | 
 |    mode that wait on waitqueues.  Waking them removes the threads from | 
 |    the waitqueues, allowing the threads to run again.  This behavior | 
 |    may be suppressed; see KVM_REQUEST_NO_WAKEUP below. | 
 | 3) Nothing.  When the VCPU is not in guest mode and the VCPU thread is not | 
 |    sleeping, then there is nothing to do. | 
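
The choice among the three actions can be written as a small decision
function.  This is a simplified sketch under stated assumptions: the enum,
the function name and the boolean parameters are hypothetical, and the real
kick derives this information from the VCPU's mode and waitqueue state
rather than taking it as arguments.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; they mirror the three actions listed above. */
enum demo_kick_action {
	DEMO_KICK_SEND_IPI,
	DEMO_KICK_WAKE_UP,
	DEMO_KICK_NOTHING,
};

/*
 * Pick the action for a kick.  @in_guest_mode and @sleeping describe the
 * target VCPU thread; @no_wakeup models a request carrying
 * KVM_REQUEST_NO_WAKEUP, which suppresses the wake-up.
 */
static enum demo_kick_action demo_pick_kick_action(bool in_guest_mode,
						   bool sleeping,
						   bool no_wakeup)
{
	/* A thread in guest mode is not waiting on a waitqueue. */
	assert(!(in_guest_mode && sleeping));

	if (in_guest_mode)
		return DEMO_KICK_SEND_IPI;
	if (sleeping && !no_wakeup)
		return DEMO_KICK_WAKE_UP;
	return DEMO_KICK_NOTHING;
}
```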
 |  | 
 | VCPU Mode | 
 | --------- | 
 |  | 
 | VCPUs have a mode state, ``vcpu->mode``, that is used to track whether the | 
 | VCPU is running in guest mode or not, as well as some specific | 
 | outside guest mode states.  The architecture may use ``vcpu->mode`` to | 
 | ensure VCPU requests are seen by VCPUs (see "Ensuring Requests Are Seen"), | 
 | as well as to avoid sending unnecessary IPIs (see "IPI Reduction"), and | 
 | even to ensure IPI acknowledgements are waited upon (see "Waiting for | 
 | Acknowledgements").  The following modes are defined: | 
 |  | 
 | OUTSIDE_GUEST_MODE | 
 |  | 
 |   The VCPU thread is outside guest mode. | 
 |  | 
 | IN_GUEST_MODE | 
 |  | 
 |   The VCPU thread is in guest mode. | 
 |  | 
 | EXITING_GUEST_MODE | 
 |  | 
 |   The VCPU thread is transitioning from IN_GUEST_MODE to | 
 |   OUTSIDE_GUEST_MODE. | 
 |  | 
 | READING_SHADOW_PAGE_TABLES | 
 |  | 
 |   The VCPU thread is outside guest mode, but it wants the sender of | 
 |   certain VCPU requests, namely KVM_REQ_TLB_FLUSH, to wait until the VCPU | 
 |   thread is done reading the page tables. | 
 |  | 
 | VCPU Request Internals | 
 | ====================== | 
 |  | 
 | VCPU requests are simply bit indices of the ``vcpu->requests`` bitmap. | 
 | This means general bitops, like those documented in [atomic-ops]_, could | 
 | also be used, e.g. :: | 
 |  | 
 |   clear_bit(KVM_REQ_UNHALT & KVM_REQUEST_MASK, &vcpu->requests); | 
 |  | 
 | However, VCPU request users should refrain from doing so, as it would | 
 | break the abstraction.  The first 8 bits are reserved for | 
 | architecture-independent requests; all additional bits are available for | 
 | architecture-dependent requests. | 
 |  | 
 | Architecture Independent Requests | 
 | --------------------------------- | 
 |  | 
 | KVM_REQ_TLB_FLUSH | 
 |  | 
 |   KVM's common MMU notifier may need to flush all of a guest's TLB | 
 |   entries, calling kvm_flush_remote_tlbs() to do so.  Architectures that | 
 |   choose to use the common kvm_flush_remote_tlbs() implementation will | 
 |   need to handle this VCPU request. | 
 |  | 
 | KVM_REQ_MMU_RELOAD | 
 |  | 
 |   When shadow page tables are used and memory slots are removed, it's | 
 |   necessary to tell each VCPU to completely refresh the tables.  This | 
 |   request is used for that. | 
 |  | 
 | KVM_REQ_PENDING_TIMER | 
 |  | 
 |   This request may be made from a timer handler run on the host on behalf | 
 |   of a VCPU.  It informs the VCPU thread to inject a timer interrupt. | 
 |  | 
 | KVM_REQ_UNHALT | 
 |  | 
 |   This request may be made from the KVM common function kvm_vcpu_block(), | 
 |   which is used to emulate an instruction that causes a CPU to halt until | 
 |   one of an architecture-specific set of events and/or interrupts is | 
 |   received (determined by checking kvm_arch_vcpu_runnable()).  When that | 
 |   event or interrupt arrives, kvm_vcpu_block() makes the request.  This is | 
 |   in contrast to when kvm_vcpu_block() returns for any other reason, | 
 |   such as a pending signal, which does not indicate the VCPU's halt | 
 |   emulation should stop, and therefore does not make the request. | 
 |  | 
 | KVM_REQUEST_MASK | 
 | ---------------- | 
 |  | 
 | VCPU requests should be masked by KVM_REQUEST_MASK before using them with | 
 | bitops.  This is because only the lower 8 bits are used to represent the | 
 | request's number.  The upper bits are used as flags.  Currently only two | 
 | flags are defined. | 
 |  | 
 | VCPU Request Flags | 
 | ------------------ | 
 |  | 
 | KVM_REQUEST_NO_WAKEUP | 
 |  | 
 |   This flag is applied to requests that only need immediate attention | 
 |   from VCPUs running in guest mode.  That is, sleeping VCPUs do not need | 
 |   to be awakened for these requests.  Sleeping VCPUs will handle the | 
 |   requests when they are awakened later for some other reason. | 
 |  | 
 | KVM_REQUEST_WAIT | 
 |  | 
 |   When requests with this flag are made with kvm_make_all_cpus_request(), | 
 |   the caller will wait for each VCPU to acknowledge its IPI before | 
 |   proceeding.  This flag only applies to VCPUs that would receive IPIs. | 
 |   If, for example, the VCPU is sleeping, so no IPI is necessary, then | 
 |   the requesting thread does not wait.  This means that this flag may be | 
 |   safely combined with KVM_REQUEST_NO_WAKEUP.  See "Waiting for | 
 |   Acknowledgements" for more information about requests with | 
 |   KVM_REQUEST_WAIT. | 
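
The split between request number and flags can be illustrated with a few
helpers.  This is a sketch, assuming the request number sits in bits 0-7 as
stated above; the ``DEMO_*`` names, the flag bit positions (8 and 9) and the
example request are assumptions made for illustration and should be checked
against the kernel headers before relying on them.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Assumed layout, for illustration: request number in bits 0-7, flags
 * above.  The exact flag bit positions are an assumption of this sketch.
 */
#define DEMO_REQUEST_MASK	0xffu
#define DEMO_REQUEST_NO_WAKEUP	(1u << 8)
#define DEMO_REQUEST_WAIT	(1u << 9)

/* A hypothetical request: number 0, carrying both flags. */
#define DEMO_REQ_EXAMPLE \
	(0u | DEMO_REQUEST_WAIT | DEMO_REQUEST_NO_WAKEUP)

/* Flags must never overlap the bits holding the request number. */
static_assert(((DEMO_REQUEST_NO_WAKEUP | DEMO_REQUEST_WAIT) &
	       DEMO_REQUEST_MASK) == 0, "flags overlap the request number");

/* Strip the flags, leaving the bit index to use with bitops. */
static unsigned int demo_request_nr(unsigned int req)
{
	return req & DEMO_REQUEST_MASK;
}

/* Does a kick for @req have to wake a sleeping VCPU? */
static bool demo_request_wakes(unsigned int req)
{
	return !(req & DEMO_REQUEST_NO_WAKEUP);
}

/* Does the requester wait for IPI acknowledgements for @req? */
static bool demo_request_waits(unsigned int req)
{
	return (req & DEMO_REQUEST_WAIT) != 0;
}
```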
 |  | 
 | VCPU Requests with Associated State | 
 | =================================== | 
 |  | 
 | Requesters that want the receiving VCPU to handle new state need to ensure | 
 | the newly written state is observable to the receiving VCPU thread's CPU | 
 | by the time it observes the request.  This means a write memory barrier | 
 | must be inserted after writing the new state and before setting the VCPU | 
 | request bit.  Additionally, on the receiving VCPU thread's side, a | 
 | corresponding read barrier must be inserted after reading the request bit | 
 | and before proceeding to read the new state associated with it.  See | 
 | scenario 3, Message and Flag, of [lwn-mb]_ and the kernel documentation | 
 | [memory-barriers]_. | 
 |  | 
 | The pair of functions, kvm_check_request() and kvm_make_request(), provide | 
 | the memory barriers, allowing this requirement to be handled internally by | 
 | the API. | 
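
The ordering requirement is the classic message-and-flag pattern.  The
sketch below models it in user space with C11 atomics, under stated
assumptions: ``struct demo_vcpu``, ``demo_publish()`` and ``demo_consume()``
are hypothetical names, and the release and acquire fences stand in for the
kernel's write and read barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical VCPU with one request bit and its associated state. */
struct demo_vcpu {
	int new_state;		/* payload written by the requester */
	atomic_uint requests;	/* bit 0: "new_state is ready" */
};

/* Requester: write the state, then a barrier, then set the request bit. */
static void demo_publish(struct demo_vcpu *vcpu, int state)
{
	vcpu->new_state = state;
	atomic_thread_fence(memory_order_release);	/* write barrier */
	atomic_fetch_or_explicit(&vcpu->requests, 1u, memory_order_relaxed);
}

/*
 * Receiver: read and clear the request bit, then a barrier, then read the
 * state.  Returns true and fills @state only if the request was pending.
 */
static bool demo_consume(struct demo_vcpu *vcpu, int *state)
{
	unsigned int old;

	assert(state);
	old = atomic_fetch_and_explicit(&vcpu->requests, ~1u,
					memory_order_relaxed);
	if (!(old & 1u))
		return false;
	atomic_thread_fence(memory_order_acquire);	/* read barrier */
	*state = vcpu->new_state;
	return true;
}
```

If either fence were dropped, a receiver on another CPU could observe the
request bit and still read a stale ``new_state``.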
 |  | 
 | Ensuring Requests Are Seen | 
 | ========================== | 
 |  | 
 | When making requests to VCPUs, we want to avoid the receiving VCPU | 
 | executing in guest mode for an arbitrarily long time without handling the | 
 | request.  We can be sure this won't happen as long as we ensure the VCPU | 
 | thread checks kvm_request_pending() before entering guest mode and that a | 
 | kick will send an IPI to force an exit from guest mode when necessary. | 
 | Extra care must be taken to cover the period after the VCPU thread's last | 
 | kvm_request_pending() check and before it has entered guest mode, as kick | 
 | IPIs will only trigger guest mode exits for VCPU threads that are in guest | 
 | mode or at least have already disabled interrupts in order to prepare to | 
 | enter guest mode.  This means that an optimized implementation (see "IPI | 
 | Reduction") must be certain when it's safe to not send the IPI.  One | 
 | solution, which all architectures except s390 apply, is to: | 
 |  | 
 | - set ``vcpu->mode`` to IN_GUEST_MODE between disabling the interrupts and | 
 |   the last kvm_request_pending() check; | 
 | - enable interrupts atomically when entering the guest. | 
 |  | 
 | This solution also requires memory barriers to be placed carefully in both | 
 | the requesting thread and the receiving VCPU.  With the memory barriers we | 
 | can exclude the possibility of a VCPU thread observing | 
 | !kvm_request_pending() on its last check and then not receiving an IPI for | 
 | the next request made of it, even if the request is made immediately after | 
 | the check.  This is done by way of the Dekker memory barrier pattern | 
 | (scenario 10 of [lwn-mb]_).  As the Dekker pattern requires two variables, | 
 | this solution pairs ``vcpu->mode`` with ``vcpu->requests``.  Substituting | 
 | them into the pattern gives:: | 
 |  | 
 |   CPU1                                    CPU2 | 
 |   =================                       ================= | 
 |   local_irq_disable(); | 
 |   WRITE_ONCE(vcpu->mode, IN_GUEST_MODE);  kvm_make_request(REQ, vcpu); | 
 |   smp_mb();                               smp_mb(); | 
 |   if (kvm_request_pending(vcpu)) {        if (READ_ONCE(vcpu->mode) == | 
 |                                               IN_GUEST_MODE) { | 
 |       ...abort guest entry...                 ...send IPI... | 
 |   }                                       } | 
 |  | 
 | As stated above, the IPI is only useful for VCPU threads in guest mode or | 
 | that have already disabled interrupts.  This is why this specific case of | 
 | the Dekker pattern has been extended to disable interrupts before setting | 
 | ``vcpu->mode`` to IN_GUEST_MODE.  WRITE_ONCE() and READ_ONCE() are used to | 
 | pedantically implement the memory barrier pattern, guaranteeing the | 
 | compiler doesn't interfere with ``vcpu->mode``'s carefully planned | 
 | accesses. | 
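
The two columns of the pattern can also be modelled as a pair of user-space
functions.  This is a sketch only: the ``demo_*`` names are hypothetical,
sequentially consistent fences stand in for smp_mb(), relaxed atomic
accesses stand in for WRITE_ONCE() and READ_ONCE(), and interrupt disabling
has no user-space equivalent, so it is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum demo_mode { DEMO_OUTSIDE_GUEST_MODE, DEMO_IN_GUEST_MODE };

struct demo_vcpu {
	atomic_int mode;
	atomic_uint requests;
};

/*
 * CPU1 side (VCPU thread); in the real pattern interrupts are already
 * disabled here.  Returns true if guest entry must be aborted.
 */
static bool demo_vcpu_enter(struct demo_vcpu *vcpu)
{
	atomic_store_explicit(&vcpu->mode, DEMO_IN_GUEST_MODE,
			      memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() */
	return atomic_load_explicit(&vcpu->requests,
				    memory_order_relaxed) != 0;
}

/* CPU2 side (requester).  Returns true if an IPI must be sent. */
static bool demo_make_request(struct demo_vcpu *vcpu, unsigned int req_nr)
{
	assert(req_nr < 8 * sizeof(unsigned int));
	atomic_fetch_or_explicit(&vcpu->requests, 1u << req_nr,
				 memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() */
	return atomic_load_explicit(&vcpu->mode, memory_order_relaxed) ==
	       DEMO_IN_GUEST_MODE;
}
```

Whatever the interleaving of the two functions, the fences ensure that at
least one of them returns true: either the VCPU thread sees the request and
aborts its entry, or the requester sees IN_GUEST_MODE and sends the IPI.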
 |  | 
 | IPI Reduction | 
 | ------------- | 
 |  | 
 | As only one IPI is needed to get a VCPU to check for any/all requests, | 
 | IPIs may be coalesced.  This is easily done by having the first | 
 | IPI-sending kick also change the VCPU mode to something other than | 
 | IN_GUEST_MODE.  The transitional state, EXITING_GUEST_MODE, is used for | 
 | this purpose. | 
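
One way to picture the coalescing is an atomic compare-and-exchange on the
mode.  The following user-space sketch assumes that approach; the
``demo_*`` names are hypothetical and C11's
atomic_compare_exchange_strong() stands in for whatever primitive the
kernel uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum demo_mode {
	DEMO_OUTSIDE_GUEST_MODE,
	DEMO_IN_GUEST_MODE,
	DEMO_EXITING_GUEST_MODE,
};

/*
 * Returns true if this kick must send the IPI.  Only the kick that moves
 * @mode from IN_GUEST_MODE to EXITING_GUEST_MODE does; later kicks see
 * EXITING_GUEST_MODE and skip the IPI.
 */
static bool demo_kick_needs_ipi(atomic_int *mode)
{
	int expected = DEMO_IN_GUEST_MODE;

	assert(mode);
	return atomic_compare_exchange_strong(mode, &expected,
					      DEMO_EXITING_GUEST_MODE);
}
```

Two back-to-back kicks of a VCPU in guest mode therefore produce a single
IPI, and a kick of a VCPU outside guest mode leaves the mode unchanged.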
 |  | 
 | Waiting for Acknowledgements | 
 | ---------------------------- | 
 |  | 
 | Some requests, those with the KVM_REQUEST_WAIT flag set, require IPIs to | 
 | be sent, and the acknowledgements to be waited upon, even when the target | 
 | VCPU threads are in modes other than IN_GUEST_MODE.  For example, one case | 
 | is when a target VCPU thread is in READING_SHADOW_PAGE_TABLES mode, which | 
 | is set after disabling interrupts.  To support these cases, the | 
 | KVM_REQUEST_WAIT flag changes the condition for sending an IPI from | 
 | checking that the VCPU is IN_GUEST_MODE to checking that it is not | 
 | OUTSIDE_GUEST_MODE. | 
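
The changed condition can be expressed as a small predicate.  This is a
sketch of the rule as described here, not the kernel's implementation: the
``demo_*`` names are hypothetical, the flag's bit position is an assumption,
and the mode is passed in as a plain value.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_mode {
	DEMO_OUTSIDE_GUEST_MODE,
	DEMO_IN_GUEST_MODE,
	DEMO_EXITING_GUEST_MODE,
	DEMO_READING_SHADOW_PAGE_TABLES,
};

/* Assumed flag bit, for illustration only. */
#define DEMO_REQUEST_WAIT (1u << 9)

/* Decide whether request @req needs an IPI, given the VCPU's @mode. */
static bool demo_request_needs_ipi(enum demo_mode mode, unsigned int req)
{
	assert(mode <= DEMO_READING_SHADOW_PAGE_TABLES);

	/* With the wait flag, any mode but OUTSIDE_GUEST_MODE gets an IPI. */
	if (req & DEMO_REQUEST_WAIT)
		return mode != DEMO_OUTSIDE_GUEST_MODE;

	/* Without it, only a VCPU in guest mode gets one. */
	return mode == DEMO_IN_GUEST_MODE;
}
```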
 |  | 
 | Request-less VCPU Kicks | 
 | ----------------------- | 
 |  | 
 | Because the determination of whether or not to send an IPI depends on the | 
 | two-variable Dekker memory barrier pattern, it's clear that | 
 | request-less VCPU kicks are almost never correct.  Without the assurance | 
 | that a non-IPI generating kick will still result in an action by the | 
 | receiving VCPU, as the final kvm_request_pending() check does for | 
 | request-accompanying kicks, the kick may not do anything useful at | 
 | all.  If, for instance, a request-less kick was made to a VCPU that was | 
 | just about to set its mode to IN_GUEST_MODE, meaning no IPI is sent, then | 
 | the VCPU thread may continue its entry without actually having done | 
 | whatever it was the kick was meant to initiate. | 
 |  | 
 | One exception is x86's posted interrupt mechanism.  In this case, however, | 
 | even the request-less VCPU kick is coupled with the same | 
 | local_irq_disable() + smp_mb() pattern described above; the ON bit | 
 | (Outstanding Notification) in the posted interrupt descriptor takes the | 
 | role of ``vcpu->requests``.  When sending a posted interrupt, PIR.ON is | 
 | set before reading ``vcpu->mode``; dually, in the VCPU thread, | 
 | vmx_sync_pir_to_irr() reads PIR after setting ``vcpu->mode`` to | 
 | IN_GUEST_MODE. | 
 |  | 
 | Additional Considerations | 
 | ========================= | 
 |  | 
 | Sleeping VCPUs | 
 | -------------- | 
 |  | 
 | VCPU threads may need to consider requests before and/or after calling | 
 | functions that may put them to sleep, e.g. kvm_vcpu_block().  Whether they | 
 | do or not, and, if they do, which requests need consideration, is | 
 | architecture dependent.  kvm_vcpu_block() calls kvm_arch_vcpu_runnable() | 
 | to check if it should awaken.  One reason to do so is to provide | 
 | architectures a function where requests may be checked if necessary. | 
 |  | 
 | Clearing Requests | 
 | ----------------- | 
 |  | 
 | Generally it only makes sense for the receiving VCPU thread to clear a | 
 | request.  However, in some circumstances the requesting thread and the | 
 | receiving VCPU thread are executed serially, for example when they are | 
 | the same thread, or when they use some form of concurrency control to | 
 | temporarily execute synchronously.  In those cases it's possible to know | 
 | that the request may be cleared immediately, rather than waiting for the | 
 | receiving VCPU thread to handle the request in VCPU RUN.  The only current | 
 | examples of this are kvm_vcpu_block() calls made by VCPUs to block | 
 | themselves.  A possible side-effect of that call is to make the | 
 | KVM_REQ_UNHALT request, which may then be cleared immediately when the | 
 | VCPU returns from the call. | 
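
The same-thread case can be sketched as follows.  This is a user-space
model under stated assumptions: ``demo_vcpu_block()`` is a hypothetical
stand-in that does not sleep, its ``event_arrived`` parameter replaces the
real wake-up logic, and the request bit chosen is for illustration only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical request bit, for illustration only. */
#define DEMO_REQ_UNHALT_BIT (1u << 3)

struct demo_vcpu {
	atomic_uint requests;
};

/*
 * Hypothetical stand-in for kvm_vcpu_block().  The real function sleeps;
 * here @event_arrived simply says why the "sleep" ended.
 */
static void demo_vcpu_block(struct demo_vcpu *vcpu, bool event_arrived)
{
	if (event_arrived)
		atomic_fetch_or(&vcpu->requests, DEMO_REQ_UNHALT_BIT);
}

/*
 * Halt emulation on the VCPU thread itself.  Requester and receiver are
 * the same thread, so the request can be tested and cleared right away.
 * Returns true if the halt ended because a wake-up event arrived.
 */
static bool demo_emulate_halt(struct demo_vcpu *vcpu, bool event_arrived)
{
	unsigned int old;

	assert(vcpu);
	demo_vcpu_block(vcpu, event_arrived);
	old = atomic_fetch_and(&vcpu->requests, ~DEMO_REQ_UNHALT_BIT);
	return (old & DEMO_REQ_UNHALT_BIT) != 0;
}
```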
 |  | 
 | References | 
 | ========== | 
 |  | 
 | .. [atomic-ops] Documentation/core-api/atomic_ops.rst | 
 | .. [memory-barriers] Documentation/memory-barriers.txt | 
 | .. [lwn-mb] https://lwn.net/Articles/573436/ |