|  | 1 | .. SPDX-License-Identifier: GPL-2.0+ | 
|  | 2 | ====================================================== | 
|  | 3 | IBM Virtual Management Channel Kernel Driver (IBMVMC) | 
|  | 4 | ====================================================== | 
|  | 5 |  | 
|  | 6 | :Authors: | 
|  | 7 | Dave Engebretsen <engebret@us.ibm.com>, | 
|  | 8 | Adam Reznechek <adreznec@linux.vnet.ibm.com>, | 
|  | 9 | Steven Royer <seroyer@linux.vnet.ibm.com>, | 
|  | 10 | Bryant G. Ly <bryantly@linux.vnet.ibm.com> | 
|  | 11 |  | 
|  | 12 | Introduction | 
|  | 13 | ============ | 
|  | 14 |  | 
|  | 15 | Note: Knowledge of virtualization technology is required to understand | 
|  | 16 | this document. | 
|  | 17 |  | 
|  | 18 | A good reference document would be: | 
|  | 19 |  | 
|  | 20 | https://openpowerfoundation.org/wp-content/uploads/2016/05/LoPAPR_DRAFT_v11_24March2016_cmt1.pdf | 
|  | 21 |  | 
|  | 22 | The Virtual Management Channel (VMC) is a logical device which provides an | 
|  | 23 | interface between the hypervisor and a management partition. This interface | 
|  | 24 | works by message passing. The management partition is intended to provide | 
|  | 25 | an alternative to Hardware Management Console (HMC)-based system | 
|  | 26 | management. | 
|  | 27 |  | 
|  | 28 | The primary hardware management solution that is developed by IBM relies | 
|  | 29 | on an appliance server named the Hardware Management Console (HMC), | 
|  | 30 | packaged as an external tower or rack-mounted personal computer. In a | 
|  | 31 | Power Systems environment, a single HMC can manage multiple POWER | 
|  | 32 | processor-based systems. | 
|  | 33 |  | 
|  | 34 | Management Application | 
|  | 35 | ---------------------- | 
|  | 36 |  | 
|  | 37 | In the management partition, a management application exists which enables | 
|  | 38 | a system administrator to configure the system’s partitioning | 
|  | 39 | characteristics via a command line interface (CLI) or Representational | 
|  | 40 | State Transfer (REST) APIs. | 
|  | 41 |  | 
|  | 42 | The management application runs on a Linux logical partition on a | 
|  | 43 | POWER8 or newer processor-based server that is virtualized by PowerVM. | 
|  | 44 | System configuration, maintenance, and control functions which | 
|  | 45 | traditionally require an HMC can be implemented in the management | 
|  | 46 | application using a combination of HMC to hypervisor interfaces and | 
|  | 47 | existing operating system methods. This tool provides a subset of the | 
|  | 48 | functions implemented by the HMC and enables basic partition configuration. | 
|  | 49 | The set of HMC to hypervisor messages supported by the management | 
|  | 50 | application component is passed to the hypervisor over a VMC interface, | 
|  | 51 | which is defined below. | 
|  | 52 |  | 
|  | 53 | The VMC enables the management partition to provide basic partitioning | 
|  | 54 | functions: | 
|  | 55 |  | 
|  | 56 | - Logical Partitioning Configuration | 
|  | 57 | - Start and stop actions for individual partitions | 
|  | 58 | - Display of partition status | 
|  | 59 | - Management of virtual Ethernet | 
|  | 60 | - Management of virtual Storage | 
|  | 61 | - Basic system management | 
|  | 62 |  | 
|  | 63 | Virtual Management Channel (VMC) | 
|  | 64 | -------------------------------- | 
|  | 65 |  | 
|  | 66 | A logical device, called the Virtual Management Channel (VMC), is defined | 
|  | 67 | for communicating between the management application and the hypervisor. It | 
|  | 68 | basically creates the pipes that enable virtualization management | 
|  | 69 | software. This device is presented to a designated management partition as | 
|  | 70 | a virtual device. | 
|  | 71 |  | 
|  | 72 | This communication device uses the Command/Response Queue (CRQ) and | 
|  | 73 | Remote Direct Memory Access (RDMA) interfaces. A three-way handshake is | 
|  | 74 | defined that must take place to establish that both the hypervisor and | 
|  | 75 | management partition sides of the channel are running prior to | 
|  | 76 | sending/receiving any of the protocol messages. | 
|  | 77 |  | 
|  | 78 | This driver also utilizes Transport Event CRQs. These CRQ messages are sent | 
|  | 79 | when the hypervisor detects that one of the peer partitions has terminated | 
|  | 80 | abnormally, or that one side has called H_FREE_CRQ to close its CRQ. | 
|  | 81 | Two new classes of CRQ messages are introduced for the VMC device. VMC | 
|  | 82 | Administrative messages are used for each partition using the VMC to | 
|  | 83 | communicate capabilities to their partner. HMC Interface messages are used | 
|  | 84 | for the actual flow of HMC messages between the management partition and | 
|  | 85 | the hypervisor. As most HMC messages far exceed the size of a CRQ buffer, | 
|  | 86 | a virtual DMA (RDMA) of the HMC message data is done prior to each HMC | 
|  | 87 | Interface CRQ message. Only the management partition drives RDMA | 
|  | 88 | operations; hypervisors never directly cause the movement of message data. | 
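
The split between bulk data and small signalling messages can be pictured
with a toy model. This is only a sketch for intuition, not driver code: the
``ToyChannel`` class, its method names, and the 16-byte descriptor layout
are all invented for illustration and do not reflect the real CRQ element
format. A plain memory copy stands in for the RDMA transfer.

```python
import struct

# Hypothetical 16-byte descriptor, NOT the real CRQ element layout:
# session (u8), index (u8), buffer id (u16), message length (u32), 8 pad bytes.
DESCRIPTOR = struct.Struct(">BBHI8x")


class ToyChannel:
    """Toy model: bulk data lives in buffers, signals stay small."""

    def __init__(self, buffer_count, buffer_size):
        self.buffers = [bytearray(buffer_size) for _ in range(buffer_count)]
        self.free = list(range(buffer_count))

    def send(self, session, index, payload):
        """Copy the payload (stand-in for RDMA), then build the signal."""
        if not self.free:
            raise BlockingIOError("no free buffer")
        if len(payload) > len(self.buffers[0]):
            raise ValueError("payload larger than one buffer")
        buffer_id = self.free.pop(0)
        self.buffers[buffer_id][:len(payload)] = payload
        return DESCRIPTOR.pack(session, index, buffer_id, len(payload))

    def receive(self, descriptor):
        """Decode a signal, copy the data out, and release the buffer."""
        session, index, buffer_id, length = DESCRIPTOR.unpack(descriptor)
        data = bytes(self.buffers[buffer_id][:length])
        self.free.append(buffer_id)
        return session, index, data
```

The point of the model is the ordering: the payload is placed in a buffer
first, and only then does a fixed-size message announce where it is and how
long it is.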
|  | 89 |  | 
|  | 90 |  | 
|  | 91 | Terminology | 
|  | 92 | ----------- | 
|  | 93 | RDMA | 
|  | 94 | Remote Direct Memory Access is DMA transfer from the server to its | 
|  | 95 | client or from the server to its partner partition. DMA refers | 
|  | 96 | to both physical I/O to and from memory operations and to memory | 
|  | 97 | to memory move operations. | 
|  | 98 | CRQ | 
|  | 99 | Command/Response Queue, a facility which is used to communicate | 
|  | 100 | between partner partitions. Transport events which are signaled | 
|  | 101 | from the hypervisor to the partition are also reported in this queue. | 
|  | 102 |  | 
|  | 103 | Example Management Partition VMC Driver Interface | 
|  | 104 | ================================================= | 
|  | 105 |  | 
|  | 106 | This section provides an example for the management application | 
|  | 107 | implementation where a device driver is used to interface to the VMC | 
|  | 108 | device. This driver consists of a new device, for example /dev/ibmvmc, | 
|  | 109 | which provides interfaces to open, close, read, write, and perform | 
|  | 110 | ioctls against the VMC device. | 
|  | 111 |  | 
|  | 112 | VMC Interface Initialization | 
|  | 113 | ---------------------------- | 
|  | 114 |  | 
|  | 115 | The device driver is responsible for initializing the VMC when the driver | 
|  | 116 | is loaded. It first creates and initializes the CRQ. Next, an exchange of | 
|  | 117 | VMC capabilities is performed to indicate the code version and number of | 
|  | 118 | resources available in both the management partition and the hypervisor. | 
|  | 119 | Finally, the hypervisor requests that the management partition create an | 
|  | 120 | initial pool of VMC buffers, one buffer for each possible HMC connection, | 
|  | 121 | which will be used for management application session initialization. | 
|  | 122 | Prior to completion of this initialization sequence, the device returns | 
|  | 123 | EBUSY to open() calls. EIO is returned for all other open() failures. | 
|  | 124 |  | 
|  | 125 | :: | 
|  | 126 |  | 
|  | 127 | Management Partition		Hypervisor | 
|  | 128 | CRQ INIT | 
|  | 129 | ----------------------------------------> | 
|  | 130 | CRQ INIT COMPLETE | 
|  | 131 | <---------------------------------------- | 
|  | 132 | CAPABILITIES | 
|  | 133 | ----------------------------------------> | 
|  | 134 | CAPABILITIES RESPONSE | 
|  | 135 | <---------------------------------------- | 
|  | 136 | ADD BUFFER (HMC IDX=0,1,..)         _ | 
|  | 137 | <----------------------------------------  | | 
|  | 138 | ADD BUFFER RESPONSE              | - Perform # HMCs Iterations | 
|  | 139 | ----------------------------------------> - | 
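
From the application side, the EBUSY behaviour described above suggests
retrying open() until the handshake has finished. The following is a
minimal sketch under that assumption. The helper name ``open_vmc``, the
retry count, and the delay are illustrative choices, and the ``opener``
parameter exists only so the logic can be exercised without the device.

```python
import errno
import os
import time


def open_vmc(path="/dev/ibmvmc", attempts=10, delay=0.5, opener=os.open):
    """Open the VMC device, retrying while the driver reports EBUSY."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return opener(path, os.O_RDWR)
        except OSError as exc:
            # EBUSY: initialization handshake not finished yet, try again.
            # Anything else (for example EIO) is treated as fatal.
            if exc.errno != errno.EBUSY or attempt == attempts - 1:
                raise
            time.sleep(delay)
```

A bounded retry is used rather than an endless loop so that a hypervisor
that never completes the handshake surfaces as an error to the caller.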
|  | 140 |  | 
|  | 141 | VMC Interface Open | 
|  | 142 | ------------------ | 
|  | 143 |  | 
|  | 144 | After the basic VMC channel has been initialized, an HMC session level | 
|  | 145 | connection can be established. The application layer performs an open() to | 
|  | 146 | the VMC device and executes an ioctl() against it, indicating the HMC ID | 
|  | 147 | (32 bytes of data) for this session. If the VMC device is in an invalid | 
|  | 148 | state, EIO will be returned for the ioctl(). The device driver creates a | 
|  | 149 | new HMC session value (ranging from 1 to 255) and HMC index value (starting | 
|  | 150 | at index 0 and ranging to 254) for this HMC ID. The driver then does an | 
|  | 151 | RDMA of the HMC ID to the hypervisor, and then sends an Interface Open | 
|  | 152 | message to the hypervisor to establish the session over the VMC. After the | 
|  | 153 | hypervisor receives this information, it sends Add Buffer messages to the | 
|  | 154 | management partition to seed an initial pool of buffers for the new HMC | 
|  | 155 | connection. Finally, the hypervisor sends an Interface Open Response | 
|  | 156 | message, to indicate that it is ready for normal runtime messaging. The | 
|  | 157 | following illustrates this VMC flow: | 
|  | 158 |  | 
|  | 159 | :: | 
|  | 160 |  | 
|  | 161 | Management Partition             Hypervisor | 
|  | 162 | RDMA HMC ID | 
|  | 163 | ----------------------------------------> | 
|  | 164 | Interface Open | 
|  | 165 | ----------------------------------------> | 
|  | 166 | Add Buffer                  _ | 
|  | 167 | <----------------------------------------  | | 
|  | 168 | Add Buffer Response              | - Perform N Iterations | 
|  | 169 | ----------------------------------------> - | 
|  | 170 | Interface Open Response | 
|  | 171 | <---------------------------------------- | 
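
The application's part of this flow is small: it supplies a 32-byte HMC ID
through an ioctl(). A hedged sketch follows. The helper names are
hypothetical, right-padding a shorter ID with NUL bytes is an assumption,
and the ioctl request code is left as a parameter because this document
does not give it; it would have to be taken from the driver's header. The
session and index values are created by the driver, not by the application.

```python
import fcntl

HMC_ID_LEN = 32  # the text above states the HMC ID is 32 bytes of data


def encode_hmc_id(hmc_id):
    """Return the HMC ID as exactly 32 bytes, NUL-padded on the right."""
    raw = hmc_id.encode("ascii") if isinstance(hmc_id, str) else bytes(hmc_id)
    if not raw or len(raw) > HMC_ID_LEN:
        raise ValueError("HMC ID must be 1..32 bytes")
    return raw.ljust(HMC_ID_LEN, b"\0")  # padding scheme is an assumption


def set_hmc_id(fd, hmc_id, request, ioctl=fcntl.ioctl):
    """Issue the session ioctl; `request` must come from the driver header."""
    # EIO from the driver (invalid device state) propagates as OSError.
    return ioctl(fd, request, encode_hmc_id(hmc_id))
```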
|  | 172 |  | 
|  | 173 | VMC Interface Runtime | 
|  | 174 | --------------------- | 
|  | 175 |  | 
|  | 176 | During normal runtime, the management application and the hypervisor | 
|  | 177 | exchange HMC messages via the Signal VMC message and RDMA operations. When | 
|  | 178 | sending data to the hypervisor, the management application performs a | 
|  | 179 | write() to the VMC device, and the driver RDMA’s the data to the hypervisor | 
|  | 180 | and then sends a Signal Message. If a write() is attempted before VMC | 
|  | 181 | device buffers have been made available by the hypervisor, or no buffers | 
|  | 182 | are currently available, EBUSY is returned in response to the write(). A | 
|  | 183 | write() will return EIO for all other errors, such as an invalid device | 
|  | 184 | state. When the hypervisor sends a message to the management partition, the | 
|  | 185 | data is put into a VMC buffer and a Signal Message is sent to the VMC driver in | 
|  | 186 | the management partition. The driver RDMA’s the buffer into the partition | 
|  | 187 | and passes the data up to the appropriate management application via a | 
|  | 188 | read() to the VMC device. The read() request blocks if there is no buffer | 
|  | 189 | available to read. The management application may use select() to wait for | 
|  | 190 | the VMC device to become ready with data to read. | 
|  | 191 |  | 
|  | 192 | :: | 
|  | 193 |  | 
|  | 194 | Management Partition             Hypervisor | 
|  | 195 | MSG RDMA | 
|  | 196 | ----------------------------------------> | 
|  | 197 | SIGNAL MSG | 
|  | 198 | ----------------------------------------> | 
|  | 199 | SIGNAL MSG | 
|  | 200 | <---------------------------------------- | 
|  | 201 | MSG RDMA | 
|  | 202 | <---------------------------------------- | 
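
The runtime behaviour described above maps onto ordinary file descriptor
calls in the management application. The sketch below assumes an already
opened descriptor. The function names, retry count, and delay are
illustrative, not part of the driver interface.

```python
import errno
import os
import select
import time


def send_message(fd, payload, attempts=20, delay=0.1):
    """write() one HMC message, retrying while no VMC buffer is free."""
    for attempt in range(attempts):
        try:
            return os.write(fd, payload)
        except OSError as exc:
            # Only EBUSY (no buffer available) is retried; EIO is fatal.
            if exc.errno != errno.EBUSY or attempt == attempts - 1:
                raise
            time.sleep(delay)
    raise ValueError("attempts must be at least 1")


def receive_message(fd, max_len, timeout=None):
    """Wait with select() until data is ready, then read() one message."""
    readable, _, _ = select.select([fd], [], [], timeout)
    if not readable:
        return None  # timed out, nothing to read
    return os.read(fd, max_len)
```

Using select() with a timeout avoids parking the application in a blocking
read() when it also has other descriptors to service.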
|  | 203 |  | 
|  | 204 | VMC Interface Close | 
|  | 205 | ------------------- | 
|  | 206 |  | 
|  | 207 | HMC session level connections are closed by the management partition when | 
|  | 208 | the application layer performs a close() against the device. This action | 
|  | 209 | results in an Interface Close message flowing to the hypervisor, which | 
|  | 210 | causes the session to be terminated. The device driver must free any | 
|  | 211 | storage allocated for buffers for this HMC connection. | 
|  | 212 |  | 
|  | 213 | :: | 
|  | 214 |  | 
|  | 215 | Management Partition             Hypervisor | 
|  | 216 | INTERFACE CLOSE | 
|  | 217 | ----------------------------------------> | 
|  | 218 | INTERFACE CLOSE RESPONSE | 
|  | 219 | <---------------------------------------- | 
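
Because the close() call is what starts this flow, an application may want
to guarantee it runs even on error paths. One possible way, shown here as a
sketch with a hypothetical helper name, is a context manager:

```python
import contextlib
import os


@contextlib.contextmanager
def vmc_session(path="/dev/ibmvmc"):
    """Yield an open descriptor and always close() it on exit."""
    fd = os.open(path, os.O_RDWR)
    try:
        yield fd
    finally:
        # close() is what triggers the Interface Close flow described above.
        os.close(fd)
```

Code inside a ``with vmc_session() as fd:`` block can then raise freely
without leaking the HMC session.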
|  | 220 |  | 
|  | 221 | Additional Information | 
|  | 222 | ====================== | 
|  | 223 |  | 
|  | 224 | For more information on CRQ messages, VMC messages, HMC interface | 
|  | 225 | buffers, and signal messages, refer to the Linux on Power Architecture | 
|  | 226 | Platform Reference, Section F. | 