Miscellaneous Device control operations for the autofs4 kernel module
====================================================================

The problem
===========

There is a problem with active restarts in autofs (that is to say
restarting autofs when there are busy mounts).

During normal operation autofs uses a file descriptor opened on the
directory that is being managed in order to be able to issue control
operations. Using a file descriptor gives ioctl operations access to
autofs specific information stored in the super block. The operations
are things such as setting an autofs mount catatonic, setting the
expire timeout and requesting expire checks. As is explained below,
certain types of autofs triggered mounts can end up covering an autofs
mount itself which prevents us being able to use open(2) to obtain a
file descriptor for these operations if we don't already have one open.

Currently autofs uses "umount -l" (lazy umount) to clear active mounts
at restart. While using lazy umount works for most cases, anything that
needs to walk back up the mount tree to construct a path, such as
getcwd(2) and the proc file system /proc/<pid>/cwd, no longer works
because the point from which the path is constructed has been detached
from the mount tree.

The actual problem with autofs is that it can't reconnect to existing
mounts. One immediately thinks that just adding the ability to remount
autofs file systems would solve it, but alas, that can't work. This is
because autofs direct mounts and the implementation of "on demand mount
and expire" of nested mount trees have the file system mounted directly
on top of the mount trigger directory dentry.

For example, there are two types of automount maps, direct (in the kernel
module source you will see a third type called an offset, which is just
a direct mount in disguise) and indirect.

Here is a master map with direct and indirect map entries:

/-      /etc/auto.direct
/test   /etc/auto.indirect

and the corresponding map files:

/etc/auto.direct:

/automount/dparse/g6  budgie:/autofs/export1
/automount/dparse/g1  shark:/autofs/export1
and so on.

/etc/auto.indirect:

g1    shark:/autofs/export1
g6    budgie:/autofs/export1
and so on.

For the above indirect map an autofs file system is mounted on /test and
mounts are triggered for each sub-directory key by the inode lookup
operation. So we see a mount of shark:/autofs/export1 on /test/g1, for
example.

The way that direct mounts are handled is by making an autofs mount on
each full path, such as /automount/dparse/g1, and using it as a mount
trigger. So when we walk on the path we mount shark:/autofs/export1 "on
top of this mount point". Since these are always directories we can
use the follow_link inode operation to trigger the mount.

But, each entry in direct and indirect maps can have offsets (making
them multi-mount map entries).

For example, an indirect mount map entry could also be:

g1  \
   /        shark:/autofs/export5/testing/test \
   /s1      shark:/autofs/export/testing/test/s1 \
   /s2      shark:/autofs/export5/testing/test/s2 \
   /s1/ss1  shark:/autofs/export1 \
   /s2/ss2  shark:/autofs/export2

and similarly a direct mount map entry could also be:

/automount/dparse/g1 \
    /       shark:/autofs/export5/testing/test \
    /s1     shark:/autofs/export/testing/test/s1 \
    /s2     shark:/autofs/export5/testing/test/s2 \
    /s1/ss1 shark:/autofs/export2 \
    /s2/ss2 shark:/autofs/export2

One of the issues with version 4 of autofs was that, when mounting an
entry with a large number of offsets, possibly with nesting, we needed
to mount and umount all of the offsets as a single unit. Not really a
problem, except for people with a large number of offsets in map entries.
This mechanism is used for the well known "hosts" map and we have seen
cases (in 2.4) where the available number of mounts is exhausted or
where the number of privileged ports available is exhausted.

In version 5 we mount only as we go down the tree of offsets, and
similarly for expiring them, which resolves the above problem. There is
somewhat more detail to the implementation but it isn't needed for the
sake of the problem explanation. The one important detail is that these
offsets are implemented using the same mechanism as the direct mounts
above and so the mount points can be covered by a mount.

The current autofs implementation uses an ioctl file descriptor opened
on the mount point for control operations. The references held by the
descriptor are accounted for in checks made to determine if a mount is
in use, and the descriptor is also used to access autofs file system
information held in the mount super block. So the use of a file handle
needs to be retained.


The Solution
============

To be able to restart autofs leaving existing direct, indirect and
offset mounts in place we need to be able to obtain a file handle
for these potentially covered autofs mount points. Rather than just
implement an isolated operation it was decided to re-implement the
existing ioctl interface and add new operations to provide this
functionality.

In addition, to be able to reconstruct a mount tree that has busy mounts,
the uid and gid of the last user that triggered the mount need to be
available because these can be used as macro substitution variables in
autofs maps. They are recorded at mount request time and an operation
has been added to retrieve them.

Since we're re-implementing the control interface, a couple of other
problems with the existing interface have been addressed. First, when
a mount or expire operation completes, a status is returned to the
kernel by either a "send ready" or a "send fail" operation. The
"send fail" operation of the ioctl interface could only ever send
ENOENT so the re-implementation allows user space to send an actual
status. Another expensive operation in user space, for those using
very large maps, is discovering if a mount is present. Usually this
involves scanning /proc/mounts and since it needs to be done quite
often it can introduce significant overhead when there are many entries
in the mount table. An operation to look up the mount status of a mount
point dentry (covered or not) has also been added.

Current kernel development policy recommends avoiding the use of the
ioctl mechanism in favor of systems such as Netlink. An implementation
using this system was attempted to evaluate its suitability and it was
found to be inadequate, in this case. The Generic Netlink system was
used for this as raw Netlink would lead to a significant increase in
complexity. There's no question that the Generic Netlink system is an
elegant solution for common case ioctl functions but it's not a complete
replacement, probably because its primary purpose in life is to be a
message bus implementation rather than specifically an ioctl replacement.
While it would be possible to work around this there is one concern
that led to the decision to not use it. This is that the autofs
expire in the daemon has become far too complex because umount
candidates are enumerated, almost for no other reason than to "count"
the number of times to call the expire ioctl. This involves scanning
the mount table which has proved to be a big overhead for users with
large maps. The best way to improve this is to try and get back to the
way the expire was done long ago. That is, when an expire request is
issued for a mount (file handle) we should continually call back to
the daemon until we can't umount any more mounts, then return the
appropriate status to the daemon. At the moment we just expire one
mount at a time. A Generic Netlink implementation would exclude this
possibility for future development due to the requirements of the
message bus architecture.


autofs4 Miscellaneous Device mount control interface
====================================================

The control interface is accessed by opening a device node, typically
/dev/autofs.

All the ioctls use a common structure to pass the needed parameter
information and return operation results:

struct autofs_dev_ioctl {
	__u32 ver_major;
	__u32 ver_minor;
	__u32 size;             /* total size of data passed in
				 * including this struct */
	__s32 ioctlfd;          /* automount command fd */

	/* Command parameters */
	union {
		struct args_protover		protover;
		struct args_protosubver		protosubver;
		struct args_openmount		openmount;
		struct args_ready		ready;
		struct args_fail		fail;
		struct args_setpipefd		setpipefd;
		struct args_timeout		timeout;
		struct args_requester		requester;
		struct args_expire		expire;
		struct args_askumount		askumount;
		struct args_ismountpoint	ismountpoint;
	};

	char path[0];
};

The ioctlfd field is a mount point file descriptor of an autofs mount
point. It is returned by the open call and is used by all calls except
the check for whether a given path is a mount point, where it may
optionally be used to check a specific mount corresponding to a given
mount point file descriptor, and when requesting the uid and gid of the
last successful mount on a directory within the autofs file system.

The union is used to communicate parameters and results of calls made
as described below.

The path field is used to pass a path where it is needed and the size field
is used to account for the increased structure length when translating the
structure sent from user space.

This structure can be initialized before setting specific fields by using
the void function call init_autofs_dev_ioctl(struct autofs_dev_ioctl *).

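As a rough illustration of how the size and path fields relate, the
sketch below builds a request with a trailing path. The names struct
demo_dev_ioctl, demo_request_size() and demo_alloc_request() are
hypothetical and exist only for this example, and the "args" member is a
stand-in for the command parameter union. Real code would use the
structure and the init function that come with the interface.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical user-space mirror of the layout shown above. */
struct demo_dev_ioctl {
	uint32_t ver_major;
	uint32_t ver_minor;
	uint32_t size;		/* struct size plus any trailing path */
	int32_t  ioctlfd;
	uint32_t args[2];	/* stand-in for the command parameter union */
	char     path[];	/* optional NUL terminated path */
};

/* Total request size for a given path (NULL means "no path"). */
static size_t demo_request_size(const char *path)
{
	size_t size = sizeof(struct demo_dev_ioctl);

	if (path)
		size += strlen(path) + 1;	/* count the terminating NUL */
	return size;
}

/*
 * Allocate a zeroed request, record its total size and append the path.
 * The version fields are left at zero in this sketch. Caller frees.
 */
static struct demo_dev_ioctl *demo_alloc_request(const char *path)
{
	size_t size = demo_request_size(path);
	struct demo_dev_ioctl *req;

	assert(size <= UINT32_MAX);
	req = calloc(1, size);
	if (!req)
		return NULL;
	req->size = (uint32_t)size;
	req->ioctlfd = -1;	/* no mount point descriptor yet */
	if (path)
		memcpy(req->path, path, strlen(path) + 1);
	return req;
}
```

For the path "/test/g1" the size field ends up nine bytes larger than the
bare structure: eight characters plus the terminating NUL.
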
All of the ioctls perform a copy of this structure from user space to
kernel space and return -EINVAL if the size parameter is smaller than
the structure size itself, -ENOMEM if the kernel memory allocation fails
or -EFAULT if the copy itself fails. Other checks include a version check
of the compiled in user space version against the module version and a
mismatch results in a -EINVAL return. If the size field is greater than
the structure size then a path is assumed to be present and is checked to
ensure it begins with a "/" and is NUL terminated, otherwise -EINVAL is
returned. Following these checks, for all ioctl commands except
AUTOFS_DEV_IOCTL_VERSION_CMD, AUTOFS_DEV_IOCTL_OPENMOUNT_CMD and
AUTOFS_DEV_IOCTL_CLOSEMOUNT_CMD the ioctlfd is validated and if it is
not a valid descriptor or doesn't correspond to an autofs mount point
an error of -EBADF, -ENOTTY or -EINVAL (not an autofs descriptor) is
returned.

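The size, version and path checks described above can be modelled in
user space roughly as follows. This is only a sketch of the rules as the
text states them, not the kernel code. struct demo_hdr, the DEMO_VER_*
values, demo_fill() and demo_validate() are hypothetical, a version
"mismatch" is assumed to mean any difference, and the -ENOMEM, -EFAULT
and ioctlfd checks are not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define DEMO_VER_MAJOR 1	/* hypothetical version numbers */
#define DEMO_VER_MINOR 0

struct demo_hdr {		/* stand-in for the fixed part of the request */
	uint32_t ver_major;
	uint32_t ver_minor;
	uint32_t size;
	int32_t  ioctlfd;
};

/* Build a header plus optional path in buf; returns bytes used or 0. */
static size_t demo_fill(unsigned char *buf, size_t cap, const char *path)
{
	struct demo_hdr hdr = { DEMO_VER_MAJOR, DEMO_VER_MINOR, 0, -1 };
	size_t plen = path ? strlen(path) + 1 : 0;

	if (sizeof(hdr) + plen > cap)
		return 0;
	hdr.size = (uint32_t)(sizeof(hdr) + plen);
	memcpy(buf, &hdr, sizeof(hdr));
	if (plen)
		memcpy(buf + sizeof(hdr), path, plen);
	return sizeof(hdr) + plen;
}

/* Apply the checks in the order the text lists them. */
static int demo_validate(const unsigned char *buf, size_t buflen)
{
	struct demo_hdr hdr;

	assert(buf != NULL);
	if (buflen < sizeof(hdr))
		return -EINVAL;		/* smaller than the structure */
	memcpy(&hdr, buf, sizeof(hdr));
	if (hdr.size < sizeof(hdr) || hdr.size > buflen)
		return -EINVAL;
	if (hdr.ver_major != DEMO_VER_MAJOR ||
	    hdr.ver_minor != DEMO_VER_MINOR)
		return -EINVAL;		/* version mismatch */
	if (hdr.size > sizeof(hdr)) {
		/* A path is assumed to be present. */
		const unsigned char *path = buf + sizeof(hdr);
		size_t len = hdr.size - sizeof(hdr);

		if (path[0] != '/' || path[len - 1] != '\0')
			return -EINVAL;
	}
	return 0;
}
```

A request carrying "/test/g1" passes these checks, while one carrying
"test/g1" (no leading "/") is rejected with -EINVAL.
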

The ioctls
==========

An example of an implementation which uses this interface can be seen
in autofs version 5.0.4 and later in file lib/dev-ioctl-lib.c of the
distribution tar available for download from kernel.org in directory
/pub/linux/daemons/autofs/v5.

The device node ioctl operations implemented by this interface are:


AUTOFS_DEV_IOCTL_VERSION
------------------------

Get the major and minor version of the autofs4 device ioctl kernel module
implementation. It requires an initialized struct autofs_dev_ioctl as an
input parameter and sets the version information in the passed in structure.
It returns 0 on success or the error -EINVAL if a version mismatch is
detected.


AUTOFS_DEV_IOCTL_PROTOVER_CMD and AUTOFS_DEV_IOCTL_PROTOSUBVER_CMD
------------------------------------------------------------------

Get the major and minor version of the autofs4 protocol version understood
by the loaded module. This call requires an initialized struct
autofs_dev_ioctl with the ioctlfd field set to a valid autofs mount point
descriptor and sets the requested version number in the version field of
struct args_protover or the sub_version field of struct args_protosubver.
These commands return 0 on success or one of the negative error codes if
validation fails.


AUTOFS_DEV_IOCTL_OPENMOUNT and AUTOFS_DEV_IOCTL_CLOSEMOUNT
----------------------------------------------------------

Obtain and release a file descriptor for an autofs managed mount point
path. The open call requires an initialized struct autofs_dev_ioctl with
the path field set and the size field adjusted appropriately as well
as the devid field of struct args_openmount set to the device number of
the autofs mount. The device number can be obtained from the mount options
shown in /proc/mounts. The close call requires an initialized struct
autofs_dev_ioctl with the ioctlfd field set to the descriptor obtained
from the open call. The release of the file descriptor can also be done
with close(2) so any open descriptors will also be closed at process exit.
The close call is included in the implemented operations largely for
completeness and to provide for a consistent user space implementation.


AUTOFS_DEV_IOCTL_READY_CMD and AUTOFS_DEV_IOCTL_FAIL_CMD
--------------------------------------------------------

Return mount and expire result status from user space to the kernel.
Both of these calls require an initialized struct autofs_dev_ioctl
with the ioctlfd field set to the descriptor obtained from the open
call and the token field of struct args_ready or struct args_fail set
to the wait queue token number, received by user space in the foregoing
mount or expire request. The status field of struct args_fail is set to
the errno of the operation. It is set to 0 on success.


AUTOFS_DEV_IOCTL_SETPIPEFD_CMD
------------------------------

Set the pipe file descriptor used for kernel communication to the daemon.
Normally this is set at mount time using an option but when reconnecting
to an existing mount we need to use this to tell the autofs mount about
the new kernel pipe descriptor. In order to protect mounts against
incorrectly setting the pipe descriptor we also require that the autofs
mount be catatonic (see next call).

The call requires an initialized struct autofs_dev_ioctl with the
ioctlfd field set to the descriptor obtained from the open call and
the pipefd field of struct args_setpipefd set to the descriptor of the
pipe. On success the call also sets the process group id used to identify
the controlling process (e.g. the owning automount(8) daemon) to the
process group of the caller.


AUTOFS_DEV_IOCTL_CATATONIC_CMD
------------------------------

Make the autofs mount point catatonic. The autofs mount will no longer
issue mount requests, the kernel communication pipe descriptor is released
and any remaining waits in the queue are released.

The call requires an initialized struct autofs_dev_ioctl with the
ioctlfd field set to the descriptor obtained from the open call.


AUTOFS_DEV_IOCTL_TIMEOUT_CMD
----------------------------

Set the expire timeout for mounts within an autofs mount point.

The call requires an initialized struct autofs_dev_ioctl with the
ioctlfd field set to the descriptor obtained from the open call.


AUTOFS_DEV_IOCTL_REQUESTER_CMD
------------------------------

Return the uid and gid of the last process to successfully trigger a
mount on the given path dentry.

The call requires an initialized struct autofs_dev_ioctl with the path
field set to the mount point in question and the size field adjusted
appropriately. Upon return the uid field of struct args_requester contains
the uid and the gid field contains the gid.

When reconstructing an autofs mount tree with active mounts we need to
re-connect to mounts that may have used the original process uid and
gid (or string variations of them) for mount lookups within the map entry.
This call provides the ability to obtain this uid and gid so they may be
used by user space for the mount map lookups.

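To make the use of the returned uid and gid more concrete, here is a
minimal sketch of substituting them into a map entry before a lookup. It
assumes the "${UID}" and "${GID}" macro spelling from autofs map syntax,
which this document does not itself spell out. demo_expand() and the map
entry in the usage note are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy tmpl to out, replacing "${UID}" and "${GID}" with the given ids.
 * Returns 0 on success or -1 if out (of size cap) is too small.
 */
static int demo_expand(const char *tmpl, unsigned int uid, unsigned int gid,
		       char *out, size_t cap)
{
	size_t used = 0;

	assert(tmpl != NULL && out != NULL && cap > 0);
	while (*tmpl) {
		int n;

		if (strncmp(tmpl, "${UID}", 6) == 0) {
			n = snprintf(out + used, cap - used, "%u", uid);
			tmpl += 6;
		} else if (strncmp(tmpl, "${GID}", 6) == 0) {
			n = snprintf(out + used, cap - used, "%u", gid);
			tmpl += 6;
		} else {
			n = snprintf(out + used, cap - used, "%c", *tmpl);
			tmpl += 1;
		}
		if (n < 0 || (size_t)n >= cap - used)
			return -1;	/* output would be truncated */
		used += (size_t)n;
	}
	out[used] = '\0';
	return 0;
}
```

With uid 1000, a made-up entry such as "shark:/export/home/${UID}" expands
to "shark:/export/home/1000", which is the string user space would then
use for the map lookup.
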

AUTOFS_DEV_IOCTL_EXPIRE_CMD
---------------------------

Issue an expire request to the kernel for an autofs mount. Typically
this ioctl is called until no further expire candidates are found.

The call requires an initialized struct autofs_dev_ioctl with the
ioctlfd field set to the descriptor obtained from the open call. In
addition an immediate expire, independent of the mount timeout, can be
requested by setting the how field of struct args_expire to 1. If no
expire candidates can be found the ioctl returns -1 with errno set to
EAGAIN.

This call causes the kernel module to check the mount corresponding
to the given ioctlfd for mounts that can be expired, issues an expire
request back to the daemon and waits for completion.

AUTOFS_DEV_IOCTL_ASKUMOUNT_CMD
------------------------------

Checks if an autofs mount point is in use.

The call requires an initialized struct autofs_dev_ioctl with the
ioctlfd field set to the descriptor obtained from the open call and
it returns the result in the may_umount field of struct args_askumount,
1 for busy and 0 otherwise.


AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD
---------------------------------

Check if the given path is a mountpoint.

The call requires an initialized struct autofs_dev_ioctl. There are two
possible variations. Both use the path field set to the path of the mount
point to check and the size field adjusted appropriately. One uses the
ioctlfd field to identify a specific mount point to check while the other
variation uses the path and optionally the in.type field of struct
args_ismountpoint set to an autofs mount type. The call returns 1 if this
is a mount point and sets the out.devid field to the device number of the
mount and the out.magic field to the relevant super block magic number
(described below) or 0 if it isn't a mountpoint. In both cases the device
number (as returned by new_encode_dev()) is returned in the out.devid field.

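Since out.devid is in new_encode_dev() form, user space has to unpack it
to get at the major and minor numbers. The sketch below reproduces the
bit layout used by new_encode_dev() and its inverse. The demo_* function
names are hypothetical and are not part of the interface.

```c
#include <assert.h>
#include <stdint.h>

/* Pack major/minor using the same bit layout as new_encode_dev(). */
static uint32_t demo_encode_dev(unsigned int major, unsigned int minor)
{
	/* 12 bits of major and 20 bits of minor fit in the encoding. */
	assert(major <= 0xfff && minor <= 0xfffff);
	return (minor & 0xff) | (major << 8) | ((minor & ~0xffu) << 12);
}

/* Major number: bits 8 to 19 of the encoded value. */
static unsigned int demo_dev_major(uint32_t dev)
{
	return (dev & 0xfff00) >> 8;
}

/* Minor number: low 8 bits plus the bits stored above the major. */
static unsigned int demo_dev_minor(uint32_t dev)
{
	return (dev & 0xff) | ((dev >> 12) & 0xfff00);
}
```

For example, major 8 and minor 1 encode to 0x801, and a value with
major 0 and minor 45 is simply 45.
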
If supplied with a file descriptor we're looking for a specific mount,
not necessarily at the top of the mounted stack. In this case the path
the descriptor corresponds to is considered a mountpoint if it is itself
a mountpoint or contains a mount, such as a multi-mount without a root
mount. In this case we return 1 if the descriptor corresponds to a mount
point and also return the super magic of the covering mount if there
is one, or 0 if it isn't a mountpoint.

If a path is supplied (and the ioctlfd field is set to -1) then the path
is looked up and is checked to see if it is the root of a mount. If a
type is also given we are looking for a particular autofs mount and if
a match isn't found a fail is returned. If the located path is the
root of a mount 1 is returned along with the super magic of the mount
or 0 otherwise.
