| xj | b04a402 | 2021-11-25 15:01:52 +0800 | [diff] [blame] | 1 | XFS Delayed Logging Design | 
|  | 2 | -------------------------- | 
|  | 3 |  | 
|  | 4 | Introduction to Re-logging in XFS | 
|  | 5 | --------------------------------- | 
|  | 6 |  | 
|  | 7 | XFS logging is a combination of logical and physical logging. Some objects, | 
|  | 8 | such as inodes and dquots, are logged in logical format where the details | 
|  | 9 | logged are made up of the changes to in-core structures rather than on-disk | 
|  | 10 | structures. Other objects - typically buffers - have their physical changes | 
|  | 11 | logged. The reason for these differences is to reduce the amount of log space | 
|  | 12 | required for objects that are frequently logged. Some parts of inodes are more | 
|  | 13 | frequently logged than others, and inodes are typically more frequently logged | 
|  | 14 | than any other object (except maybe the superblock buffer) so keeping the | 
|  | 15 | amount of metadata logged low is of prime importance. | 
|  | 16 |  | 
|  | 17 | The reason that this is such a concern is that XFS allows multiple separate | 
|  | 18 | modifications to a single object to be carried in the log at any given time. | 
|  | 19 | This allows the log to avoid needing to flush each change to disk before | 
|  | 20 | recording a new change to the object. XFS does this via a method called | 
|  | 21 | "re-logging". Conceptually, this is quite simple - all it requires is that any | 
|  | 22 | new change to the object is recorded with a *new copy* of all the existing | 
|  | 23 | changes in the new transaction that is written to the log. | 
|  | 24 |  | 
|  | 25 | That is, if we have a sequence of changes A through to F, and the object was | 
|  | 26 | written to disk after change D, we would see in the log the following series | 
|  | 27 | of transactions, their contents and the log sequence number (LSN) of the | 
|  | 28 | transaction: | 
|  | 29 |  | 
|  | 30 | Transaction		Contents	LSN | 
|  | 31 | A			   A		   X | 
|  | 32 | B			  A+B		  X+n | 
|  | 33 | C			 A+B+C		 X+n+m | 
|  | 34 | D			A+B+C+D		X+n+m+o | 
|  | 35 | <object written to disk> | 
|  | 36 | E			   E		   Y (> X+n+m+o) | 
|  | 37 | F			  E+F		  Yٍ+p | 
|  | 38 |  | 
|  | 39 | In other words, each time an object is relogged, the new transaction contains | 
|  | 40 | the aggregation of all the previous changes currently held only in the log. | 
|  | 41 |  | 
|  | 42 | This relogging technique also allows objects to be moved forward in the log so | 
|  | 43 | that an object being relogged does not prevent the tail of the log from ever | 
|  | 44 | moving forward.  This can be seen in the table above by the changing | 
|  | 45 | (increasing) LSN of each subsequent transaction - the LSN is effectively a | 
|  | 46 | direct encoding of the location in the log of the transaction. | 
|  | 47 |  | 
|  | 48 | This relogging is also used to implement long-running, multiple-commit | 
|  | 49 | transactions.  These transaction are known as rolling transactions, and require | 
|  | 50 | a special log reservation known as a permanent transaction reservation. A | 
|  | 51 | typical example of a rolling transaction is the removal of extents from an | 
|  | 52 | inode which can only be done at a rate of two extents per transaction because | 
|  | 53 | of reservation size limitations. Hence a rolling extent removal transaction | 
|  | 54 | keeps relogging the inode and btree buffers as they get modified in each | 
|  | 55 | removal operation. This keeps them moving forward in the log as the operation | 
|  | 56 | progresses, ensuring that current operation never gets blocked by itself if the | 
|  | 57 | log wraps around. | 
|  | 58 |  | 
|  | 59 | Hence it can be seen that the relogging operation is fundamental to the correct | 
|  | 60 | working of the XFS journalling subsystem. From the above description, most | 
|  | 61 | people should be able to see why the XFS metadata operations writes so much to | 
|  | 62 | the log - repeated operations to the same objects write the same changes to | 
|  | 63 | the log over and over again. Worse is the fact that objects tend to get | 
|  | 64 | dirtier as they get relogged, so each subsequent transaction is writing more | 
|  | 65 | metadata into the log. | 
|  | 66 |  | 
|  | 67 | Another feature of the XFS transaction subsystem is that most transactions are | 
|  | 68 | asynchronous. That is, they don't commit to disk until either a log buffer is | 
|  | 69 | filled (a log buffer can hold multiple transactions) or a synchronous operation | 
|  | 70 | forces the log buffers holding the transactions to disk. This means that XFS is | 
|  | 71 | doing aggregation of transactions in memory - batching them, if you like - to | 
|  | 72 | minimise the impact of the log IO on transaction throughput. | 
|  | 73 |  | 
|  | 74 | The limitation on asynchronous transaction throughput is the number and size of | 
|  | 75 | log buffers made available by the log manager. By default there are 8 log | 
|  | 76 | buffers available and the size of each is 32kB - the size can be increased up | 
|  | 77 | to 256kB by use of a mount option. | 
|  | 78 |  | 
|  | 79 | Effectively, this gives us the maximum bound of outstanding metadata changes | 
|  | 80 | that can be made to the filesystem at any point in time - if all the log | 
|  | 81 | buffers are full and under IO, then no more transactions can be committed until | 
|  | 82 | the current batch completes. It is now common for a single current CPU core to | 
|  | 83 | be to able to issue enough transactions to keep the log buffers full and under | 
|  | 84 | IO permanently. Hence the XFS journalling subsystem can be considered to be IO | 
|  | 85 | bound. | 
|  | 86 |  | 
|  | 87 | Delayed Logging: Concepts | 
|  | 88 | ------------------------- | 
|  | 89 |  | 
|  | 90 | The key thing to note about the asynchronous logging combined with the | 
|  | 91 | relogging technique XFS uses is that we can be relogging changed objects | 
|  | 92 | multiple times before they are committed to disk in the log buffers. If we | 
|  | 93 | return to the previous relogging example, it is entirely possible that | 
|  | 94 | transactions A through D are committed to disk in the same log buffer. | 
|  | 95 |  | 
|  | 96 | That is, a single log buffer may contain multiple copies of the same object, | 
|  | 97 | but only one of those copies needs to be there - the last one "D", as it | 
|  | 98 | contains all the changes from the previous changes. In other words, we have one | 
|  | 99 | necessary copy in the log buffer, and three stale copies that are simply | 
|  | 100 | wasting space. When we are doing repeated operations on the same set of | 
|  | 101 | objects, these "stale objects" can be over 90% of the space used in the log | 
|  | 102 | buffers. It is clear that reducing the number of stale objects written to the | 
|  | 103 | log would greatly reduce the amount of metadata we write to the log, and this | 
|  | 104 | is the fundamental goal of delayed logging. | 
|  | 105 |  | 
|  | 106 | From a conceptual point of view, XFS is already doing relogging in memory (where | 
|  | 107 | memory == log buffer), only it is doing it extremely inefficiently. It is using | 
|  | 108 | logical to physical formatting to do the relogging because there is no | 
|  | 109 | infrastructure to keep track of logical changes in memory prior to physically | 
|  | 110 | formatting the changes in a transaction to the log buffer. Hence we cannot avoid | 
|  | 111 | accumulating stale objects in the log buffers. | 
|  | 112 |  | 
|  | 113 | Delayed logging is the name we've given to keeping and tracking transactional | 
|  | 114 | changes to objects in memory outside the log buffer infrastructure. Because of | 
|  | 115 | the relogging concept fundamental to the XFS journalling subsystem, this is | 
|  | 116 | actually relatively easy to do - all the changes to logged items are already | 
|  | 117 | tracked in the current infrastructure. The big problem is how to accumulate | 
|  | 118 | them and get them to the log in a consistent, recoverable manner. | 
|  | 119 | Describing the problems and how they have been solved is the focus of this | 
|  | 120 | document. | 
|  | 121 |  | 
|  | 122 | One of the key changes that delayed logging makes to the operation of the | 
|  | 123 | journalling subsystem is that it disassociates the amount of outstanding | 
|  | 124 | metadata changes from the size and number of log buffers available. In other | 
|  | 125 | words, instead of there only being a maximum of 2MB of transaction changes not | 
|  | 126 | written to the log at any point in time, there may be a much greater amount | 
|  | 127 | being accumulated in memory. Hence the potential for loss of metadata on a | 
|  | 128 | crash is much greater than for the existing logging mechanism. | 
|  | 129 |  | 
|  | 130 | It should be noted that this does not change the guarantee that log recovery | 
|  | 131 | will result in a consistent filesystem. What it does mean is that as far as the | 
|  | 132 | recovered filesystem is concerned, there may be many thousands of transactions | 
|  | 133 | that simply did not occur as a result of the crash. This makes it even more | 
|  | 134 | important that applications that care about their data use fsync() where they | 
|  | 135 | need to ensure application level data integrity is maintained. | 
|  | 136 |  | 
|  | 137 | It should be noted that delayed logging is not an innovative new concept that | 
|  | 138 | warrants rigorous proofs to determine whether it is correct or not. The method | 
|  | 139 | of accumulating changes in memory for some period before writing them to the | 
|  | 140 | log is used effectively in many filesystems including ext3 and ext4. Hence | 
|  | 141 | no time is spent in this document trying to convince the reader that the | 
|  | 142 | concept is sound. Instead it is simply considered a "solved problem" and as | 
|  | 143 | such implementing it in XFS is purely an exercise in software engineering. | 
|  | 144 |  | 
|  | 145 | The fundamental requirements for delayed logging in XFS are simple: | 
|  | 146 |  | 
|  | 147 | 1. Reduce the amount of metadata written to the log by at least | 
|  | 148 | an order of magnitude. | 
|  | 149 | 2. Supply sufficient statistics to validate Requirement #1. | 
|  | 150 | 3. Supply sufficient new tracing infrastructure to be able to debug | 
|  | 151 | problems with the new code. | 
|  | 152 | 4. No on-disk format change (metadata or log format). | 
|  | 153 | 5. Enable and disable with a mount option. | 
|  | 154 | 6. No performance regressions for synchronous transaction workloads. | 
|  | 155 |  | 
|  | 156 | Delayed Logging: Design | 
|  | 157 | ----------------------- | 
|  | 158 |  | 
|  | 159 | Storing Changes | 
|  | 160 |  | 
|  | 161 | The problem with accumulating changes at a logical level (i.e. just using the | 
|  | 162 | existing log item dirty region tracking) is that when it comes to writing the | 
|  | 163 | changes to the log buffers, we need to ensure that the object we are formatting | 
|  | 164 | is not changing while we do this. This requires locking the object to prevent | 
|  | 165 | concurrent modification. Hence flushing the logical changes to the log would | 
|  | 166 | require us to lock every object, format them, and then unlock them again. | 
|  | 167 |  | 
|  | 168 | This introduces lots of scope for deadlocks with transactions that are already | 
|  | 169 | running. For example, a transaction has object A locked and modified, but needs | 
|  | 170 | the delayed logging tracking lock to commit the transaction. However, the | 
|  | 171 | flushing thread has the delayed logging tracking lock already held, and is | 
|  | 172 | trying to get the lock on object A to flush it to the log buffer. This appears | 
|  | 173 | to be an unsolvable deadlock condition, and it was solving this problem that | 
|  | 174 | was the barrier to implementing delayed logging for so long. | 
|  | 175 |  | 
|  | 176 | The solution is relatively simple - it just took a long time to recognise it. | 
|  | 177 | Put simply, the current logging code formats the changes to each item into an | 
|  | 178 | vector array that points to the changed regions in the item. The log write code | 
|  | 179 | simply copies the memory these vectors point to into the log buffer during | 
|  | 180 | transaction commit while the item is locked in the transaction. Instead of | 
|  | 181 | using the log buffer as the destination of the formatting code, we can use an | 
|  | 182 | allocated memory buffer big enough to fit the formatted vector. | 
|  | 183 |  | 
|  | 184 | If we then copy the vector into the memory buffer and rewrite the vector to | 
|  | 185 | point to the memory buffer rather than the object itself, we now have a copy of | 
|  | 186 | the changes in a format that is compatible with the log buffer writing code. | 
|  | 187 | that does not require us to lock the item to access. This formatting and | 
|  | 188 | rewriting can all be done while the object is locked during transaction commit, | 
|  | 189 | resulting in a vector that is transactionally consistent and can be accessed | 
|  | 190 | without needing to lock the owning item. | 
|  | 191 |  | 
|  | 192 | Hence we avoid the need to lock items when we need to flush outstanding | 
|  | 193 | asynchronous transactions to the log. The differences between the existing | 
|  | 194 | formatting method and the delayed logging formatting can be seen in the | 
|  | 195 | diagram below. | 
|  | 196 |  | 
|  | 197 | Current format log vector: | 
|  | 198 |  | 
|  | 199 | Object    +---------------------------------------------+ | 
|  | 200 | Vector 1      +----+ | 
|  | 201 | Vector 2                    +----+ | 
|  | 202 | Vector 3                                   +----------+ | 
|  | 203 |  | 
|  | 204 | After formatting: | 
|  | 205 |  | 
|  | 206 | Log Buffer    +-V1-+-V2-+----V3----+ | 
|  | 207 |  | 
|  | 208 | Delayed logging vector: | 
|  | 209 |  | 
|  | 210 | Object    +---------------------------------------------+ | 
|  | 211 | Vector 1      +----+ | 
|  | 212 | Vector 2                    +----+ | 
|  | 213 | Vector 3                                   +----------+ | 
|  | 214 |  | 
|  | 215 | After formatting: | 
|  | 216 |  | 
|  | 217 | Memory Buffer +-V1-+-V2-+----V3----+ | 
|  | 218 | Vector 1      +----+ | 
|  | 219 | Vector 2           +----+ | 
|  | 220 | Vector 3                +----------+ | 
|  | 221 |  | 
|  | 222 | The memory buffer and associated vector need to be passed as a single object, | 
|  | 223 | but still need to be associated with the parent object so if the object is | 
|  | 224 | relogged we can replace the current memory buffer with a new memory buffer that | 
|  | 225 | contains the latest changes. | 
|  | 226 |  | 
|  | 227 | The reason for keeping the vector around after we've formatted the memory | 
|  | 228 | buffer is to support splitting vectors across log buffer boundaries correctly. | 
|  | 229 | If we don't keep the vector around, we do not know where the region boundaries | 
|  | 230 | are in the item, so we'd need a new encapsulation method for regions in the log | 
|  | 231 | buffer writing (i.e. double encapsulation). This would be an on-disk format | 
|  | 232 | change and as such is not desirable.  It also means we'd have to write the log | 
|  | 233 | region headers in the formatting stage, which is problematic as there is per | 
|  | 234 | region state that needs to be placed into the headers during the log write. | 
|  | 235 |  | 
|  | 236 | Hence we need to keep the vector, but by attaching the memory buffer to it and | 
|  | 237 | rewriting the vector addresses to point at the memory buffer we end up with a | 
|  | 238 | self-describing object that can be passed to the log buffer write code to be | 
|  | 239 | handled in exactly the same manner as the existing log vectors are handled. | 
|  | 240 | Hence we avoid needing a new on-disk format to handle items that have been | 
|  | 241 | relogged in memory. | 
|  | 242 |  | 
|  | 243 |  | 
|  | 244 | Tracking Changes | 
|  | 245 |  | 
|  | 246 | Now that we can record transactional changes in memory in a form that allows | 
|  | 247 | them to be used without limitations, we need to be able to track and accumulate | 
|  | 248 | them so that they can be written to the log at some later point in time.  The | 
|  | 249 | log item is the natural place to store this vector and buffer, and also makes sense | 
|  | 250 | to be the object that is used to track committed objects as it will always | 
|  | 251 | exist once the object has been included in a transaction. | 
|  | 252 |  | 
|  | 253 | The log item is already used to track the log items that have been written to | 
|  | 254 | the log but not yet written to disk. Such log items are considered "active" | 
|  | 255 | and as such are stored in the Active Item List (AIL) which is a LSN-ordered | 
|  | 256 | double linked list. Items are inserted into this list during log buffer IO | 
|  | 257 | completion, after which they are unpinned and can be written to disk. An object | 
|  | 258 | that is in the AIL can be relogged, which causes the object to be pinned again | 
|  | 259 | and then moved forward in the AIL when the log buffer IO completes for that | 
|  | 260 | transaction. | 
|  | 261 |  | 
|  | 262 | Essentially, this shows that an item that is in the AIL can still be modified | 
|  | 263 | and relogged, so any tracking must be separate to the AIL infrastructure. As | 
|  | 264 | such, we cannot reuse the AIL list pointers for tracking committed items, nor | 
|  | 265 | can we store state in any field that is protected by the AIL lock. Hence the | 
|  | 266 | committed item tracking needs it's own locks, lists and state fields in the log | 
|  | 267 | item. | 
|  | 268 |  | 
|  | 269 | Similar to the AIL, tracking of committed items is done through a new list | 
|  | 270 | called the Committed Item List (CIL).  The list tracks log items that have been | 
|  | 271 | committed and have formatted memory buffers attached to them. It tracks objects | 
|  | 272 | in transaction commit order, so when an object is relogged it is removed from | 
|  | 273 | it's place in the list and re-inserted at the tail. This is entirely arbitrary | 
|  | 274 | and done to make it easy for debugging - the last items in the list are the | 
|  | 275 | ones that are most recently modified. Ordering of the CIL is not necessary for | 
|  | 276 | transactional integrity (as discussed in the next section) so the ordering is | 
|  | 277 | done for convenience/sanity of the developers. | 
|  | 278 |  | 
|  | 279 |  | 
|  | 280 | Delayed Logging: Checkpoints | 
|  | 281 |  | 
|  | 282 | When we have a log synchronisation event, commonly known as a "log force", | 
|  | 283 | all the items in the CIL must be written into the log via the log buffers. | 
|  | 284 | We need to write these items in the order that they exist in the CIL, and they | 
|  | 285 | need to be written as an atomic transaction. The need for all the objects to be | 
|  | 286 | written as an atomic transaction comes from the requirements of relogging and | 
|  | 287 | log replay - all the changes in all the objects in a given transaction must | 
|  | 288 | either be completely replayed during log recovery, or not replayed at all. If | 
|  | 289 | a transaction is not replayed because it is not complete in the log, then | 
|  | 290 | no later transactions should be replayed, either. | 
|  | 291 |  | 
|  | 292 | To fulfill this requirement, we need to write the entire CIL in a single log | 
|  | 293 | transaction. Fortunately, the XFS log code has no fixed limit on the size of a | 
|  | 294 | transaction, nor does the log replay code. The only fundamental limit is that | 
|  | 295 | the transaction cannot be larger than just under half the size of the log.  The | 
|  | 296 | reason for this limit is that to find the head and tail of the log, there must | 
|  | 297 | be at least one complete transaction in the log at any given time. If a | 
|  | 298 | transaction is larger than half the log, then there is the possibility that a | 
|  | 299 | crash during the write of a such a transaction could partially overwrite the | 
|  | 300 | only complete previous transaction in the log. This will result in a recovery | 
|  | 301 | failure and an inconsistent filesystem and hence we must enforce the maximum | 
|  | 302 | size of a checkpoint to be slightly less than a half the log. | 
|  | 303 |  | 
|  | 304 | Apart from this size requirement, a checkpoint transaction looks no different | 
|  | 305 | to any other transaction - it contains a transaction header, a series of | 
|  | 306 | formatted log items and a commit record at the tail. From a recovery | 
|  | 307 | perspective, the checkpoint transaction is also no different - just a lot | 
|  | 308 | bigger with a lot more items in it. The worst case effect of this is that we | 
|  | 309 | might need to tune the recovery transaction object hash size. | 
|  | 310 |  | 
|  | 311 | Because the checkpoint is just another transaction and all the changes to log | 
|  | 312 | items are stored as log vectors, we can use the existing log buffer writing | 
|  | 313 | code to write the changes into the log. To do this efficiently, we need to | 
|  | 314 | minimise the time we hold the CIL locked while writing the checkpoint | 
|  | 315 | transaction. The current log write code enables us to do this easily with the | 
|  | 316 | way it separates the writing of the transaction contents (the log vectors) from | 
|  | 317 | the transaction commit record, but tracking this requires us to have a | 
|  | 318 | per-checkpoint context that travels through the log write process through to | 
|  | 319 | checkpoint completion. | 
|  | 320 |  | 
|  | 321 | Hence a checkpoint has a context that tracks the state of the current | 
|  | 322 | checkpoint from initiation to checkpoint completion. A new context is initiated | 
|  | 323 | at the same time a checkpoint transaction is started. That is, when we remove | 
|  | 324 | all the current items from the CIL during a checkpoint operation, we move all | 
|  | 325 | those changes into the current checkpoint context. We then initialise a new | 
|  | 326 | context and attach that to the CIL for aggregation of new transactions. | 
|  | 327 |  | 
|  | 328 | This allows us to unlock the CIL immediately after transfer of all the | 
|  | 329 | committed items and effectively allow new transactions to be issued while we | 
|  | 330 | are formatting the checkpoint into the log. It also allows concurrent | 
|  | 331 | checkpoints to be written into the log buffers in the case of log force heavy | 
|  | 332 | workloads, just like the existing transaction commit code does. This, however, | 
|  | 333 | requires that we strictly order the commit records in the log so that | 
|  | 334 | checkpoint sequence order is maintained during log replay. | 
|  | 335 |  | 
|  | 336 | To ensure that we can be writing an item into a checkpoint transaction at | 
|  | 337 | the same time another transaction modifies the item and inserts the log item | 
|  | 338 | into the new CIL, then checkpoint transaction commit code cannot use log items | 
|  | 339 | to store the list of log vectors that need to be written into the transaction. | 
|  | 340 | Hence log vectors need to be able to be chained together to allow them to be | 
|  | 341 | detached from the log items. That is, when the CIL is flushed the memory | 
|  | 342 | buffer and log vector attached to each log item needs to be attached to the | 
|  | 343 | checkpoint context so that the log item can be released. In diagrammatic form, | 
|  | 344 | the CIL would look like this before the flush: | 
|  | 345 |  | 
|  | 346 | CIL Head | 
|  | 347 | | | 
|  | 348 | V | 
|  | 349 | Log Item <-> log vector 1	-> memory buffer | 
|  | 350 | |				-> vector array | 
|  | 351 | V | 
|  | 352 | Log Item <-> log vector 2	-> memory buffer | 
|  | 353 | |				-> vector array | 
|  | 354 | V | 
|  | 355 | ...... | 
|  | 356 | | | 
|  | 357 | V | 
|  | 358 | Log Item <-> log vector N-1	-> memory buffer | 
|  | 359 | |				-> vector array | 
|  | 360 | V | 
|  | 361 | Log Item <-> log vector N	-> memory buffer | 
|  | 362 | -> vector array | 
|  | 363 |  | 
|  | 364 | And after the flush the CIL head is empty, and the checkpoint context log | 
|  | 365 | vector list would look like: | 
|  | 366 |  | 
|  | 367 | Checkpoint Context | 
|  | 368 | | | 
|  | 369 | V | 
|  | 370 | log vector 1	-> memory buffer | 
|  | 371 | |		-> vector array | 
|  | 372 | |		-> Log Item | 
|  | 373 | V | 
|  | 374 | log vector 2	-> memory buffer | 
|  | 375 | |		-> vector array | 
|  | 376 | |		-> Log Item | 
|  | 377 | V | 
|  | 378 | ...... | 
|  | 379 | | | 
|  | 380 | V | 
|  | 381 | log vector N-1	-> memory buffer | 
|  | 382 | |		-> vector array | 
|  | 383 | |		-> Log Item | 
|  | 384 | V | 
|  | 385 | log vector N	-> memory buffer | 
|  | 386 | -> vector array | 
|  | 387 | -> Log Item | 
|  | 388 |  | 
|  | 389 | Once this transfer is done, the CIL can be unlocked and new transactions can | 
|  | 390 | start, while the checkpoint flush code works over the log vector chain to | 
|  | 391 | commit the checkpoint. | 
|  | 392 |  | 
|  | 393 | Once the checkpoint is written into the log buffers, the checkpoint context is | 
|  | 394 | attached to the log buffer that the commit record was written to along with a | 
|  | 395 | completion callback. Log IO completion will call that callback, which can then | 
|  | 396 | run transaction committed processing for the log items (i.e. insert into AIL | 
|  | 397 | and unpin) in the log vector chain and then free the log vector chain and | 
|  | 398 | checkpoint context. | 
|  | 399 |  | 
|  | 400 | Discussion Point: I am uncertain as to whether the log item is the most | 
|  | 401 | efficient way to track vectors, even though it seems like the natural way to do | 
|  | 402 | it. The fact that we walk the log items (in the CIL) just to chain the log | 
|  | 403 | vectors and break the link between the log item and the log vector means that | 
|  | 404 | we take a cache line hit for the log item list modification, then another for | 
|  | 405 | the log vector chaining. If we track by the log vectors, then we only need to | 
|  | 406 | break the link between the log item and the log vector, which means we should | 
|  | 407 | dirty only the log item cachelines. Normally I wouldn't be concerned about one | 
|  | 408 | vs two dirty cachelines except for the fact I've seen upwards of 80,000 log | 
|  | 409 | vectors in one checkpoint transaction. I'd guess this is a "measure and | 
|  | 410 | compare" situation that can be done after a working and reviewed implementation | 
|  | 411 | is in the dev tree.... | 
|  | 412 |  | 
|  | 413 | Delayed Logging: Checkpoint Sequencing | 
|  | 414 |  | 
|  | 415 | One of the key aspects of the XFS transaction subsystem is that it tags | 
|  | 416 | committed transactions with the log sequence number of the transaction commit. | 
|  | 417 | This allows transactions to be issued asynchronously even though there may be | 
|  | 418 | future operations that cannot be completed until that transaction is fully | 
|  | 419 | committed to the log. In the rare case that a dependent operation occurs (e.g. | 
|  | 420 | re-using a freed metadata extent for a data extent), a special, optimised log | 
|  | 421 | force can be issued to force the dependent transaction to disk immediately. | 
|  | 422 |  | 
|  | 423 | To do this, transactions need to record the LSN of the commit record of the | 
|  | 424 | transaction. This LSN comes directly from the log buffer the transaction is | 
|  | 425 | written into. While this works just fine for the existing transaction | 
|  | 426 | mechanism, it does not work for delayed logging because transactions are not | 
|  | 427 | written directly into the log buffers. Hence some other method of sequencing | 
|  | 428 | transactions is required. | 
|  | 429 |  | 
|  | 430 | As discussed in the checkpoint section, delayed logging uses per-checkpoint | 
|  | 431 | contexts, and as such it is simple to assign a sequence number to each | 
|  | 432 | checkpoint. Because the switching of checkpoint contexts must be done | 
|  | 433 | atomically, it is simple to ensure that each new context has a monotonically | 
|  | 434 | increasing sequence number assigned to it without the need for an external | 
|  | 435 | atomic counter - we can just take the current context sequence number and add | 
|  | 436 | one to it for the new context. | 
|  | 437 |  | 
|  | 438 | Then, instead of assigning a log buffer LSN to the transaction commit LSN | 
|  | 439 | during the commit, we can assign the current checkpoint sequence. This allows | 
|  | 440 | operations that track transactions that have not yet completed know what | 
|  | 441 | checkpoint sequence needs to be committed before they can continue. As a | 
|  | 442 | result, the code that forces the log to a specific LSN now needs to ensure that | 
|  | 443 | the log forces to a specific checkpoint. | 
|  | 444 |  | 
|  | 445 | To ensure that we can do this, we need to track all the checkpoint contexts | 
|  | 446 | that are currently committing to the log. When we flush a checkpoint, the | 
|  | 447 | context gets added to a "committing" list which can be searched. When a | 
|  | 448 | checkpoint commit completes, it is removed from the committing list. Because | 
|  | 449 | the checkpoint context records the LSN of the commit record for the checkpoint, | 
|  | 450 | we can also wait on the log buffer that contains the commit record, thereby | 
|  | 451 | using the existing log force mechanisms to execute synchronous forces. | 
|  | 452 |  | 
|  | 453 | It should be noted that the synchronous forces may need to be extended with | 
|  | 454 | mitigation algorithms similar to the current log buffer code to allow | 
|  | 455 | aggregation of multiple synchronous transactions if there are already | 
|  | 456 | synchronous transactions being flushed. Investigation of the performance of the | 
|  | 457 | current design is needed before making any decisions here. | 
|  | 458 |  | 
|  | 459 | The main concern with log forces is to ensure that all the previous checkpoints | 
|  | 460 | are also committed to disk before the one we need to wait for. Therefore we | 
|  | 461 | need to check that all the prior contexts in the committing list are also | 
|  | 462 | complete before waiting on the one we need to complete. We do this | 
|  | 463 | synchronisation in the log force code so that we don't need to wait anywhere | 
|  | 464 | else for such serialisation - it only matters when we do a log force. | 
|  | 465 |  | 
|  | 466 | The only remaining complexity is that a log force now also has to handle the | 
|  | 467 | case where the forcing sequence number is the same as the current context. That | 
|  | 468 | is, we need to flush the CIL and potentially wait for it to complete. This is a | 
|  | 469 | simple addition to the existing log forcing code to check the sequence numbers | 
|  | 470 | and push if required. Indeed, placing the current sequence checkpoint flush in | 
|  | 471 | the log force code enables the current mechanism for issuing synchronous | 
|  | 472 | transactions to remain untouched (i.e. commit an asynchronous transaction, then | 
|  | 473 | force the log at the LSN of that transaction) and so the higher level code | 
|  | 474 | behaves the same regardless of whether delayed logging is being used or not. | 
|  | 475 |  | 
|  | 476 | Delayed Logging: Checkpoint Log Space Accounting | 
|  | 477 |  | 
|  | 478 | The big issue for a checkpoint transaction is the log space reservation for the | 
|  | 479 | transaction. We don't know how big a checkpoint transaction is going to be | 
|  | 480 | ahead of time, nor how many log buffers it will take to write out, nor the | 
|  | 481 | number of split log vector regions are going to be used. We can track the | 
|  | 482 | amount of log space required as we add items to the commit item list, but we | 
|  | 483 | still need to reserve the space in the log for the checkpoint. | 
|  | 484 |  | 
|  | 485 | A typical transaction reserves enough space in the log for the worst case space | 
|  | 486 | usage of the transaction. The reservation accounts for log record headers, | 
|  | 487 | transaction and region headers, headers for split regions, buffer tail padding, | 
|  | 488 | etc. as well as the actual space for all the changed metadata in the | 
|  | 489 | transaction. While some of this is fixed overhead, much of it is dependent on | 
|  | 490 | the size of the transaction and the number of regions being logged (the number | 
|  | 491 | of log vectors in the transaction). | 
|  | 492 |  | 
|  | 493 | An example of the differences would be logging directory changes versus logging | 
|  | 494 | inode changes. If you modify lots of inode cores (e.g. chmod -R g+w *), then | 
|  | 495 | there are lots of transactions that only contain an inode core and an inode log | 
|  | 496 | format structure. That is, two vectors totaling roughly 150 bytes. If we modify | 
|  | 497 | 10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each | 
|  | 498 | vector is 12 bytes, so the total to be logged is approximately 1.75MB. In | 
|  | 499 | comparison, if we are logging full directory buffers, they are typically 4KB | 
|  | 500 | each, so we in 1.5MB of directory buffers we'd have roughly 400 buffers and a | 
|  | 501 | buffer format structure for each buffer - roughly 800 vectors or 1.51MB total | 
|  | 502 | space.  From this, it should be obvious that a static log space reservation is | 
|  | 503 | not particularly flexible and is difficult to select the "optimal value" for | 
|  | 504 | all workloads. | 
|  | 505 |  | 
|  | 506 | Further, if we are going to use a static reservation, which bit of the entire | 
|  | 507 | reservation does it cover? We account for space used by the transaction | 
|  | 508 | reservation by tracking the space currently used by the object in the CIL and | 
|  | 509 | then calculating the increase or decrease in space used as the object is | 
|  | 510 | relogged. This allows for a checkpoint reservation to only have to account for | 
|  | 511 | log buffer metadata used such as log header records. | 
|  | 512 |  | 
|  | 513 | However, even using a static reservation for just the log metadata is | 
|  | 514 | problematic. Typically log record headers use at least 16KB of log space per | 
|  | 515 | 1MB of log space consumed (512 bytes per 32k) and the reservation needs to be | 
|  | 516 | large enough to handle arbitrary sized checkpoint transactions. This | 
|  | 517 | reservation needs to be made before the checkpoint is started, and we need to | 
|  | 518 | be able to reserve the space without sleeping.  For a 8MB checkpoint, we need a | 
|  | 519 | reservation of around 150KB, which is a non-trivial amount of space. | 
|  | 520 |  | 
|  | 521 | A static reservation needs to manipulate the log grant counters - we can take a | 
|  | 522 | permanent reservation on the space, but we still need to make sure we refresh | 
|  | 523 | the write reservation (the actual space available to the transaction) after | 
|  | 524 | every checkpoint transaction completion. Unfortunately, if this space is not | 
|  | 525 | available when required, then the regrant code will sleep waiting for it. | 
|  | 526 |  | 
|  | 527 | The problem with this is that it can lead to deadlocks as we may need to commit | 
|  | 528 | checkpoints to be able to free up log space (refer back to the description of | 
|  | 529 | rolling transactions for an example of this).  Hence we *must* always have | 
|  | 530 | space available in the log if we are to use static reservations, and that is | 
|  | 531 | very difficult and complex to arrange. It is possible to do, but there is a | 
|  | 532 | simpler way. | 
|  | 533 |  | 
|  | 534 | The simpler way of doing this is tracking the entire log space used by the | 
|  | 535 | items in the CIL and using this to dynamically calculate the amount of log | 
|  | 536 | space required by the log metadata. If this log metadata space changes as a | 
|  | 537 | result of a transaction commit inserting a new memory buffer into the CIL, then | 
|  | 538 | the difference in space required is removed from the transaction that causes | 
|  | 539 | the change. Transactions at this level will *always* have enough space | 
|  | 540 | available in their reservation for this as they have already reserved the | 
|  | 541 | maximal amount of log metadata space they require, and such a delta reservation | 
|  | 542 | will always be less than or equal to the maximal amount in the reservation. | 
|  | 543 |  | 
|  | 544 | Hence we can grow the checkpoint transaction reservation dynamically as items | 
|  | 545 | are added to the CIL and avoid the need for reserving and regranting log space | 
|  | 546 | up front. This avoids deadlocks and removes a blocking point from the | 
|  | 547 | checkpoint flush code. | 
|  | 548 |  | 
|  | 549 | As mentioned early, transactions can't grow to more than half the size of the | 
|  | 550 | log. Hence as part of the reservation growing, we need to also check the size | 
|  | 551 | of the reservation against the maximum allowed transaction size. If we reach | 
|  | 552 | the maximum threshold, we need to push the CIL to the log. This is effectively | 
|  | 553 | a "background flush" and is done on demand. This is identical to | 
|  | 554 | a CIL push triggered by a log force, only that there is no waiting for the | 
|  | 555 | checkpoint commit to complete. This background push is checked and executed by | 
|  | 556 | transaction commit code. | 
|  | 557 |  | 
|  | 558 | If the transaction subsystem goes idle while we still have items in the CIL, | 
|  | 559 | they will be flushed by the periodic log force issued by the xfssyncd. This log | 
|  | 560 | force will push the CIL to disk, and if the transaction subsystem stays idle, | 
|  | 561 | allow the idle log to be covered (effectively marked clean) in exactly the same | 
|  | 562 | manner that is done for the existing logging method. A discussion point is | 
|  | 563 | whether this log force needs to be done more frequently than the current rate | 
|  | 564 | which is once every 30s. | 
|  | 565 |  | 
|  | 566 |  | 
|  | 567 | Delayed Logging: Log Item Pinning | 
|  | 568 |  | 
|  | 569 | Currently log items are pinned during transaction commit while the items are | 
|  | 570 | still locked. This happens just after the items are formatted, though it could | 
|  | 571 | be done any time before the items are unlocked. The result of this mechanism is | 
|  | 572 | that items get pinned once for every transaction that is committed to the log | 
|  | 573 | buffers. Hence items that are relogged in the log buffers will have a pin count | 
|  | 574 | for every outstanding transaction they were dirtied in. When each of these | 
|  | 575 | transactions is completed, they will unpin the item once. As a result, the item | 
|  | 576 | only becomes unpinned when all the transactions complete and there are no | 
|  | 577 | pending transactions. Thus the pinning and unpinning of a log item is symmetric | 
|  | 578 | as there is a 1:1 relationship with transaction commit and log item completion. | 
|  | 579 |  | 
|  | 580 | For delayed logging, however, we have an asymmetric transaction commit to | 
|  | 581 | completion relationship. Every time an object is relogged in the CIL it goes | 
|  | 582 | through the commit process without a corresponding completion being registered. | 
|  | 583 | That is, we now have a many-to-one relationship between transaction commit and | 
|  | 584 | log item completion. The result of this is that pinning and unpinning of the | 
|  | 585 | log items becomes unbalanced if we retain the "pin on transaction commit, unpin | 
|  | 586 | on transaction completion" model. | 
|  | 587 |  | 
|  | 588 | To keep pin/unpin symmetry, the algorithm needs to change to a "pin on | 
|  | 589 | insertion into the CIL, unpin on checkpoint completion". In other words, the | 
|  | 590 | pinning and unpinning becomes symmetric around a checkpoint context. We have to | 
|  | 591 | pin the object the first time it is inserted into the CIL - if it is already in | 
|  | 592 | the CIL during a transaction commit, then we do not pin it again. Because there | 
|  | 593 | can be multiple outstanding checkpoint contexts, we can still see elevated pin | 
|  | 594 | counts, but as each checkpoint completes the pin count will retain the correct | 
|  | 595 | value according to it's context. | 
|  | 596 |  | 
|  | 597 | Just to make matters more slightly more complex, this checkpoint level context | 
|  | 598 | for the pin count means that the pinning of an item must take place under the | 
|  | 599 | CIL commit/flush lock. If we pin the object outside this lock, we cannot | 
|  | 600 | guarantee which context the pin count is associated with. This is because of | 
|  | 601 | the fact pinning the item is dependent on whether the item is present in the | 
|  | 602 | current CIL or not. If we don't pin the CIL first before we check and pin the | 
|  | 603 | object, we have a race with CIL being flushed between the check and the pin | 
|  | 604 | (or not pinning, as the case may be). Hence we must hold the CIL flush/commit | 
|  | 605 | lock to guarantee that we pin the items correctly. | 
|  | 606 |  | 
|  | 607 | Delayed Logging: Concurrent Scalability | 
|  | 608 |  | 
|  | 609 | A fundamental requirement for the CIL is that accesses through transaction | 
|  | 610 | commits must scale to many concurrent commits. The current transaction commit | 
|  | 611 | code does not break down even when there are transactions coming from 2048 | 
|  | 612 | processors at once. The current transaction code does not go any faster than if | 
|  | 613 | there was only one CPU using it, but it does not slow down either. | 
|  | 614 |  | 
|  | 615 | As a result, the delayed logging transaction commit code needs to be designed | 
|  | 616 | for concurrency from the ground up. It is obvious that there are serialisation | 
|  | 617 | points in the design - the three important ones are: | 
|  | 618 |  | 
|  | 619 | 1. Locking out new transaction commits while flushing the CIL | 
|  | 620 | 2. Adding items to the CIL and updating item space accounting | 
|  | 621 | 3. Checkpoint commit ordering | 
|  | 622 |  | 
|  | 623 | Looking at the transaction commit and CIL flushing interactions, it is clear | 
|  | 624 | that we have a many-to-one interaction here. That is, the only restriction on | 
|  | 625 | the number of concurrent transactions that can be trying to commit at once is | 
|  | 626 | the amount of space available in the log for their reservations. The practical | 
|  | 627 | limit here is in the order of several hundred concurrent transactions for a | 
|  | 628 | 128MB log, which means that it is generally one per CPU in a machine. | 
|  | 629 |  | 
|  | 630 | The amount of time a transaction commit needs to hold out a flush is a | 
|  | 631 | relatively long period of time - the pinning of log items needs to be done | 
|  | 632 | while we are holding out a CIL flush, so at the moment that means it is held | 
|  | 633 | across the formatting of the objects into memory buffers (i.e. while memcpy()s | 
|  | 634 | are in progress). Ultimately a two pass algorithm where the formatting is done | 
|  | 635 | separately to the pinning of objects could be used to reduce the hold time of | 
|  | 636 | the transaction commit side. | 
|  | 637 |  | 
|  | 638 | Because of the number of potential transaction commit side holders, the lock | 
|  | 639 | really needs to be a sleeping lock - if the CIL flush takes the lock, we do not | 
|  | 640 | want every other CPU in the machine spinning on the CIL lock. Given that | 
|  | 641 | flushing the CIL could involve walking a list of tens of thousands of log | 
|  | 642 | items, it will get held for a significant time and so spin contention is a | 
|  | 643 | significant concern. Preventing lots of CPUs spinning doing nothing is the | 
|  | 644 | main reason for choosing a sleeping lock even though nothing in either the | 
|  | 645 | transaction commit or CIL flush side sleeps with the lock held. | 
|  | 646 |  | 
|  | 647 | It should also be noted that CIL flushing is also a relatively rare operation | 
|  | 648 | compared to transaction commit for asynchronous transaction workloads - only | 
|  | 649 | time will tell if using a read-write semaphore for exclusion will limit | 
|  | 650 | transaction commit concurrency due to cache line bouncing of the lock on the | 
|  | 651 | read side. | 
|  | 652 |  | 
|  | 653 | The second serialisation point is on the transaction commit side where items | 
|  | 654 | are inserted into the CIL. Because transactions can enter this code | 
|  | 655 | concurrently, the CIL needs to be protected separately from the above | 
|  | 656 | commit/flush exclusion. It also needs to be an exclusive lock but it is only | 
|  | 657 | held for a very short time and so a spin lock is appropriate here. It is | 
|  | 658 | possible that this lock will become a contention point, but given the short | 
|  | 659 | hold time once per transaction I think that contention is unlikely. | 
|  | 660 |  | 
|  | 661 | The final serialisation point is the checkpoint commit record ordering code | 
|  | 662 | that is run as part of the checkpoint commit and log force sequencing. The code | 
|  | 663 | path that triggers a CIL flush (i.e. whatever triggers the log force) will enter | 
|  | 664 | an ordering loop after writing all the log vectors into the log buffers but | 
|  | 665 | before writing the commit record. This loop walks the list of committing | 
|  | 666 | checkpoints and needs to block waiting for checkpoints to complete their commit | 
|  | 667 | record write. As a result it needs a lock and a wait variable. Log force | 
|  | 668 | sequencing also requires the same lock, list walk, and blocking mechanism to | 
|  | 669 | ensure completion of checkpoints. | 
|  | 670 |  | 
|  | 671 | These two sequencing operations can use the mechanism even though the | 
|  | 672 | events they are waiting for are different. The checkpoint commit record | 
|  | 673 | sequencing needs to wait until checkpoint contexts contain a commit LSN | 
|  | 674 | (obtained through completion of a commit record write) while log force | 
|  | 675 | sequencing needs to wait until previous checkpoint contexts are removed from | 
|  | 676 | the committing list (i.e. they've completed). A simple wait variable and | 
|  | 677 | broadcast wakeups (thundering herds) has been used to implement these two | 
|  | 678 | serialisation queues. They use the same lock as the CIL, too. If we see too | 
|  | 679 | much contention on the CIL lock, or too many context switches as a result of | 
|  | 680 | the broadcast wakeups these operations can be put under a new spinlock and | 
|  | 681 | given separate wait lists to reduce lock contention and the number of processes | 
|  | 682 | woken by the wrong event. | 
|  | 683 |  | 
|  | 684 |  | 
|  | 685 | Lifecycle Changes | 
|  | 686 |  | 
|  | 687 | The existing log item life cycle is as follows: | 
|  | 688 |  | 
|  | 689 | 1. Transaction allocate | 
|  | 690 | 2. Transaction reserve | 
|  | 691 | 3. Lock item | 
|  | 692 | 4. Join item to transaction | 
|  | 693 | If not already attached, | 
|  | 694 | Allocate log item | 
|  | 695 | Attach log item to owner item | 
|  | 696 | Attach log item to transaction | 
|  | 697 | 5. Modify item | 
|  | 698 | Record modifications in log item | 
|  | 699 | 6. Transaction commit | 
|  | 700 | Pin item in memory | 
|  | 701 | Format item into log buffer | 
|  | 702 | Write commit LSN into transaction | 
|  | 703 | Unlock item | 
|  | 704 | Attach transaction to log buffer | 
|  | 705 |  | 
|  | 706 | <log buffer IO dispatched> | 
|  | 707 | <log buffer IO completes> | 
|  | 708 |  | 
|  | 709 | 7. Transaction completion | 
|  | 710 | Mark log item committed | 
|  | 711 | Insert log item into AIL | 
|  | 712 | Write commit LSN into log item | 
|  | 713 | Unpin log item | 
|  | 714 | 8. AIL traversal | 
|  | 715 | Lock item | 
|  | 716 | Mark log item clean | 
|  | 717 | Flush item to disk | 
|  | 718 |  | 
|  | 719 | <item IO completion> | 
|  | 720 |  | 
|  | 721 | 9. Log item removed from AIL | 
|  | 722 | Moves log tail | 
|  | 723 | Item unlocked | 
|  | 724 |  | 
|  | 725 | Essentially, steps 1-6 operate independently from step 7, which is also | 
|  | 726 | independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9 | 
|  | 727 | at the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur | 
|  | 728 | at the same time. If the log item is in the AIL or between steps 6 and 7 | 
|  | 729 | and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9 | 
|  | 730 | are entered and completed is the object considered clean. | 
|  | 731 |  | 
|  | 732 | With delayed logging, there are new steps inserted into the life cycle: | 
|  | 733 |  | 
|  | 734 | 1. Transaction allocate | 
|  | 735 | 2. Transaction reserve | 
|  | 736 | 3. Lock item | 
|  | 737 | 4. Join item to transaction | 
|  | 738 | If not already attached, | 
|  | 739 | Allocate log item | 
|  | 740 | Attach log item to owner item | 
|  | 741 | Attach log item to transaction | 
|  | 742 | 5. Modify item | 
|  | 743 | Record modifications in log item | 
|  | 744 | 6. Transaction commit | 
|  | 745 | Pin item in memory if not pinned in CIL | 
|  | 746 | Format item into log vector + buffer | 
|  | 747 | Attach log vector and buffer to log item | 
|  | 748 | Insert log item into CIL | 
|  | 749 | Write CIL context sequence into transaction | 
|  | 750 | Unlock item | 
|  | 751 |  | 
|  | 752 | <next log force> | 
|  | 753 |  | 
|  | 754 | 7. CIL push | 
|  | 755 | lock CIL flush | 
|  | 756 | Chain log vectors and buffers together | 
|  | 757 | Remove items from CIL | 
|  | 758 | unlock CIL flush | 
|  | 759 | write log vectors into log | 
|  | 760 | sequence commit records | 
|  | 761 | attach checkpoint context to log buffer | 
|  | 762 |  | 
|  | 763 | <log buffer IO dispatched> | 
|  | 764 | <log buffer IO completes> | 
|  | 765 |  | 
|  | 766 | 8. Checkpoint completion | 
|  | 767 | Mark log item committed | 
|  | 768 | Insert item into AIL | 
|  | 769 | Write commit LSN into log item | 
|  | 770 | Unpin log item | 
|  | 771 | 9. AIL traversal | 
|  | 772 | Lock item | 
|  | 773 | Mark log item clean | 
|  | 774 | Flush item to disk | 
|  | 775 | <item IO completion> | 
|  | 776 | 10. Log item removed from AIL | 
|  | 777 | Moves log tail | 
|  | 778 | Item unlocked | 
|  | 779 |  | 
|  | 780 | From this, it can be seen that the only life cycle differences between the two | 
|  | 781 | logging methods are in the middle of the life cycle - they still have the same | 
|  | 782 | beginning and end and execution constraints. The only differences are in the | 
|  | 783 | committing of the log items to the log itself and the completion processing. | 
|  | 784 | Hence delayed logging should not introduce any constraints on log item | 
|  | 785 | behaviour, allocation or freeing that don't already exist. | 
|  | 786 |  | 
|  | 787 | As a result of this zero-impact "insertion" of delayed logging infrastructure | 
|  | 788 | and the design of the internal structures to avoid on disk format changes, we | 
|  | 789 | can basically switch between delayed logging and the existing mechanism with a | 
|  | 790 | mount option. Fundamentally, there is no reason why the log manager would not | 
|  | 791 | be able to swap methods automatically and transparently depending on load | 
|  | 792 | characteristics, but this should not be necessary if delayed logging works as | 
|  | 793 | designed. |