|  | ================================== | 
|  | DMAengine controller documentation | 
|  | ================================== | 
|  |  | 
|  | Hardware Introduction | 
|  | ===================== | 
|  |  | 
Most of the slave DMA controllers have the same general principles of
operation.
|  |  | 
They have a given number of channels to use for the DMA transfers, and
a given number of request lines.
|  |  | 
Requests and channels are pretty much orthogonal: a channel can be
used to serve any request, and several of them over time. To simplify,
channels are the entities that will be doing the copy, and requests
define which endpoints are involved.
|  |  | 
The request lines actually correspond to physical lines going from the
DMA-eligible devices to the controller itself. Whenever a device wants
to start a transfer, it will assert a DMA request (DRQ) by asserting
that request line.
|  |  | 
|  | A very simple DMA controller would only take into account a single | 
|  | parameter: the transfer size. At each clock cycle, it would transfer a | 
|  | byte of data from one buffer to another, until the transfer size has | 
|  | been reached. | 
|  |  | 
|  | That wouldn't work well in the real world, since slave devices might | 
|  | require a specific number of bits to be transferred in a single | 
|  | cycle. For example, we may want to transfer as much data as the | 
physical bus allows to maximize performance when doing a simple
|  | memory copy operation, but our audio device could have a narrower FIFO | 
|  | that requires data to be written exactly 16 or 24 bits at a time. This | 
|  | is why most if not all of the DMA controllers can adjust this, using a | 
|  | parameter called the transfer width. | 
|  |  | 
Moreover, some DMA controllers, whenever the RAM is used as a source
or destination, can group the reads or writes in memory into a buffer,
so that instead of having a lot of small memory accesses, which is not
really efficient, you'll get several bigger transfers. This is done
using a parameter called the burst size, which defines how many single
reads/writes the controller may group together without splitting the
transfer into smaller sub-transfers.
|  |  | 
Our theoretical DMA controller would then only be able to do transfers
that involve a single contiguous block of data. However, some of the
transfers we usually have are not contiguous, and we may want to copy
data from non-contiguous buffers to a contiguous buffer, an operation
called scatter-gather.
|  |  | 
DMAEngine, at least for mem2dev transfers, requires support for
scatter-gather. So we're left with two cases here: either we have a
quite simple DMA controller that doesn't support it, and we'll have to
implement it in software, or we have a more advanced DMA controller
that implements scatter-gather in hardware.
|  |  | 
|  | The latter are usually programmed using a collection of chunks to | 
|  | transfer, and whenever the transfer is started, the controller will go | 
|  | over that collection, doing whatever we programmed there. | 
|  |  | 
This collection is usually either a table or a linked list. You will
then push either the address of the table and its number of elements,
or the first item of the list, to one channel of the DMA controller,
and whenever a DRQ is asserted, it will go through the collection to
know where to fetch the data from.
|  |  | 
Either way, the format of this collection is completely dependent on
your hardware. Each DMA controller will require a different structure,
but all of them will require, for every chunk, at least the source and
destination addresses, whether it should increment these addresses or
not, and the three parameters we saw earlier: the burst size, the
transfer width and the transfer size.
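
As a purely illustrative sketch, the per-chunk hardware descriptor of
a linked-list based controller might look like this (every name here
is made up; the real layout is entirely hardware specific)::

    struct foo_hw_desc {
            u32 src_addr;   /* source address of this chunk */
            u32 dst_addr;   /* destination address of this chunk */
            u32 config;     /* transfer width, burst size and
                             * address increment flags */
            u32 len;        /* transfer size, in bytes */
            u32 next;       /* bus address of the next descriptor,
                             * or 0 to stop after this chunk */
    };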
|  |  | 
One last thing: slave devices usually won't issue a DRQ by default,
and you first have to enable this in your slave device driver whenever
you want to use DMA.
|  |  | 
These were just the general memory-to-memory (also called mem2mem) or
memory-to-device (mem2dev) kinds of transfers. Most devices support
other kinds of transfers or memory operations that dmaengine supports;
these will be detailed later in this document.
|  |  | 
|  | DMA Support in Linux | 
|  | ==================== | 
|  |  | 
|  | Historically, DMA controller drivers have been implemented using the | 
|  | async TX API, to offload operations such as memory copy, XOR, | 
|  | cryptography, etc., basically any memory to memory operation. | 
|  |  | 
|  | Over time, the need for memory to device transfers arose, and | 
|  | dmaengine was extended. Nowadays, the async TX API is written as a | 
|  | layer on top of dmaengine, and acts as a client. Still, dmaengine | 
|  | accommodates that API in some cases, and made some design choices to | 
|  | ensure that it stayed compatible. | 
|  |  | 
For more information on the Async TX API, please refer to the relevant
documentation file, Documentation/crypto/async-tx-api.txt.
|  |  | 
|  | DMAEngine APIs | 
|  | ============== | 
|  |  | 
|  | ``struct dma_device`` Initialization | 
|  | ------------------------------------ | 
|  |  | 
Just like any other kernel framework, the whole DMAEngine registration
relies on the driver filling a structure and registering against the
framework. In our case, that structure is ``struct dma_device``.
|  |  | 
The first thing you need to do in your driver is to allocate this
structure. Any of the usual memory allocators will do, but you'll also
need to initialize a few fields in there (a minimal sketch follows
this list):
|  |  | 
|  | - ``channels``: should be initialized as a list using the | 
|  | INIT_LIST_HEAD macro for example | 
|  |  | 
- ``src_addr_widths``:
should contain a bitmask of the supported source transfer widths

- ``dst_addr_widths``:
should contain a bitmask of the supported destination transfer widths
|  |  | 
|  | - ``directions``: | 
|  | should contain a bitmask of the supported slave directions | 
|  | (i.e. excluding mem2mem transfers) | 
|  |  | 
|  | - ``residue_granularity``: | 
|  | granularity of the transfer residue reported to dma_set_residue. | 
|  | This can be either: | 
|  |  | 
|  | - Descriptor: | 
|  | your device doesn't support any kind of residue | 
|  | reporting. The framework will only know that a particular | 
|  | transaction descriptor is done. | 
|  |  | 
|  | - Segment: | 
|  | your device is able to report which chunks have been transferred | 
|  |  | 
- Burst:
your device is able to report which bursts have been transferred
|  |  | 
- ``dev``: should hold the pointer to the ``struct device`` associated
with your current driver instance.
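
Putting this together, a minimal initialization sketch could look like
the following. The ``foo_`` name, the chosen widths and the allocation
strategy are illustrative assumptions, not requirements::

    static int foo_dma_probe(struct platform_device *pdev)
    {
            struct dma_device *dd;

            dd = devm_kzalloc(&pdev->dev, sizeof(*dd), GFP_KERNEL);
            if (!dd)
                    return -ENOMEM;

            INIT_LIST_HEAD(&dd->channels);
            dd->src_addr_widths = BIT(DMA_SLAVE_BUSWIDTH_4_BYTES);
            dd->dst_addr_widths = BIT(DMA_SLAVE_BUSWIDTH_2_BYTES) |
                                  BIT(DMA_SLAVE_BUSWIDTH_4_BYTES);
            dd->directions = BIT(DMA_MEM_TO_DEV) | BIT(DMA_DEV_TO_MEM);
            dd->residue_granularity = DMA_RESIDUE_GRANULARITY_BURST;
            dd->dev = &pdev->dev;

            /* set cap_mask and the device_* function pointers (see
             * the next sections), then register the device with
             * dma_async_device_register() */

            return 0;
    }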
|  |  | 
|  | Supported transaction types | 
|  | --------------------------- | 
|  |  | 
|  | The next thing you need is to set which transaction types your device | 
|  | (and driver) supports. | 
|  |  | 
Our ``dma_device`` structure has a field called ``cap_mask`` that holds
the various types of transaction supported; you need to modify this
mask using the ``dma_cap_set`` function, passing flags for the
transaction types you support.
|  |  | 
All those capabilities are defined in the ``dma_transaction_type``
enum, in ``include/linux/dmaengine.h``.
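
For instance, a controller supporting slave and cyclic transfers might
advertise them like this (a sketch, assuming ``dd`` points to the
``dma_device`` being initialized)::

    dma_cap_zero(dd->cap_mask);
    dma_cap_set(DMA_SLAVE, dd->cap_mask);
    dma_cap_set(DMA_CYCLIC, dd->cap_mask);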
|  |  | 
|  | Currently, the types available are: | 
|  |  | 
|  | - DMA_MEMCPY | 
|  |  | 
|  | - The device is able to do memory to memory copies | 
|  |  | 
|  | - DMA_XOR | 
|  |  | 
|  | - The device is able to perform XOR operations on memory areas | 
|  |  | 
|  | - Used to accelerate XOR intensive tasks, such as RAID5 | 
|  |  | 
|  | - DMA_XOR_VAL | 
|  |  | 
|  | - The device is able to perform parity check using the XOR | 
|  | algorithm against a memory buffer. | 
|  |  | 
|  | - DMA_PQ | 
|  |  | 
|  | - The device is able to perform RAID6 P+Q computations, P being a | 
|  | simple XOR, and Q being a Reed-Solomon algorithm. | 
|  |  | 
|  | - DMA_PQ_VAL | 
|  |  | 
- The device is able to perform parity check using the RAID6 P+Q
algorithm against a memory buffer.
|  |  | 
|  | - DMA_INTERRUPT | 
|  |  | 
|  | - The device is able to trigger a dummy transfer that will | 
|  | generate periodic interrupts | 
|  |  | 
|  | - Used by the client drivers to register a callback that will be | 
|  | called on a regular basis through the DMA controller interrupt | 
|  |  | 
|  | - DMA_PRIVATE | 
|  |  | 
- The device only supports slave transfers, and as such isn't
available for async transfers.
|  |  | 
|  | - DMA_ASYNC_TX | 
|  |  | 
|  | - Must not be set by the device, and will be set by the framework | 
|  | if needed | 
|  |  | 
|  | - TODO: What is it about? | 
|  |  | 
|  | - DMA_SLAVE | 
|  |  | 
|  | - The device can handle device to memory transfers, including | 
|  | scatter-gather transfers. | 
|  |  | 
- While in the mem2mem case we had two distinct types to deal
with a single chunk to copy or a collection of them, here we
just have a single transaction type that is supposed to handle
both.
|  |  | 
|  | - If you want to transfer a single contiguous memory buffer, | 
|  | simply build a scatter list with only one item. | 
|  |  | 
|  | - DMA_CYCLIC | 
|  |  | 
|  | - The device can handle cyclic transfers. | 
|  |  | 
|  | - A cyclic transfer is a transfer where the chunk collection will | 
|  | loop over itself, with the last item pointing to the first. | 
|  |  | 
|  | - It's usually used for audio transfers, where you want to operate | 
|  | on a single ring buffer that you will fill with your audio data. | 
|  |  | 
|  | - DMA_INTERLEAVE | 
|  |  | 
- The device supports interleaved transfers.
|  |  | 
- These transfers can transfer data from a non-contiguous buffer
to a non-contiguous buffer, as opposed to DMA_SLAVE, which can
only transfer data from a non-contiguous data set to a contiguous
destination buffer.
|  |  | 
- It's usually used for 2D content transfers, in which case you
want to transfer a portion of uncompressed data directly to the
display.
|  |  | 
|  | These various types will also affect how the source and destination | 
|  | addresses change over time. | 
|  |  | 
|  | Addresses pointing to RAM are typically incremented (or decremented) | 
|  | after each transfer. In case of a ring buffer, they may loop | 
|  | (DMA_CYCLIC). Addresses pointing to a device's register (e.g. a FIFO) | 
|  | are typically fixed. | 
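
For example, a client preparing mem2dev transfers towards a device
FIFO might pass a configuration like the following through
``dmaengine_slave_config()``; ``FOO_FIFO_ADDR`` is a hypothetical
physical address, ``chan`` a previously requested channel, and the
``dma_slave_config`` structure itself is detailed under
``device_config`` below::

    struct dma_slave_config cfg = {
            .dst_addr       = FOO_FIFO_ADDR, /* fixed device register */
            .dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES,
            .dst_maxburst   = 8,
    };
    int ret;

    ret = dmaengine_slave_config(chan, &cfg);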
|  |  | 
|  | Device operations | 
|  | ----------------- | 
|  |  | 
Now that we have described the operations we are able to perform, our
``dma_device`` structure also requires a few function pointers in
order to implement the actual logic.
|  |  | 
|  | The functions that we have to fill in there, and hence have to | 
|  | implement, obviously depend on the transaction types you reported as | 
|  | supported. | 
|  |  | 
|  | - ``device_alloc_chan_resources`` | 
|  |  | 
|  | - ``device_free_chan_resources`` | 
|  |  | 
- These functions will be called whenever a driver calls
``dma_request_channel`` or ``dma_release_channel`` for the first/last
time on the channel associated with that driver.
|  |  | 
|  | - They are in charge of allocating/freeing all the needed | 
|  | resources in order for that channel to be useful for your driver. | 
|  |  | 
|  | - These functions can sleep. | 
|  |  | 
|  | - ``device_prep_dma_*`` | 
|  |  | 
- These functions match the capabilities you registered
previously.
|  |  | 
|  | - These functions all take the buffer or the scatterlist relevant | 
|  | for the transfer being prepared, and should create a hardware | 
|  | descriptor or a list of hardware descriptors from it | 
|  |  | 
|  | - These functions can be called from an interrupt context | 
|  |  | 
|  | - Any allocation you might do should be using the GFP_NOWAIT | 
|  | flag, in order not to potentially sleep, but without depleting | 
|  | the emergency pool either. | 
|  |  | 
- Drivers should try to pre-allocate any memory they might need
during the transfer setup at probe time to avoid putting too
much pressure on the nowait allocator.
|  |  | 
- It should return a unique instance of the
``dma_async_tx_descriptor`` structure, which further represents this
particular transfer. (A minimal sketch of such a prep callback is
shown after this list.)
|  |  | 
|  | - This structure can be initialized using the function | 
|  | ``dma_async_tx_descriptor_init``. | 
|  |  | 
|  | - You'll also need to set two fields in this structure: | 
|  |  | 
- flags:
TODO: Can it be modified by the driver itself, or should it
always be the flags passed in the arguments?
|  |  | 
|  | - tx_submit: A pointer to a function you have to implement, | 
|  | that is supposed to push the current transaction descriptor to a | 
|  | pending queue, waiting for issue_pending to be called. | 
|  |  | 
- In this structure, the function pointer ``callback_result`` can be
initialized in order for the submitter to be notified that a
transaction has completed. Earlier code used the function pointer
``callback``; however, it does not provide any status for the
transaction and will be deprecated. The result structure, defined as
``dmaengine_result``, that is passed in to ``callback_result``
has two fields:
|  |  | 
|  | - result: This provides the transfer result defined by | 
|  | ``dmaengine_tx_result``. Either success or some error condition. | 
|  |  | 
|  | - residue: Provides the residue bytes of the transfer for those that | 
|  | support residue. | 
|  |  | 
|  | - ``device_issue_pending`` | 
|  |  | 
|  | - Takes the first transaction descriptor in the pending queue, | 
|  | and starts the transfer. Whenever that transfer is done, it | 
|  | should move to the next transaction in the list. | 
|  |  | 
|  | - This function can be called in an interrupt context | 
|  |  | 
|  | - ``device_tx_status`` | 
|  |  | 
|  | - Should report the bytes left to go over on the given channel | 
|  |  | 
|  | - Should only care about the transaction descriptor passed as | 
|  | argument, not the currently active one on a given channel | 
|  |  | 
|  | - The tx_state argument might be NULL | 
|  |  | 
|  | - Should use dma_set_residue to report it | 
|  |  | 
|  | - In the case of a cyclic transfer, it should only take into | 
|  | account the current period. | 
|  |  | 
|  | - This function can be called in an interrupt context. | 
|  |  | 
- ``device_config``
|  |  | 
|  | - Reconfigures the channel with the configuration given as argument | 
|  |  | 
|  | - This command should NOT perform synchronously, or on any | 
|  | currently queued transfers, but only on subsequent ones | 
|  |  | 
|  | - In this case, the function will receive a ``dma_slave_config`` | 
|  | structure pointer as an argument, that will detail which | 
|  | configuration to use. | 
|  |  | 
|  | - Even though that structure contains a direction field, this | 
|  | field is deprecated in favor of the direction argument given to | 
|  | the prep_* functions | 
|  |  | 
- This call is mandatory for slave operations only. This should NOT be
set or expected to be set for memcpy operations.
If a driver supports both, it should use this call for slave
operations only and not for memcpy ones.
|  |  | 
- ``device_pause``
|  |  | 
|  | - Pauses a transfer on the channel | 
|  |  | 
|  | - This command should operate synchronously on the channel, | 
|  | pausing right away the work of the given channel | 
|  |  | 
- ``device_resume``
|  |  | 
|  | - Resumes a transfer on the channel | 
|  |  | 
|  | - This command should operate synchronously on the channel, | 
|  | resuming right away the work of the given channel | 
|  |  | 
- ``device_terminate_all``
|  |  | 
|  | - Aborts all the pending and ongoing transfers on the channel | 
|  |  | 
|  | - For aborted transfers the complete callback should not be called | 
|  |  | 
|  | - Can be called from atomic context or from within a complete | 
|  | callback of a descriptor. Must not sleep. Drivers must be able | 
|  | to handle this correctly. | 
|  |  | 
|  | - Termination may be asynchronous. The driver does not have to | 
|  | wait until the currently active transfer has completely stopped. | 
|  | See device_synchronize. | 
|  |  | 
- ``device_synchronize``
|  |  | 
|  | - Must synchronize the termination of a channel to the current | 
|  | context. | 
|  |  | 
|  | - Must make sure that memory for previously submitted | 
|  | descriptors is no longer accessed by the DMA controller. | 
|  |  | 
|  | - Must make sure that all complete callbacks for previously | 
|  | submitted descriptors have finished running and none are | 
|  | scheduled to run. | 
|  |  | 
|  | - May sleep. | 
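
To make the descriptor life cycle more concrete, here is a minimal
sketch of a slave scatter-gather prep callback together with its
``tx_submit`` implementation. All ``foo_`` types and helpers are
hypothetical; ``dma_cookie_assign()`` is one of the cookie helpers
from ``drivers/dma/dmaengine.h``::

    static dma_cookie_t foo_tx_submit(struct dma_async_tx_descriptor *tx)
    {
            struct foo_chan *fc = to_foo_chan(tx->chan);
            struct foo_desc *fd = to_foo_desc(tx);
            dma_cookie_t cookie;
            unsigned long flags;

            /* push the descriptor to the pending queue, where it
             * waits for device_issue_pending to be called */
            spin_lock_irqsave(&fc->lock, flags);
            cookie = dma_cookie_assign(tx);
            list_add_tail(&fd->node, &fc->pending);
            spin_unlock_irqrestore(&fc->lock, flags);

            return cookie;
    }

    static struct dma_async_tx_descriptor *
    foo_prep_slave_sg(struct dma_chan *chan, struct scatterlist *sgl,
                      unsigned int sg_len, enum dma_transfer_direction dir,
                      unsigned long flags, void *context)
    {
            struct foo_chan *fc = to_foo_chan(chan);
            struct foo_desc *fd = foo_desc_get(fc); /* GFP_NOWAIT pool */
            struct scatterlist *sg;
            unsigned int i;

            if (!fd)
                    return NULL;

            for_each_sg(sgl, sg, sg_len, i) {
                    /* translate each chunk into a hardware descriptor,
                     * combining sg_dma_address()/sg_dma_len() with the
                     * device address and widths from device_config */
                    foo_desc_add_chunk(fd, sg_dma_address(sg),
                                       sg_dma_len(sg), dir);
            }

            dma_async_tx_descriptor_init(&fd->tx, chan);
            fd->tx.flags = flags;
            fd->tx.tx_submit = foo_tx_submit;

            return &fd->tx;
    }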
|  |  | 
|  |  | 
|  | Misc notes | 
|  | ========== | 
|  |  | 
(stuff that should be documented, but we don't really know
where to put it)
|  |  | 
|  | ``dma_run_dependencies`` | 
|  |  | 
|  | - Should be called at the end of an async TX transfer, and can be | 
|  | ignored in the slave transfers case. | 
|  |  | 
|  | - Makes sure that dependent operations are run before marking it | 
|  | as complete. | 
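
As a sketch, the completion path of an async-TX-capable driver might
look like this; the ``foo_`` names are hypothetical, and
``dma_cookie_complete()`` and ``dmaengine_desc_get_callback_invoke()``
are helpers from ``drivers/dma/dmaengine.h``::

    static void foo_complete_desc(struct foo_desc *fd)
    {
            struct dmaengine_result res = {
                    .result = DMA_TRANS_NOERROR,
                    .residue = 0,
            };

            dma_cookie_complete(&fd->tx);
            dmaengine_desc_get_callback_invoke(&fd->tx, &res);
            /* only needed for async TX transfers, not slave ones */
            dma_run_dependencies(&fd->tx);
    }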
|  |  | 
``dma_cookie_t``
|  |  | 
- It's a DMA transaction ID that will increment over time.
|  |  | 
|  | - Not really relevant any more since the introduction of ``virt-dma`` | 
|  | that abstracts it away. | 
|  |  | 
|  | DMA_CTRL_ACK | 
|  |  | 
- If clear, the descriptor cannot be reused by the provider until the
client acknowledges receipt, i.e. has had a chance to establish any
dependency chains
|  |  | 
|  | - This can be acked by invoking async_tx_ack() | 
|  |  | 
|  | - If set, does not mean descriptor can be reused | 
|  |  | 
|  | DMA_CTRL_REUSE | 
|  |  | 
|  | - If set, the descriptor can be reused after being completed. It should | 
|  | not be freed by provider if this flag is set. | 
|  |  | 
|  | - The descriptor should be prepared for reuse by invoking | 
|  | ``dmaengine_desc_set_reuse()`` which will set DMA_CTRL_REUSE. | 
|  |  | 
- ``dmaengine_desc_set_reuse()`` will succeed only when the channel
supports reusable descriptors, as exhibited by its capabilities
|  |  | 
- As a consequence, if a device driver wants to skip the
``dma_map_sg()`` and ``dma_unmap_sg()`` in between two transfers,
because the DMA'd data wasn't used, it can resubmit the transfer
right after its completion.
|  |  | 
- A descriptor can be freed in a few ways

- Clearing DMA_CTRL_REUSE by invoking
``dmaengine_desc_clear_reuse()`` and submitting it for the last
transaction
|  |  | 
|  | - Explicitly invoking ``dmaengine_desc_free()``, this can succeed only | 
|  | when DMA_CTRL_REUSE is already set | 
|  |  | 
|  | - Terminating the channel | 
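
From the client side, reusing a descriptor might look like this
sketch, where ``chan``, ``sgl`` and ``sg_len`` are assumed to have
been set up beforehand::

    struct dma_async_tx_descriptor *desc;

    desc = dmaengine_prep_slave_sg(chan, sgl, sg_len, DMA_MEM_TO_DEV,
                                   DMA_PREP_INTERRUPT | DMA_CTRL_ACK);
    if (!desc || dmaengine_desc_set_reuse(desc))
            goto fallback; /* channel can't provide reusable descriptors */

    dmaengine_submit(desc);
    dma_async_issue_pending(chan);

    /* ... wait for completion, then submit the very same descriptor
     * again, skipping the dma_map_sg()/dma_unmap_sg() dance ... */
    dmaengine_submit(desc);
    dma_async_issue_pending(chan);

    /* once done with it for good */
    dmaengine_desc_free(desc);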
|  |  | 
DMA_PREP_CMD

- If set, the client driver tells the DMA controller that the data
passed to the DMA API is command data.

- Interpretation of command data is DMA controller specific. It can be
used for issuing commands to other peripherals, or for register
reads/writes for which the descriptor should be in a different format
from normal data descriptors.
|  |  | 
|  | General Design Notes | 
|  | ==================== | 
|  |  | 
Most of the DMAEngine drivers you'll see are based on a similar design
that handles the end-of-transfer interrupts in the handler, but defers
most work to a tasklet, including the start of a new transfer whenever
the previous one ended.
|  |  | 
This is a rather inefficient design though, because the inter-transfer
latency will be not only the interrupt latency, but also the
scheduling latency of the tasklet, which will leave the channel idle
in between and slow down the global transfer rate.
|  |  | 
|  | You should avoid this kind of practice, and instead of electing a new | 
|  | transfer in your tasklet, move that part to the interrupt handler in | 
|  | order to have a shorter idle window (that we can't really avoid | 
|  | anyway). | 
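
A sketch of that advice, with hypothetical ``foo_`` names: the
interrupt handler elects and starts the next queued transfer itself,
and only defers the completion callbacks to the tasklet::

    static irqreturn_t foo_dma_irq(int irq, void *data)
    {
            struct foo_chan *fc = data;

            spin_lock(&fc->lock);
            /* mark the active descriptor as done */
            list_move_tail(&fc->active->node, &fc->completed);

            /* elect the next transfer here, in the handler, to keep
             * the channel idle window as short as possible */
            if (!list_empty(&fc->pending))
                    foo_start_next_transfer(fc);
            spin_unlock(&fc->lock);

            /* only the client callbacks are deferred */
            tasklet_schedule(&fc->tasklet);

            return IRQ_HANDLED;
    }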
|  |  | 
|  | Glossary | 
|  | ======== | 
|  |  | 
|  | - Burst: A number of consecutive read or write operations that | 
|  | can be queued to buffers before being flushed to memory. | 
|  |  | 
|  | - Chunk: A contiguous collection of bursts | 
|  |  | 
|  | - Transfer: A collection of chunks (be it contiguous or not) |