=======================================
The padata parallel execution mechanism
=======================================

:Last updated: for 2.6.36

Padata is a mechanism by which the kernel can farm work out to be done in
parallel on multiple CPUs while retaining the ordering of tasks.  It was
developed for use with the IPsec code, which needs to be able to perform
encryption and decryption on large numbers of packets without reordering
those packets.  The crypto developers made a point of writing padata in a
sufficiently general fashion that it could be put to other uses as well.

The first step in using padata is to set up a padata_instance structure for
overall control of how tasks are to be run::

    #include <linux/padata.h>

    struct padata_instance *padata_alloc(struct workqueue_struct *wq,
                                         const struct cpumask *pcpumask,
                                         const struct cpumask *cbcpumask);

The pcpumask describes which processors will be used to execute work
submitted to this instance in parallel. The cbcpumask defines which
processors are allowed to be used as the serialization callback processor.
The workqueue wq is where the work will actually be done; it should be
a multithreaded queue, naturally.

To allocate a padata instance that uses cpu_possible_mask for both
cpumasks, this helper function can be used::

    struct padata_instance *padata_alloc_possible(struct workqueue_struct *wq);

Note: Padata maintains two kinds of cpumasks internally: the user-supplied
cpumasks, submitted by padata_alloc/padata_alloc_possible, and the 'usable'
cpumasks. The usable cpumasks are always the subset of active CPUs in the
user-supplied cpumasks; these are the cpumasks padata actually uses. So
it is legal to supply a cpumask to padata that contains offline CPUs.
Once an offline CPU in the user-supplied cpumask comes online, padata
starts using it.

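The relationship between the user-supplied and usable cpumasks can be
pictured as a plain bitwise intersection. The following is a userspace
sketch only: toy_cpumask, toy_usable() and toy_cpu_isset() are hypothetical
names, and a 64-bit integer stands in for the kernel's struct cpumask.

```c
#include <stdint.h>

/* Toy model: one bit per CPU, at most 64 CPUs.  This is NOT the
 * kernel's struct cpumask; it only illustrates the subset rule. */
typedef uint64_t toy_cpumask;

/* The usable mask is the part of the user-supplied mask that is
 * currently active. */
static toy_cpumask toy_usable(toy_cpumask user, toy_cpumask active)
{
    return user & active;
}

/* Returns nonzero if 'cpu' is set in 'mask'. */
static int toy_cpu_isset(toy_cpumask mask, unsigned int cpu)
{
    return cpu < 64 && ((mask >> cpu) & 1u);
}
```

With a user mask covering CPUs 0-3 and only CPUs 0 and 2 active, the usable
mask contains CPUs 0 and 2; once CPU 1 becomes active, recomputing the
intersection picks it up without any change to the user-supplied mask.
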
There are functions for enabling and disabling the instance::

    int padata_start(struct padata_instance *pinst);
    void padata_stop(struct padata_instance *pinst);

These functions set or clear the "PADATA_INIT" flag;
if that flag is not set, other functions will refuse to work.
padata_start returns zero on success (flag set) or -EINVAL if the
padata cpumask contains no active CPU (flag not set).
padata_stop clears the flag and blocks until the padata instance
is unused.

The list of CPUs to be used can be adjusted with these functions::

    int padata_set_cpumasks(struct padata_instance *pinst,
                            cpumask_var_t pcpumask,
                            cpumask_var_t cbcpumask);
    int padata_set_cpumask(struct padata_instance *pinst, int cpumask_type,
                           cpumask_var_t cpumask);
    int padata_add_cpu(struct padata_instance *pinst, int cpu, int mask);
    int padata_remove_cpu(struct padata_instance *pinst, int cpu, int mask);

Changing the CPU masks is an expensive operation, though, so it should not be
done with great frequency.

It's possible to change both cpumasks of a padata instance with
padata_set_cpumasks by specifying the cpumasks for parallel execution (pcpumask)
and for the serial callback function (cbcpumask). padata_set_cpumask is used to
change just one of the cpumasks. Here cpumask_type is either PADATA_CPU_SERIAL
or PADATA_CPU_PARALLEL, and cpumask specifies the new cpumask to use.
To add or remove a single CPU from one of the cpumasks, use
padata_add_cpu/padata_remove_cpu. cpu specifies the CPU to add or
remove, and mask is either PADATA_CPU_SERIAL or PADATA_CPU_PARALLEL.

Users interested in padata cpumask changes can register with
the padata cpumask change notifier::

    int padata_register_cpumask_notifier(struct padata_instance *pinst,
                                         struct notifier_block *nblock);

To unregister from that notifier::

    int padata_unregister_cpumask_notifier(struct padata_instance *pinst,
                                           struct notifier_block *nblock);

The padata cpumask change notifier reports changes to the usable
cpumasks, i.e. the subset of active CPUs in the user-supplied cpumask.

Padata calls the notifier chain with::

    blocking_notifier_call_chain(&pinst->cpumask_change_notifier,
                                 notification_mask,
                                 &pd_new->cpumask);

Here cpumask_change_notifier is the registered notifier chain,
notification_mask is either PADATA_CPU_SERIAL or PADATA_CPU_PARALLEL, and
cpumask is a pointer to a struct padata_cpumask that contains the new cpumask
information.

Actually submitting work to the padata instance requires the creation of a
padata_priv structure::

    struct padata_priv {
        /* Other stuff here... */
        void                    (*parallel)(struct padata_priv *padata);
        void                    (*serial)(struct padata_priv *padata);
    };

This structure will almost certainly be embedded within some larger
structure specific to the work to be done.  Most of its fields are private to
padata, but the structure should be zeroed at initialisation time, and the
parallel() and serial() functions should be provided.  Those functions will
be called in the process of getting the work done as we will see
momentarily.

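The embedding pattern described above can be sketched in plain userspace C.
Everything here is illustrative: the struct padata_priv below is a stand-in
that holds only the two documented callbacks (the real one in
<linux/padata.h> has further private fields), container_of() is re-created
with offsetof(), and struct my_job, my_parallel(), my_serial() and
my_job_init() are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for the kernel structure: only the two documented
 * callbacks are modelled here. */
struct padata_priv {
    void (*parallel)(struct padata_priv *padata);
    void (*serial)(struct padata_priv *padata);
};

/* Userspace equivalent of the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical job type that embeds the padata_priv structure. */
struct my_job {
    int input;
    int result;
    int serialized;
    struct padata_priv padata;
};

/* Recovers the enclosing job and does the (placeholder) work. */
static void my_parallel(struct padata_priv *padata)
{
    struct my_job *job = container_of(padata, struct my_job, padata);

    job->result = job->input * 2;
}

/* Recovers the enclosing job and marks it as serialized. */
static void my_serial(struct padata_priv *padata)
{
    struct my_job *job = container_of(padata, struct my_job, padata);

    job->serialized = 1;
}

/* Zeroes the whole job (including the embedded padata_priv) and then
 * fills in the two callbacks. */
static void my_job_init(struct my_job *job, int input)
{
    memset(job, 0, sizeof(*job));
    job->input = input;
    job->padata.parallel = my_parallel;
    job->padata.serial = my_serial;
}
```

In this model the callbacks are invoked by hand through job.padata.parallel
and job.padata.serial; in the kernel it is padata that calls them.
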
The submission of work is done with::

    int padata_do_parallel(struct padata_instance *pinst,
                           struct padata_priv *padata, int cb_cpu);

The pinst and padata structures must be set up as described above; cb_cpu
specifies which CPU will be used for the final callback when the work is
done; it must be in the current instance's CPU mask.  The return value from
padata_do_parallel() is zero on success, indicating that the work is in
progress. -EBUSY means that somebody, somewhere else is messing with the
instance's CPU mask, while -EINVAL is a complaint about cb_cpu not being
in that CPU mask or about the instance not running.

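A caller will usually want to branch on the three return values described
above. The following userspace sketch shows one possible shape for that
dispatch. classify_submit() and enum submit_action are hypothetical names,
and treating -EBUSY as a condition worth retrying is an assumption about
caller policy, not something padata prescribes.

```c
#include <errno.h>

/* Hypothetical caller-side policy for a padata_do_parallel() result. */
enum submit_action {
    SUBMIT_OK,      /* work is in progress */
    SUBMIT_RETRY,   /* assumed transient: the CPU mask is being changed */
    SUBMIT_FAIL     /* bad cb_cpu, instance not running, or unknown error */
};

/* Maps a return value to the action the caller takes next. */
static enum submit_action classify_submit(int err)
{
    switch (err) {
    case 0:
        return SUBMIT_OK;
    case -EBUSY:
        return SUBMIT_RETRY;
    case -EINVAL:
    default:
        return SUBMIT_FAIL;
    }
}
```

A caller that cannot wait might instead fall back to doing the work
synchronously when it sees -EBUSY; that choice is equally outside the scope
of padata itself.
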
Each task submitted to padata_do_parallel() will, in turn, be passed to
exactly one call to the above-mentioned parallel() function, on one CPU, so
true parallelism is achieved by submitting multiple tasks.  Despite the
fact that the workqueue is used to make these calls, parallel() is run with
software interrupts disabled and thus cannot sleep.  The parallel()
function gets the padata_priv structure pointer as its lone parameter;
information about the actual work to be done is probably obtained by using
container_of() to find the enclosing structure.

Note that parallel() has no return value; the padata subsystem assumes that
parallel() will take responsibility for the task from this point.  The work
need not be completed during this call, but, if parallel() leaves work
outstanding, it should be prepared to be called again with a new job before
the previous one completes.  When a task does complete, parallel() (or
whatever function actually finishes the job) should inform padata of the
fact with a call to::

    void padata_do_serial(struct padata_priv *padata);

At some point in the future, padata_do_serial() will trigger a call to the
serial() function in the padata_priv structure.  That call will happen on
the CPU requested in the initial call to padata_do_parallel(); it, too, is
done through the workqueue, but with local software interrupts disabled.
Note that this call may be deferred for a while since the padata code takes
pains to ensure that tasks are completed in the order in which they were
submitted.

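The ordering guarantee can be illustrated with a small userspace model.
This is a sketch of the observable behaviour only, under the assumption
that each task receives a sequence number at submission; it is not how
padata is implemented internally, and struct toy_reorder, toy_submit() and
toy_complete() are hypothetical names.

```c
#define TOY_MAX_TASKS 16

/* Toy reorder state: tasks finish their parallel work in any order,
 * but the serial step is released strictly in submission order. */
struct toy_reorder {
    unsigned int next_submit;           /* sequence number for the next task */
    unsigned int next_serial;           /* next task allowed to serialize */
    int done[TOY_MAX_TASKS];            /* nonzero once parallel work finished */
    unsigned int order[TOY_MAX_TASKS];  /* order in which serial steps ran */
    unsigned int nserial;               /* number of serial steps run so far */
};

/* Models submission: hands out the next sequence number, or -1 when
 * the toy table is full. */
static int toy_submit(struct toy_reorder *r)
{
    if (r->next_submit >= TOY_MAX_TASKS)
        return -1;
    return (int)r->next_submit++;
}

/* Models the completion report: marks task 'seq' as finished, then runs
 * the serial step for every task that is now at the head of the line. */
static void toy_complete(struct toy_reorder *r, unsigned int seq)
{
    if (seq >= r->next_submit)
        return;                         /* never submitted */
    r->done[seq] = 1;
    while (r->next_serial < r->next_submit && r->done[r->next_serial]) {
        r->order[r->nserial++] = r->next_serial;
        r->next_serial++;
    }
}
```

If three tasks are submitted and task 2 finishes first, nothing is
serialized until task 0 completes; tasks 1 and 2 are then released together
as soon as task 1 reports completion. This is the deferral the paragraph
above refers to.
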
The one remaining function in the padata API should be called to clean up
when a padata instance is no longer needed::

    void padata_free(struct padata_instance *pinst);

This function will busy-wait while any remaining tasks are completed, so it
might be best not to call it while there is work outstanding.  Shutting
down the workqueue, if necessary, should be done separately.