The art of maximizing throughput with coroutines
Consider an array of 100 items, each of which requires I/O to process. The goal is to process as many items as possible per unit of time (throughput, measured in items per second).
With traditional threads (preemptive multitasking), one would use a thread pool, which is easy to reason about. However, threads are expensive and few, and coding with mutexes is notoriously error-prone. With coroutines (cooperative multitasking), the program must make do with one thread.
How many coroutines should process this array? Launching 100 coroutines at once is wasteful because they consume memory and stress the scheduler and the garbage collector. Allocating memory and waiting on I/O both reduce throughput. This approach can also saturate the resource at the other end of the wire, and it reduces the memory available to the process for unrelated work (for example, servicing another incoming REST request).
Another downside to scheduling 100 coroutines at once is error handling. A single error is usually enough to abort the processing of the entire array. When an error occurs, every coroutine that is already running represents wasted resources. Therefore, check the success of all finished coroutines before starting new ones. This can be done with a shared error variable. More complicated error handling requirements can easily consume most of your coding and testing budget.
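The shared error variable described above can be sketched as follows, assuming Python's asyncio as the coroutine runtime. This is a minimal illustration rather than a hardened work queue: process_all, process_item, flaky, and the default of 3 workers are hypothetical names and values chosen for the example.

```python
import asyncio

_NO_MORE = object()  # sentinel: the shared iterator is exhausted


async def process_all(items, process_item, workers=3):
    """Run process_item over items with a fixed number of coroutines.

    Returns (results, first_error). No new item is started once any
    coroutine has recorded an error.
    """
    pending = iter(items)   # shared safely: only one coroutine runs at a time
    results = []
    first_error = None      # the shared error variable

    async def worker():
        nonlocal first_error
        while first_error is None:          # check before starting a new item
            item = next(pending, _NO_MORE)
            if item is _NO_MORE:
                return
            try:
                results.append(await process_item(item))
            except Exception as exc:
                if first_error is None:     # keep only the first failure
                    first_error = exc
                return

    await asyncio.gather(*(worker() for _ in range(workers)))
    return results, first_error


async def demo():
    async def flaky(item):                  # hypothetical item processor
        await asyncio.sleep(0.01)           # stands in for real I/O
        if item == 5:
            raise ValueError(f"item {item} failed")
        return item

    return await process_all(range(100), flaky)


if __name__ == "__main__":
    results, error = asyncio.run(demo())
    print(f"{len(results)} items finished before stopping on: {error!r}")
```

Items that are already in flight when the error is recorded still run to completion in this sketch; only the start of new items is prevented. Cancelling in-flight work is one of the more complicated requirements that the paragraph above warns about.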
Computers, of course, do not scale perfectly. Scalability is limited by shared resources such as memory and system I/O handles. External resources such as databases and REST services are similarly constrained. Successful engineers anticipate these limits and work around them as well as possible.
The optimal number of parallel coroutines is often unintuitive. The degree of parallelism is dominated (in the sense of Big O notation) by the number of I/O waits in the item processor. If there is one wait, two coroutines will double the throughput compared to one. If there are two waits, three coroutines are almost guaranteed to increase the throughput noticeably. Adding coroutines beyond <I/O waits>+1 may provide only nominal benefits, depending on the capacity and throughput of the relevant systems.
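One way to test this rule of thumb is to time the same batch at several coroutine counts. The sketch below assumes Python's asyncio and uses asyncio.sleep as a stand-in for real I/O; fake_item, throughput, sweep, and all default values are hypothetical. Because the simulated waits have unlimited capacity, the printed numbers keep rising with the coroutine count. In a real system, the interesting result is the point where they stop rising, which can be found by replacing fake_item with the real item processor.

```python
import asyncio
import time


async def fake_item(waits, wait_s):
    """Stand-in for an item processor with `waits` sequential I/O waits."""
    for _ in range(waits):
        await asyncio.sleep(wait_s)


async def throughput(workers, n_items=24, waits=2, wait_s=0.01):
    """Return items per second when `workers` coroutines share the batch."""
    remaining = iter(range(n_items))    # each worker pulls the next free item

    async def worker():
        for _ in remaining:
            await fake_item(waits, wait_s)

    start = time.perf_counter()
    await asyncio.gather(*(worker() for _ in range(workers)))
    return n_items / (time.perf_counter() - start)


async def sweep(worker_counts=(1, 2, 3, 6)):
    """Measure throughput once for each coroutine count."""
    measured = {}
    for count in worker_counts:
        measured[count] = await throughput(count)
    return measured


if __name__ == "__main__":
    for count, rate in asyncio.run(sweep()).items():
        print(f"{count} coroutine(s): {rate:.0f} items/second")
```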
For example, reading from DynamoDB inside the AWS network is so fast that a single-threaded program can barely keep more than a few requests in flight at once, perhaps 25 at most. Are the DynamoDB resources shared across a multiuser application? If so, a single job should not be too greedy.
Experiment with real environments in order to balance the many concerns of the ecosystem.
Programs that utilize coroutines for high-volume batch operations must conserve resources or suffer intermittent runtime failures, especially under load. This is accomplished by utilizing a limited number of coroutines via third-party libraries or custom logic. The latter is not recommended because rock-solid work queues are difficult to build. Error handling must be considered during the design phase.
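To illustrate what "a limited number of coroutines" means in practice, here is a minimal sketch built on asyncio.Queue from Python's standard library. It is an illustration under stated assumptions, not a recommendation to hand-roll production work queues: run_batch, process_item, and the defaults of 3 workers and a queue size of 6 are hypothetical.

```python
import asyncio

_STOP = object()  # sentinel telling a worker to exit


async def run_batch(items, process_item, workers=3, queue_size=6):
    """Process items with `workers` coroutines fed by a bounded queue."""
    queue = asyncio.Queue(maxsize=queue_size)
    results = []

    async def producer():
        for item in items:
            await queue.put(item)       # waits while the queue is full
        for _ in range(workers):
            await queue.put(_STOP)      # one stop signal per worker

    async def worker():
        while True:
            item = await queue.get()
            if item is _STOP:
                return
            results.append(await process_item(item))

    tasks = [asyncio.create_task(producer())]
    tasks += [asyncio.create_task(worker()) for _ in range(workers)]
    try:
        await asyncio.gather(*tasks)    # re-raises the first failure
    finally:
        for task in tasks:
            task.cancel()               # no effect on tasks already finished
        await asyncio.gather(*tasks, return_exceptions=True)
    return results
```

The bounded queue keeps memory use flat no matter how long the input is, because the producer pauses whenever the workers fall behind. On the first failure, the remaining tasks are cancelled and the exception propagates to the caller, which is only one of several possible error policies.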
Maximizing throughput is an art that depends on the specific task at hand and the nature of all relevant resources, local and remote. Discovering the best solution requires rigorous experimentation.