You would put tasks in the queue by calling put() on the queue.
From the main thread, you can call join() on the queue to wait until all pending tasks have been completed.
This approach has the benefit that you are not creating and destroying threads, which is expensive. The worker threads will run continuously, but will be asleep when no tasks are in the queue, using zero CPU time.
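A minimal sketch of that pattern, assuming Python 3's queue module. The pool size, the worker function, and the squaring "task" are made up for illustration:

```python
import queue
import threading

NUM_WORKERS = 4  # hypothetical pool size
tasks = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    # Runs forever; get() blocks (the thread sleeps) while the queue is empty.
    while True:
        item = tasks.get()
        try:
            with results_lock:
                results.append(item * item)  # stand-in for real work
        finally:
            tasks.task_done()  # tells join() that this task is finished

# Start the long-lived workers once; daemon threads do not block interpreter exit.
for _ in range(NUM_WORKERS):
    threading.Thread(target=worker, daemon=True).start()

# Put tasks in the queue.
for n in range(10):
    tasks.put(n)

# From the main thread, wait until every put() has a matching task_done().
tasks.join()
print(sorted(results))
```

The try/finally around the work matters: if task_done() is skipped after an exception, join() will wait forever.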
submit is used to generate a Future object for a single function call with its associated arguments.
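As a quick illustration, here is a sketch using ThreadPoolExecutor (chosen only so it runs anywhere without a __main__ guard; ProcessPoolExecutor exposes the same submit method). The power function is a made-up example:

```python
from concurrent.futures import ThreadPoolExecutor

def power(base, exponent):
    # Hypothetical task; any callable works.
    return base ** exponent

with ThreadPoolExecutor(max_workers=2) as executor:
    # One call: the function first, then its arguments.
    future = executor.submit(power, 2, 10)
    # result() blocks until the call has finished and returns its value.
    answer = future.result()

print(answer)
```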
ProcessPoolExecutor is doing the exact same thing as multiprocessing.Pool with a simpler (and more limited) API. If you can get away with using ProcessPoolExecutor, use that, because I think it's more likely to get enhancements in the long-term.
Note that you can use all the helpers from multiprocessing with ProcessPoolExecutor, like Lock, Queue, Manager, etc. The main reasons to use multiprocessing.Pool are if you need initializer/initargs (though there is an open bug to get those added to ProcessPoolExecutor) or maxtasksperchild, or if you're running Python 2.7 or earlier and don't want to install (or require your users to install) the backport of concurrent.futures.
Apart from that, an obvious way to proceed using multiprocessing is to use the Pool.apply_async() method, put the async result objects on a bounded Queue.Queue, and have threads in your main program pull those off the Queue and wait for the results to show up. This is easy enough, but it's not magic. It solves your problem because bounded Queues are the canonical way to mediate between producers and consumers that run at different speeds. Nothing in concurrent.futures addresses that problem directly, and it's at the heart of your "massive amounts of memory" problem.
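Roughly, that could look like the sketch below. Some assumptions: it uses Python 3, where the module is queue rather than Queue; multiprocessing.pool.ThreadPool stands in for multiprocessing.Pool (same apply_async API) so the sketch runs without a __main__ guard; and work, run, and max_pending are made-up names.

```python
import queue
import threading
from multiprocessing.pool import ThreadPool  # stand-in with the same apply_async API as Pool

def work(n):
    # Hypothetical task.
    return n * 2

def run(inputs, max_pending=4):
    # Bounded queue: at most max_pending async results are waiting at any time.
    pending = queue.Queue(maxsize=max_pending)
    collected = []

    def consumer():
        while True:
            async_result = pending.get()
            if async_result is None:  # sentinel: no more work is coming
                break
            collected.append(async_result.get())  # wait for the result to show up

    with ThreadPool(2) as pool:
        consumer_thread = threading.Thread(target=consumer)
        consumer_thread.start()
        for n in inputs:
            # put() blocks when the queue is full, which throttles the producer.
            pending.put(pool.apply_async(work, (n,)))
        pending.put(None)
        consumer_thread.join()  # all results are collected before the pool is torn down
    return collected

doubled = run(range(20))
print(doubled)
```

The blocking put() is the whole point: the producer cannot race ahead of the consumer, so memory use stays proportional to max_pending rather than to the size of the input.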
It is worth noting, however, that multiprocessing.Pool.map outperforms ProcessPoolExecutor.map. The performance difference is very small per work item, so you'll probably only notice a large difference if you're using map on a very large iterable. See this closed bug filed against ProcessPoolExecutor for more info. The reason for the performance difference is that multiprocessing.Pool will batch the iterable passed to map into chunks, and then pass the chunks to the worker processes, which reduces the overhead of IPC between the parent and children. ProcessPoolExecutor always passes one item from the iterable at a time to the children, which can lead to much slower performance with large iterables, due to the increased IPC overhead.
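To get a feel for why batching helps, here is a toy sketch that only counts how many messages would cross the process boundary. It is not Pool's actual implementation; the chunked helper and the chunk size of 100 are made up for illustration.

```python
from itertools import islice

def chunked(iterable, chunksize):
    # Yield successive lists of up to chunksize items (hypothetical helper).
    it = iter(iterable)
    while True:
        chunk = list(islice(it, chunksize))
        if not chunk:
            return
        yield chunk

items = range(1000)

# One item per message: one IPC round trip per work item.
one_at_a_time = sum(1 for _ in chunked(items, 1))

# Batches of 100: far fewer round trips for the same work.
batched = sum(1 for _ in chunked(items, 100))

print(one_at_a_time, batched)
```

Each message carries a fixed cost (pickling, a pipe write, a wakeup), so cutting the number of messages is where the saving comes from.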