I have a process running on my Kubernetes cluster with Dask that consists of two map-reduce phases, but both map steps download a potentially large number of big files to each worker. To avoid having two different machines process the same subset of files in the two map steps, is it possible to deterministically control which workers get which arguments for the same jobs? Conceptually, what I want might be something like:
workers : List = client.get_workers()
#                       ^^^^^^^^^^^
filenames : List[str] = get_filenames()  # input data to process
# map each file to a specific worker
file_to_worker = { filename : workers[hash(filename) % len(workers)] for filename in filenames }
# submit each file, specifying which worker should be assigned the task
futures = [client.submit(my_func, filename, worker=file_to_worker[filename]) for filename in filenames]
#                                           ^^^^^^
Something like this would allow me to direct different steps of computation for the same files to the same nodes, eliminating any need to do a second caching of files.
Answer
Yes, you can achieve this kind of deterministic assignment of tasks to specific workers in Dask. Here's a breakdown of how you can implement it efficiently.
Key Concepts
- Worker Assignment Based on File Hashing:
  - You can use the hash of the filename (or another consistent attribute) to assign each file to a specific worker.
  - Dask's client.submit lets you pin tasks to specific workers via its workers keyword argument.
- Worker Affinity:
  - The keyword is workers (plural) rather than worker, but otherwise client.submit works essentially the way you describe: passing a worker address (or a list of addresses) restricts where the task may run. A minimal sketch follows this list.
- Dask's dask.distributed.Client:
  - The client.submit method allows you to control where tasks are executed by passing worker addresses, which ensures that tasks are scheduled on the specific workers you choose.
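As a minimal sketch of the workers keyword in isolation (the scheduler and worker addresses below are placeholders for your own cluster's):
from dask.distributed import Client

client = Client('tcp://your-scheduler:8786')  # placeholder scheduler address

# With the default allow_other_workers=False, this task will only run on the
# worker at the given address.
future = client.submit(sum, [1, 2, 3], workers='tcp://10.0.0.5:34567')  # placeholder worker address
print(future.result())  # 6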
Steps to Implement
- Get the List of Workers:
  - Use client.scheduler_info()['workers'] to get the currently connected workers as a dict keyed by worker address. You'll then map filenames to these addresses using a deterministic hash function.
- Map Files to Workers:
  - Use a hash function like hash(filename) % len(workers) to assign each file to a worker. The same file then always goes to the same worker, avoiding redundant file downloads.
- Submit Tasks to Specific Workers:
  - Use the workers keyword argument in client.submit() to ensure the task for each file is sent to the corresponding worker.
Here's an implementation example:
from dask.distributed import Client

# Example setup
client = Client('your-dask-scheduler-address')  # Connect to your Dask cluster

# Worker addresses, sorted so indexing into the list is deterministic
workers = sorted(client.scheduler_info()['workers'])

filenames = get_filenames()  # List of filenames to process

# Map each filename to a specific worker address based on its hash
file_to_worker = {
    filename: workers[hash(filename) % len(workers)]
    for filename in filenames
}

# Submit each task, pinning it to the chosen worker
futures = [
    client.submit(my_func, filename, workers=file_to_worker[filename])
    for filename in filenames
]

# Optionally gather results if needed
results = client.gather(futures)
Explanation
- Get the Workers:
  - client.scheduler_info()['workers'] returns a dictionary whose keys are worker addresses (the values hold per-worker metadata). Sorting the addresses gives a stable list to index into when assigning tasks deterministically.
- Map Files to Workers:
  - file_to_worker uses the hash of the filename so that each file is consistently assigned to the same worker in both map phases (a note on hash stability across runs, with a sketch, follows this list).
- Submit the Tasks:
  - client.submit(my_func, filename, workers=file_to_worker[filename]) submits each file-processing task to the worker mapped in file_to_worker. With the default allow_other_workers=False, the task will only run on that worker.
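One caveat worth spelling out in code: Python's built-in hash() for strings is randomized per interpreter process unless PYTHONHASHSEED is set, so the mapping above is stable within one client session but not across separate runs. A sketch of a cross-run-stable assignment using hashlib (md5 chosen arbitrarily; any digest works, and stable_worker_for is just an illustrative helper name):
import hashlib

def stable_worker_for(filename: str, workers: list) -> str:
    """Deterministically map a filename to a worker address, stable across runs."""
    digest = hashlib.md5(filename.encode('utf-8')).hexdigest()
    return workers[int(digest, 16) % len(workers)]

# Usage (assumes `workers` is the sorted address list from the example above):
# file_to_worker = {f: stable_worker_for(f, workers) for f in filenames}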
Considerations
- Worker Affinity:
  - If your tasks depend heavily on data residing on the workers (for example, cached files you want to avoid reloading), pinning tasks this way gives you that affinity. Passing allow_other_workers=True relaxes the pin so the scheduler can fall back to another worker if the preferred one is busy or lost, at the cost of possibly re-downloading a file.
- File Caching:
  - In your case, deterministically assigning tasks to workers means the second map phase lands on the worker that already downloaded each file, so no file is fetched twice. A sketch of reusing the same mapping across both phases follows this list.
- Scalability:
  - This approach scales well with many workers, since hashing spreads files roughly evenly across them. Note, however, that the modulo mapping depends on the worker list: if workers are added or removed between the two phases (autoscaling, pod restarts), build file_to_worker once and reuse it rather than recomputing it per phase.
- Data Locality:
  - If you're working with very large datasets that are local to specific workers (e.g., on a distributed filesystem), scheduling tasks on workers that already have access to the required files avoids forcing them to download the files repeatedly.
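To tie this back to the two map-reduce phases in the question, here is a sketch of reusing one file_to_worker mapping for both phases; phase_one and phase_two are placeholders for your own map functions, and workers and filenames come from the example above:
# Build the mapping once and reuse it for both map phases.
file_to_worker = {f: workers[hash(f) % len(workers)] for f in filenames}

# Phase 1: each worker downloads and processes its own subset of files.
phase1 = {
    f: client.submit(phase_one, f, workers=file_to_worker[f])
    for f in filenames
}

# Phase 2: pin each file's task to the same worker, so the file (and the
# phase-1 result passed in as a future) is already local.
phase2 = [
    client.submit(phase_two, f, phase1[f], workers=file_to_worker[f])
    for f in filenames
]

results = client.gather(phase2)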
Final Thoughts
By hashing the filenames and assigning each task to a worker based on that hash, you ensure that tasks for the same file are consistently sent to the same worker. This prevents redundant file downloads between the two map phases and gives you explicit control over how tasks are distributed across your Dask cluster.