I've been thinking about Python, Kubernetes, and long-lived pods (such as Celery workers), and I have some questions and thoughts.
When you use memory-intensive modules like pandas, would it make more sense to have the "worker" simply act as a listener and fork a subprocess (passing the environment, of course) to do the actual heavy processing? The thought process here is that, by pushing the work into a subprocess, the memory utilization should go back down once that subprocess exits. Memory would be freed reliably, and any memory leaks would be confined to the short-lived child rather than accumulating in the long-lived pod.
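For concreteness, here is a minimal sketch of that fork-and-exit idea, assuming a hypothetical CSV input with a `value` column; the pandas work runs in a short-lived child process started with a fresh interpreter, so its memory is returned to the OS when the child exits:

```python
import multiprocessing as mp


def _crunch(path, result_queue):
    # Import pandas only inside the child so the long-lived parent never
    # pays its memory footprint.
    import pandas as pd
    df = pd.read_csv(path)
    result_queue.put(float(df["value"].sum()))  # send back a small, picklable result


def process_file(path):
    ctx = mp.get_context("spawn")          # fresh interpreter, nothing inherited
    result_queue = ctx.Queue()
    child = ctx.Process(target=_crunch, args=(path, result_queue))
    child.start()
    result = result_queue.get()            # read before join to avoid blocking on a full queue
    child.join()
    return result


if __name__ == "__main__":
    print(process_file("data.csv"))        # hypothetical input file
```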
Secondly, with Kafka and Faust available, is Celery even relevant for high-availability microservice applications?
I would really like to hear some real-world experience with any of these.
Top comments (1)
For the first question, I assume you will have some memory-heavy tasks (as you said, tasks that use pandas), and your original thought is to use a listener that is responsible for launching subprocesses, right?
That will work, but there is an easier solution that builds on the designs of K8s and Celery themselves.
First, dedicate a specific queue to those memory-heavy tasks; Celery gives you several ways to route tasks to the queues you want. There is an example in the Celery documentation (docs.celeryq.dev/en/stable/usergui...), and a sketch of that idea follows below.
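Roughly, the routing could look like this; the app name "proj", the Redis broker URL, the task name "tasks.crunch_dataframe", and the queue name "memory_heavy" are all made-up placeholders:

```python
from celery import Celery

app = Celery("proj", broker="redis://localhost:6379/0")

# Send only the pandas-heavy task to its own queue; everything else
# keeps using the default "celery" queue.
app.conf.task_routes = {
    "tasks.crunch_dataframe": {"queue": "memory_heavy"},
}


@app.task(name="tasks.crunch_dataframe")
def crunch_dataframe(path):
    # The memory-heavy pandas work lives here; calling crunch_dataframe.delay(path)
    # from the application puts the message on the "memory_heavy" queue.
    import pandas as pd
    return float(pd.read_csv(path)["value"].sum())
```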
Second, run a Celery worker on K8s with more memory and specify the queue name in the worker startup command (docs.celeryq.dev/en/stable/usergui...). You can define the resources of the worker container on K8s by following this example (kubernetes.io/docs/concepts/config...); see the sketch after this paragraph.
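As a rough illustration, the container spec for that dedicated worker could look like the snippet below; the image name, queue name, and memory figures are placeholders you would tune for your own workload, with the queue selected via `-Q` and the memory guaranteed through requests/limits:

```yaml
# Sketch of the worker container inside a K8s Deployment/Pod spec;
# "proj" and "memory_heavy" match the routing example above.
containers:
  - name: celery-heavy-worker
    image: myregistry/myapp:latest              # placeholder image
    command: ["celery", "-A", "proj", "worker",
              "-Q", "memory_heavy",             # only consume the memory-heavy queue
              "--concurrency", "1",             # one pandas job at a time
              "--loglevel", "info"]
    resources:
      requests:
        memory: "2Gi"
      limits:
        memory: "4Gi"
```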
That way, the worker with more memory receives all the memory-heavy tasks sent by the application and processes them, while the other workers stay small.
Hope it helps.