
Is Apache Kafka appropriate for use as an unordered task queue?

Kafka splits incoming messages up into partitions, according to the partition assigned by the producer. Messages from partitions then get consumed by consumers in different consumer groups.

This architecture makes me wary of using Kafka as a work/task queue, because I have to specify the partition at time of production, which indirectly limits which consumers can work on it because a partition is sent to only one consumer in a consumer group. I would rather not specify the partition ahead of time, so that whichever consumer is available to take that task can do so. Is there a way to structure partitions/producers in a Kafka architecture where tasks can be pulled by the next available consumer, without having to split up work ahead of time by choosing a partition when the work is produced?

Using only one partition for this topic would put all the tasks in the same queue, but then the number of consumers is limited to 1 per consumer group, so each consumer would have to be in a different group. Then every task gets delivered to every consumer group, though, which is not the kind of work queue I'm looking for.

Is Apache Kafka appropriate for use as a task queue?

On a side note: your problem can be solved using Apache Pulsar which has a shared topic-consumer subscription. See pulsar.apache.org/docs/latest/getting-started/…

Ofer Eliassaf

Using Kafka for a task queue is a bad idea. Use RabbitMQ instead, it does it much better and more elegantly.

Although you can use Kafka for a task queue, you will run into some issues. By design, Kafka does not allow a single partition to be consumed by more than one consumer in the same consumer group, so if, for example, a single partition fills up with many tasks and the consumer that owns the partition is busy, the tasks in that partition will starve. This also means that the order in which tasks in the topic are consumed will not be identical to the order in which they were produced, which might cause serious problems if the tasks need to be consumed in a specific order. (In Kafka, to fully achieve that you must have only one consumer and one partition, which means serial consumption by just one node. If you have multiple consumers and multiple partitions, the order of task consumption is not guaranteed at the topic level.)

In fact, Kafka topics are not queues in the computer science sense. A queue means first in, first out, and that is not what you get in Kafka at the topic level.

Another issue is that it is difficult to change the number of partitions dynamically. Adding or removing workers should be dynamic. If you want to ensure that new workers will get tasks in Kafka, you have to set the number of partitions to the maximum possible number of workers. This is not elegant.
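To illustrate the partition cap described above, here is a minimal sketch in plain Python (no Kafka client involved). The function name `assign_partitions` and the round-robin strategy are assumptions made for illustration; real consumer groups use configurable assignors, but each partition still goes to at most one consumer in the group.

```python
def assign_partitions(num_partitions, consumers):
    """Hypothetical round-robin assignment: each partition gets exactly one owner."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        owner = consumers[p % len(consumers)]
        assignment[owner].append(p)
    return assignment


if __name__ == "__main__":
    # Six workers but only four partitions: two workers receive nothing.
    workers = ["w1", "w2", "w3", "w4", "w5", "w6"]
    result = assign_partitions(4, workers)
    idle = [w for w, parts in result.items() if not parts]
    print(result)
    print("idle workers:", idle)
```

With four partitions and six workers, two workers stay idle no matter how many tasks are waiting, which is why the partition count has to be sized for the largest worker pool you expect.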

So the bottom line - use RabbitMQ or other queues instead.

Having said all of that, Samza (by LinkedIn) uses Kafka as a sort of streaming-based task queue.

Edit: scale considerations: I forgot to mention that Kafka is a big data/big scale tool. If your job rate is huge, then Kafka might be a good option for you despite the things I wrote earlier, since dealing with huge scale is very challenging and Kafka is very good at that. If we are talking about smaller scales (say, up to a few dozen or a few hundred jobs per second), then again Kafka is a poor choice compared to RabbitMQ.


It might also be worth mentioning that committing offsets quickly gets complex when you have to handle failing tasks that need retrying.
"in Kafka to fully achieve that you must have only one consumer and one partition" is incorrect. Order is guaranteed within each partition of a topic, based on the partition key. So if order matters, you need to partition by the value on which order matters. These are actually stronger ordering guarantees than RabbitMQ's, which can have only one consumer if ordering is to be guaranteed.
One consumer per partition, not per topic. The issue exists in RabbitMQ as well. If you want messages to be processed in guaranteed order, then you can only have one consumer for that queue. You cannot process work in order with parallel consumers.
Kafka's main advantage is in streaming huge amounts of data. If you are not streaming huge amounts of data, Kafka is probably a bad choice.
Order is not guaranteed in any meaningful way when you have multiple consumers. What if one consumer fails and the task gets requeued? What if consumer A finishes a task before consumer B, even though they received them in the opposite order? Kafka has ironclad ordering guarantees. The vast majority of message queues do not, including RabbitMQ, unless you have a single producer and a single consumer.
Marko Bonaci

I would say that this depends on the scale. How many tasks do you anticipate in a unit of time?

What you describe as your end goal is basically how Kafka works by default. When you produce messages, the default (and most widely used) option is to let the partitioner choose the partition. For messages without a key it spreads them across partitions in round-robin fashion, keeping the partitions evenly used (so it's possible to avoid specifying a partition).
The main purpose of partitions is to parallelize the processing of messages, so you should use them in that manner.
The other common use of partitions is to ensure that certain messages get consumed in the same order as they were produced. You then specify a partitioning key in such a way that all such messages end up in the same partition. For example, using userId as the key would ensure that all messages for a given user are processed in order.
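The two partitioning modes described above can be modelled in a few lines of plain Python. This is only a sketch: the class name `SketchPartitioner` is hypothetical and `zlib.crc32` is a stand-in hash. The real Java client hashes keys with murmur2, and newer clients use a "sticky" strategy for keyless messages, so treat this as a model of the idea and not of the actual client.

```python
import itertools
import zlib


class SketchPartitioner:
    """Toy model: round-robin for keyless messages, hash-based for keyed ones."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._counter = itertools.count()

    def partition(self, key=None):
        if key is None:
            # No key: spread messages evenly across partitions.
            return next(self._counter) % self.num_partitions
        # Keyed: the same key always maps to the same partition,
        # so messages for one key keep their relative order.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions


if __name__ == "__main__":
    p = SketchPartitioner(4)
    print([p.partition() for _ in range(6)])
    print(p.partition("user-42") == p.partition("user-42"))
```

Keyless messages cycle through the partitions, while every message keyed with the same userId lands in one partition and is therefore read by one consumer in order.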


Thanks for your answer Marko, maybe we can get to the bottom of this with an example. So say we have 20 partitions and 2 workers, and 100 new jobs come in. With round robin, the job messages get distributed 5 to each partition, and then each consumer gets 10 partitions, which is 50 jobs. Say that one consumer's 50 jobs take 100 milliseconds (for all of them combined), but the other consumer's 50 jobs take 2 minutes. Will the consumer that finished early be able to help out the overloaded consumer? Does Kafka make some kind of assumption about equal job difficulties?
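The arithmetic in this scenario can be checked with a small simulation in plain Python. It is a sketch under stated assumptions: per-job costs of 2 ms and 2400 ms are chosen so that the totals match the 100 ms and 2 minutes in the example, consumer 0 is assumed to own partitions 0-9 and consumer 1 partitions 10-19, and the "shared queue" is a hypothetical broker where the next free worker takes the next job.

```python
import heapq

NUM_PARTITIONS = 20
NUM_JOBS = 100
FAST_MS, SLOW_MS = 2, 2400  # assumed: 50 * 2 ms = 100 ms, 50 * 2400 ms = 2 min


def job_cost(job_id):
    """Jobs landing in partitions 0-9 are fast, the rest are slow (worst case)."""
    partition = job_id % NUM_PARTITIONS  # round-robin production
    return FAST_MS if partition < 10 else SLOW_MS


def static_makespan():
    """Each consumer processes only the partitions it owns."""
    totals = [0, 0]
    for job in range(NUM_JOBS):
        owner = 0 if job % NUM_PARTITIONS < 10 else 1
        totals[owner] += job_cost(job)
    return max(totals)


def shared_queue_makespan(num_workers=2):
    """Whichever worker becomes free first takes the next job."""
    free_at = [(0, w) for w in range(num_workers)]
    heapq.heapify(free_at)
    finish = 0
    for job in range(NUM_JOBS):
        t, w = heapq.heappop(free_at)
        t += job_cost(job)
        finish = max(finish, t)
        heapq.heappush(free_at, (t, w))
    return finish


if __name__ == "__main__":
    print(static_makespan())        # 120000 ms
    print(shared_queue_makespan())  # 60050 ms
```

Under these assumptions the fixed partition ownership takes the full 2 minutes because the fast consumer sits idle, while a next-available-worker queue finishes in about half that time.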
Hey Marko, I think my last question in that comment got to the heart of the issue here, if you can just add some more detail for that, then I'll definitely accept your answer!
Any of those 100 messages would go to a random partition and would get picked up by one of those two (i.e. random) Consumers, then the second message, then the third, ... so it's not like each Consumer will get a bulk of 50 messages, i.e. they "help each other out". But why would you limit yourself to only 2 Consumer threads? Also, you would commit the offset only after each message is processed, to make sure you don't lose any messages if processing is unsuccessful.
Rodney P. Barbati

There is a lot of discussion in this topic revolving around order of execution of tasks in a work or task queue. I would put forth the notion that order of execution should not be a feature of a work queue.

A work queue is a means of controlling resource usage by applying a controllable number of worker threads towards completion of distinct tasks. Enforcing a processing order on tasks in a queue means you are also enforcing a completion order on tasks in the queue which effectively means that tasks in the queue would always be processed sequentially with the next task being processed only after the END of the preceding task. This effectively means you have a single threaded task queue.

If order of execution is important in some of your tasks, then those tasks should add the next task in the sequence to the work queue upon its completion. Either that or you support a Sequential Job type which when processed actually processes a list of jobs sequentially on one worker.
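A minimal sketch of the first suggestion, where each task enqueues its successor when it completes. Everything here is hypothetical (the function name `run_chain_demo`, the task names, and the single in-process worker loop standing in for a real work queue); it only shows that the queue itself never needs to know about ordering.

```python
from collections import deque


def run_chain_demo():
    """Run tasks A1 -> A2 -> A3 as a chain, interleaved with unrelated tasks B and C."""
    queue = deque()
    log = []

    def make_step(name, next_step=None):
        def task():
            log.append(name)  # stand-in for the real work
            if next_step is not None:
                queue.append(next_step)  # schedule the successor on completion
        return task

    step3 = make_step("A3")
    step2 = make_step("A2", step3)
    step1 = make_step("A1", step2)

    queue.append(step1)
    queue.append(make_step("B"))
    queue.append(make_step("C"))

    # A single worker loop; with several workers the A-steps would still
    # run in order, because A2 does not exist in the queue until A1 ends.
    while queue:
        queue.popleft()()
    return log


if __name__ == "__main__":
    print(run_chain_demo())  # ['A1', 'B', 'C', 'A2', 'A3']
```

The A-steps keep their relative order even though unrelated tasks run in between, and the queue stays a plain "next task to the next worker" structure.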

In no way should the work queue actually order any of its work - the next available processor should always take the next task with no regards to what has occurred prior to or after the task completes.

I was also looking at kafka as a basis for a work queue, but the more I research it, the less it looks like the desired platform.

I see it mainly being used as a means of synchronizing disparate resources and not so much as a means of executing disparate job requests.

Another area that I think is important in a work queue is the support of a prioritization of tasks. For example, if I have 20 tasks in the queue, and a new task arrives with a higher priority, I want that task to jump to the start of the line to be picked up by the next available worker. Kafka would not allow this.
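For comparison, this is roughly what the prioritization described above looks like in an in-memory work queue, sketched with Python's `heapq`. The class name `PriorityWorkQueue` and the "larger number means more urgent" convention are assumptions for illustration; an append-only Kafka partition has no equivalent of this reordering.

```python
import heapq
import itertools


class PriorityWorkQueue:
    """Toy work queue: highest priority first, FIFO among equal priorities."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def put(self, task, priority=0):
        # Negate the priority so larger values pop first; the sequence
        # number keeps arrival order for tasks with the same priority.
        heapq.heappush(self._heap, (-priority, next(self._seq), task))

    def get(self):
        return heapq.heappop(self._heap)[2]


if __name__ == "__main__":
    q = PriorityWorkQueue()
    for i in range(20):
        q.put(f"task-{i}")
    q.put("urgent", priority=10)
    print(q.get())  # urgent
    print(q.get())  # task-0
```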


adamw

There are two main obstacles in trying to use Kafka as a message queue:

1. As described in Ofer's answer, a single partition can only be consumed by a single consumer, and order of processing is guaranteed only within a partition. So if you can't distribute the tasks fairly across partitions, this might be a problem.
2. By default, you can only acknowledge processing of all messages up to a given point (offset). Unlike in traditional message queues, you can't do selective acknowledgment and, in case of failure, selective retries. This can be addressed by using kmq, which adds individual ack capability with the help of an additional topic (disclaimer: I'm the author of kmq).
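The offset-based acknowledgment in the second point can be sketched as follows. The helper `next_commit_offset` is hypothetical and not part of any Kafka client; it assumes the usual convention that the committed offset is the position of the next message to read, and it shows why one failed message holds back everything after it.

```python
def next_commit_offset(committed, processed):
    """Return the offset that is safe to commit.

    `committed` is the current committed position (next offset to read);
    `processed` is the set of offsets whose tasks finished successfully.
    Only a contiguous run of processed offsets can be acknowledged.
    """
    offset = committed
    while offset in processed:
        offset += 1
    return offset


if __name__ == "__main__":
    # Offsets 0-2 and 4-5 succeeded, offset 3 failed.
    print(next_commit_offset(0, {0, 1, 2, 4, 5}))  # 3
    # After offset 3 is retried successfully, the rest can be committed.
    print(next_commit_offset(3, {3, 4, 5}))        # 6
```

Offsets 4 and 5 are done but cannot be acknowledged on their own. If the consumer restarts before offset 3 succeeds, they are delivered again, which is the gap that per-message acks are meant to close.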

RabbitMQ is an alternative of course, but it also gives different (lower) performance and replication guarantees. In short, RabbitMQ docs state that the broker is not partition tolerant. See also our comparison of message queues with data replication, mqperf.


Jing is coding

I am developing a library that implements a job queue on top of Kafka: https://github.com/JingIsCoding/kafka-job-queue. It uses multiple queues to hold tasks that are ready to be processed, future tasks, and dead tasks. Contributions are welcome.