We have two types of queues:
- Message queues
- Work item queues
Message queues are primarily used to let two pieces of code exchange data. That could be a Node.js script talking to a Python script, or Node-RED talking to OpenRPA. This is also how OpenCore notifies clients that there are pending work items, so they do not have to constantly ask OpenCore whether there is any work to be done.
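To make that concrete, here is a minimal sketch of point-to-point messaging using plain amqplib against a local RabbitMQ. The URL, queue name, and payload are only illustrations, and when working against OpenCore you would normally go through its SDK rather than connect to RabbitMQ directly.

```typescript
// Minimal point-to-point messaging sketch with plain amqplib.
// URL, queue name and payload are illustrative.
import * as amqp from "amqplib";

async function main(): Promise<void> {
  const conn = await amqp.connect("amqp://localhost");
  const ch = await conn.createChannel();
  const queue = "demo.tasks"; // hypothetical queue name

  await ch.assertQueue(queue, { durable: false });

  // Consumer: receive and acknowledge messages.
  await ch.consume(queue, (msg) => {
    if (msg === null) return;
    console.log("received:", JSON.parse(msg.content.toString()));
    ch.ack(msg);
  });

  // Producer: send a small JSON message.
  ch.sendToQueue(queue, Buffer.from(JSON.stringify({ command: "run", id: 42 })));
}

main().catch(console.error);
```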
Message queues can also be used to broadcast messages to multiple clients, for example when the API nodes exchange data about caching and online users, or when you notify multiple clients about triggers in OpenRPA.
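A broadcast like that maps to a fanout exchange. Here is a minimal sketch, again with plain amqplib and illustrative names: every client binds its own queue to the exchange, and one publish reaches all of them.

```typescript
// Minimal broadcast sketch using a fanout exchange in plain amqplib.
// Exchange and connection details are illustrative.
import * as amqp from "amqplib";

async function broadcastDemo(): Promise<void> {
  const conn = await amqp.connect("amqp://localhost");
  const ch = await conn.createChannel();
  const exchange = "demo.broadcast"; // hypothetical exchange name

  await ch.assertExchange(exchange, "fanout", { durable: false });

  // Each client binds its own anonymous queue to the exchange.
  const { queue } = await ch.assertQueue("", { exclusive: true });
  await ch.bindQueue(queue, exchange, "");
  await ch.consume(queue, (msg) => {
    if (msg === null) return;
    console.log("broadcast received:", msg.content.toString());
    ch.ack(msg);
  });

  // Publishing to the exchange reaches every bound queue, i.e. every client.
  ch.publish(exchange, "", Buffer.from(JSON.stringify({ trigger: "daily-report" })));
}

broadcastDemo().catch(console.error);
```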
In our system, we use RabbitMQ in a stateless form. This means that if something goes wrong, any pending messages are lost. Message loss is very unlikely, but delivery is not guaranteed. There is no reason, however, you cannot add persistent storage to RabbitMQ and make it stateful (and therefore also increase the timeout).
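As a rough sketch of what "stateful" means at the RabbitMQ level (this is not something OpenCore does out of the box), you would declare the queue as durable, mark messages as persistent, and give the broker persistent storage to write to:

```typescript
// Sketch of durable/persistent RabbitMQ usage with plain amqplib, so messages
// survive a broker restart (assuming the broker itself has persistent storage,
// e.g. a volume in Kubernetes). Names are illustrative.
import * as amqp from "amqplib";

async function durableDemo(): Promise<void> {
  const conn = await amqp.connect("amqp://localhost");
  const ch = await conn.createChannel();
  const queue = "demo.durable"; // hypothetical queue name

  // durable: true -> the queue definition survives a broker restart.
  await ch.assertQueue(queue, { durable: true });

  // persistent: true -> the message itself is written to disk.
  ch.sendToQueue(queue, Buffer.from(JSON.stringify({ keep: "me" })), {
    persistent: true,
  });
}

durableDemo().catch(console.error);
```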
Lastly, RabbitMQ (I don’t know about other message queue systems) does not handle huge messages very well and has a fairly low maximum message size (128 MB by default, with a hard cap of 512 MB). So we generally want to keep messages small and compact.
Work item queues are where we store units of work. This is where we keep state, and there is a mechanism around retry logic, error handling, and reporting. This is also how we can store payloads and multiple files associated with the work. You should always try to keep the payload of a work item below 500 KB, but you can potentially have a payload of up to 16 MB. Hence, we should store bulk data in files and keep only the core information we need in the payload. This also allows us to optimize how we distribute that data to the clients that need to work with the payload and files.
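As a sketch of that shape, keep the decision-making fields in the payload and attach the bulk data as files. The client interface and function names below are hypothetical stand-ins rather than a specific SDK; only the size guidance comes from the text above.

```typescript
// Sketch: small payload with core fields, bulk data attached as files.
// WorkitemClient and pushWorkitem are hypothetical.
interface WorkitemFile {
  filename: string;
  file: Buffer; // the bulk data lives here, not in the payload
}

interface WorkitemClient {
  pushWorkitem(
    queue: string,
    name: string,
    payload: Record<string, unknown>,
    files?: WorkitemFile[]
  ): Promise<void>;
}

async function queueInvoice(client: WorkitemClient, pdf: Buffer): Promise<void> {
  // Keep only the fields the worker needs to decide what to do.
  const payload = { invoiceId: "INV-1001", customer: "ACME", amount: 1234.5 };

  const size = Buffer.byteLength(JSON.stringify(payload));
  if (size > 500 * 1024) {
    throw new Error(`payload is ${size} bytes; move bulk data into files instead`);
  }

  await client.pushWorkitem("invoices", "process invoice INV-1001", payload, [
    { filename: "invoice.pdf", file: pdf },
  ]);
}
```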
So how did we end up here? It’s kind of been a process.
Originally the idea was that, “out of the box,” you use a default RabbitMQ without state enabled. This would make it easy to get started, and the default timeout was also fine. If a timeout happens, you are responsible for retrying if you want to.
I made a whole system similar to our current work item queues, based on Node-RED workflows and Grafana reports and triggers, that I was helping people set up.
Since RabbitMQ is stateless, we don’t want a long timeout; we want to “catch it” when something goes wrong. For example, if RabbitMQ restarts, any in-flight messages are simply gone; without a timeout you would never know, and everything would hang forever.
The thinking was that if people needed it, we could set up state in RabbitMQ and increase the timeout, and let RabbitMQ do what RabbitMQ is good for: store and retry messages.
But this would not give us as much fine-grained control as the work item queues do; it would also make it pretty hard to show the things we currently show inside the UI. (I could create RabbitMQ-specific API wrappers, but then we would no longer have the freedom to choose which message queue server we want, and I don’t like that.)
I was constantly getting complaints that we didn’t have an “out of the box” queue system with state, a place where we could never lose a message or its state. People also kept sending huge amounts of data over the message queues, even base64-encoding files, and then complained that it was not performing very well, that they got errors when messages were too big, and that the system felt “unstable” when RabbitMQ crashed and restarted. And no one used the framework I posted on GitHub showing how to use MongoDB for state, so I decided to add “work item queues.”
Once we had those, it also changed how we should tell OpenRPA what to do. Instead of sending a message over RabbitMQ to one robot, or a pool of robots (with the risk of the message or the reply getting lost if RabbitMQ restarts), we can use work item queues to tell the robots (or other code and services) what to do, and also easily keep track of the work and even retry when something goes wrong. That completely eliminates the issues with sending too much data over the message queue.
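The consumer side of that pattern looks roughly like this. Again, the client interface and state names are hypothetical; the point is that retry and error handling live in the work item queue rather than in RabbitMQ or your own code.

```typescript
// Sketch of a worker: pop a work item, do the work, then mark it successful
// or hand it back for retry. WorkitemQueueClient and the state names are hypothetical.
interface Workitem {
  id: string;
  payload: Record<string, unknown>;
}

interface WorkitemQueueClient {
  popWorkitem(queue: string): Promise<Workitem | null>;
  updateWorkitem(
    id: string,
    state: "successful" | "retry",
    errormessage?: string
  ): Promise<void>;
}

async function workOnce(client: WorkitemQueueClient): Promise<void> {
  const item = await client.popWorkitem("invoices");
  if (item === null) return; // nothing to do right now

  try {
    // ... do the actual work with item.payload (and any attached files) ...
    await client.updateWorkitem(item.id, "successful");
  } catch (err) {
    // Handing it back as "retry" lets the queue apply its own retry logic.
    await client.updateWorkitem(item.id, "retry", String(err));
  }
}
```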
So why do we still have message queues? Why do we still have the RPA node in Node-RED?
Well, first of all, message queues and exchanges are an integral part of how OpenCore works, and they are a wonderful tool to use in your code and designs. They are just not the tool for everything. Use message queues when state is not vital, but speed is, and you can handle outages yourself. Use work item queues when state is vital and you don’t want to handle outages yourself, or when you want to send a large amount of data between clients.
I kept the OpenRPA node mostly for backward compatibility, but sometimes it’s also handy to be able to just tell a robot to do something without needing to set up a message queue and a few OpenRPA workflows. You just need to know what the caveats are.