"Error: jwt expired" in api (mongos, workflow) nodes

Hi Allan, how is it going?

Context:
We have some environments that have been running perfectly fine for 1yr+ (we are on version 1.3.55) in Docker containers, but they are quite limited in the high-availability sense.

Then recently (3-4 months ago), we upgraded from the Docker solution to a Kubernetes (k8s, AWS) solution with a minimum of X pods per application (openflow, nodered) and a maximum of Y pods, for scalability and redundancy.

Problem:
One of these environments hosts OpenRPA bots (the rest don’t), and only in this environment are we seeing some weird behaviour related to JWTs in the ‘api_nodes’ (workflow in, mongo api get/delete/insert, workflow bot, workflow out) in Node-RED: the JWTs are expiring and are not being refreshed.

How we reproduce it
We basically just leave everything running: OpenRPA initiates a Workflow In in Node-RED, which triggers some api get calls on MongoDB and so on, and after 15 to 25 minutes the ‘Error: jwt expired’ errors start. Our only solution is to restart the OpenFlow application as a whole to refresh everything.

What we attempted so far:
As this environment is well protected (not exposed to the internet, etc.), we attempted to simply extend the JWT duration to a very long one, but no luck; the tokens still expire after 15-25 minutes of usage.

    Config.shorttoken_expires_in = Config.getEnv("shorttoken_expires_in", "365d");
    Config.longtoken_expires_in = Config.getEnv("longtoken_expires_in", "365d");
    Config.downloadtoken_expires_in = Config.getEnv("downloadtoken_expires_in", "365d");
    Config.personalnoderedtoken_expires_in = Config.getEnv("personalnoderedtoken_expires_in", "365d");
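
One way to sanity-check that freshly issued tokens actually picked up the extended lifetimes is to decode one and look at its exp claim. A minimal Node.js sketch, assuming the tokens are standard three-part JWTs (SAMPLE_TOKEN is just a placeholder for a token issued after the change):

    // Sanity-check sketch (assumption: tokens are standard three-part JWTs and
    // SAMPLE_TOKEN holds a token issued after the config change was applied).
    const token = process.env.SAMPLE_TOKEN;
    const payload = JSON.parse(
        Buffer.from(token.split(".")[1], "base64").toString("utf8")
    );
    console.log("token valid for another",
        payload.exp - Math.floor(Date.now() / 1000), "seconds");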

We are aware that our version is quite far behind the current stable version, but since we only recently finished migrating our backend infrastructure, we were trying to stabilize it before upgrading the stack version, as that may require extra attention to ensure backwards compatibility and may affect our clients.

I am asking purely because someone else might have encountered the same issue and found a fix, or you, Allan, may point to something we haven’t looked at yet.

Thank you in advance for your time,

Yours,
Thiago.

I must admit, I’m still not sure what happened to trigger this error,
but here is what I often see:

Using form workflows: when a user (or openrpa) starts a form workflow and triggers “workflow in”, it creates a new, longer-lasting jwt (it has changed a few times, but I think in your version it is 15 minutes). If the workflow you run after that takes longer, it will expire. You can use “get jwt” to “refresh it” (not sure if that feature was available in your version?).
Another workaround is to just run as nodered and delete the token in a function node with:

delete msg.jwt;

In later versions of openflow I updated this to be a long token (the 1-year one) to bypass this.
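
A slightly less blunt variant of that workaround is to only drop the token when it is close to expiring, so the api nodes fall back to running as nodered only when they have to. A minimal function-node sketch, assuming msg.jwt is a standard three-part JWT with an exp claim (the 60-second margin is just an illustrative value):

    // Node-RED function node sketch: keep msg.jwt while it is still valid, drop it
    // when it is about to expire so the api nodes run as the Node-RED user instead.
    if (msg.jwt) {
        try {
            const payload = JSON.parse(
                Buffer.from(msg.jwt.split(".")[1], "base64").toString("utf8")
            );
            const secondsLeft = payload.exp - Math.floor(Date.now() / 1000);
            if (secondsLeft < 60) {   // illustrative safety margin
                delete msg.jwt;
            }
        } catch (e) {
            delete msg.jwt;           // unreadable token: discard it
        }
    }
    return msg;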

Was that the case? Or can you explain, step by step, an example of what happens to “trigger” the error?
PS: please note that if you upgrade openflow, then Kubernetes is no longer part of the open-source solution.

Hi Allan!

I’ve found a small subflow that may be easier to troubleshoot.

This subflow, for example, is composed of:
1- http in
2- function (just validate if rest api payload is valid etc)
3- switch for status code (wire1: valid 200, wire:2 invalid 400)
4- api get (mongo)

I like this subflow because it does not contain any “complex” nodes (complex in the sense of non-native Node-RED nodes) other than the single ‘api get’.
I can assure you the function script/node does not mess with msg.jwt (or msg.__jwt) at all; the .jwt property does NOT exist on the message being fed into ‘api get’.
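
For reference, a simplified sketch of that validation function node (field names are illustrative, not our actual payload), just to show it only touches msg.payload and msg.statusCode:

    // Simplified sketch of the validation function node (field names illustrative).
    // It only reads msg.payload and sets msg.statusCode; msg.jwt is never touched.
    if (msg.payload && typeof msg.payload.id === "string") {
        msg.statusCode = 200;
    } else {
        msg.statusCode = 400;
    }
    return msg;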

Yet the ‘api get’ node is returning the ‘jwt expired’ exception.

Our only fix when this happens is to restart OpenFlow + Node-RED as a whole.

Does this ring any bells about what may be causing it? Any ideas on what extra information I could gather for troubleshooting? The environment was healthy for about 7 to 8 hours until this subflow started throwing errors.

(And I know this must be a very corner-case issue, as our other 3 environments are running perfectly fine, on the same backend/cloud architecture and the same version, 1.3.55.)

Hi Allan,

Additional troubleshooting information:

  • The JWT issue seems to be related to MongoDB being throttled/slow.
    After we vertically scaled MongoDB by a factor of 4x and did some housekeeping (on our end), the issue seems to be gone.

With the environment cleaner, MongoDB seems more responsive, and it has now been 3-4 days since our last ‘jwt expired’ error. Our best guess is that the issue may be related to a token refresh that isn’t happening properly in v1.3.55 when the database is slow.
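
For anyone hitting something similar, the MongoDB profiler is one way to confirm whether the database is the slow link; a mongo shell sketch (the 100 ms threshold is just an assumption to tune):

    // Mongo shell sketch: log operations slower than 100 ms (assumed threshold),
    // then list the most recent slow ones to confirm or rule out throttling.
    db.setProfilingLevel(1, { slowms: 100 });
    // ...after letting it run for a while:
    db.system.profile.find({ millis: { $gt: 100 } }).sort({ ts: -1 }).limit(10);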

Additionally (not completely off-topic, as it is related to good MongoDB collection maintenance), could you provide more information about these two collections?

  • workflow_instances (+_hist)
  • openrpa_instances (+_hist)

Q: What are the criteria for a document being moved from ‘workflow_instances / openrpa_instances’ to its _hist counterpart, and when does that happen?
Reason: There are a LOT of documents (0.5M+) in our “openrpa_instances” collection, and most (99%+) currently have “isCompleted: True”.

Sorry, forgot about this thread.
Glad you found a solution (?) …

workflow_instances_hist and openrpa_instances_hist should not exist; those collections should be in your skip_history_collections if you are setting it yourself (the default is audit,oauthtokens,openrpa_instances,workflow_instances,workitems,dbsize,mailhist).
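
If you are setting it yourself, something along these lines (a sketch in the same style as the config lines earlier in the thread, using the default list) keeps those *_hist copies from being written:

    // Sketch: override skip_history_collections while keeping the default list,
    // so history copies of the instance collections are not written.
    Config.skip_history_collections = Config.getEnv(
        "skip_history_collections",
        "audit,oauthtokens,openrpa_instances,workflow_instances,workitems,dbsize,mailhist"
    );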

openrpa_instances contains each openrpa workflow that has been run; while the workflow is running it also contains the state (bookmarks, placement, variables, etc.). workflow_instances contains all workflow instances started with the “workflow in” node … until cleared with a “workflow out” set to completed and clear state, all objects in msg will be saved in the instance.

Hi!

I can see all this information (variables, state, etc.) and so far so good; I’m just worried about the number of documents (over 500k) that have piled up over the months…

Could you provide more information about the actual trigger that removes/deletes documents from “openrpa_instances”? I am asking because I saw a lot of documents with both “isCompleted: True” and “state: failed/completed/etc” and they were still there - what are the criteria for this specific housekeeping?

They are not deleted; they are kept for reporting purposes. If you don’t need that, just delete them once in a while.
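
If you don’t need them, a mongo shell sketch for the cleanup (the 30-day cutoff and the _modified timestamp field are assumptions; adjust them to whatever retention you actually want for reporting):

    // Mongo shell sketch: delete completed openrpa instances older than 30 days.
    // The cutoff and the _modified field are assumptions; keep whatever you still
    // need for reporting.
    var cutoff = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);
    db.openrpa_instances.deleteMany({ isCompleted: true, _modified: { $lt: cutoff } });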

Allan,

Thanks for the response.
I will let this discussion close as my problem seems to be solved.

Just to summarize, I am still unsure whether the deeper root cause is something specific to 1.3.55 when it is due to refresh its token, but the “intermediate root cause” was definitely MongoDB being slow/throttled, and better housekeeping/hardware fixed the issue.

Thanks once again for your help detailing the inner workings of the solution; others will definitely read this and find the information as valuable as it was to me!

Have a nice evening.
