app.openiap.io has been down a LOT over the last 2 days.
It has been a pain to figure out what was wrong, but I think I finally “nailed” it.
History is generated using jsondiffpatch, an absolutely wonderful and brilliant package, but it finally met its match. If you feed it a document with an array of more than 57,395 entries, it uses all the RAM, even if you give it 6 gigabytes, and that is exactly what I was doing while running housekeeping on my users role.
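To give a rough idea (this is just a sketch, not the actual history code; the threshold and helper names are made up): jsondiffpatch has to match array members against each other, so one huge array is enough to exhaust the heap, and the obvious workaround is to skip the diff for documents like that.

```typescript
import * as jsondiffpatch from "jsondiffpatch";

// jsondiffpatch instance roughly like one used for history diffs.
// objectHash lets it match array members by _id instead of by position.
const differ = jsondiffpatch.create({
    objectHash: (obj: any) => obj._id || JSON.stringify(obj),
});

// Illustrative guard: skip the diff entirely when a document contains a
// very large array, since array matching is what blows up the heap.
const MAX_ARRAY_LENGTH = 10000; // hypothetical cutoff, not the real one

function hasHugeArray(doc: any): boolean {
    return Object.values(doc).some(
        (value) => Array.isArray(value) && value.length > MAX_ARRAY_LENGTH
    );
}

export function safeDiff(oldDoc: any, newDoc: any): any {
    if (hasHugeArray(oldDoc) || hasHugeArray(newDoc)) {
        return undefined; // no delta: store a full snapshot (or nothing) instead
    }
    return differ.diff(oldDoc, newDoc);
}
```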
The users role has been kind of a problem for me since OpenFlow started to gain traction. Keeping it up to date started getting harder once I got over 2,000 users, and with more than 50,000 it just got too big.
A long time ago, I moved the “users” logic out of the actual role, but I kept updating the role object in the collection too, “just to be safe” … I have now removed that, to save IO in the database and to avoid more issues with RAM. It makes no sense to have an object that big inside MongoDB anyway.
This is getting embarrassing.
It crashes one or two times an hour, either because the database stops responding or because one of the API nodes uses all the RAM on the host it’s running on.
A long time ago I added a heapdump_onstop setting so I can debug what is wrong, but the dump takes so long to write that Kubernetes kills the pod before it finishes … ARGH …
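For reference, the idea behind heapdump_onstop is roughly this (my sketch, not the actual OpenFlow code): write a heap snapshot when the pod gets SIGTERM. Writing a multi-gigabyte snapshot is slow and synchronous, so Kubernetes’ termination grace period (30 seconds by default) runs out and the pod is SIGKILLed before the dump is finished.

```typescript
import * as v8 from "v8";

// Sketch of a heapdump-on-stop hook: write a heap snapshot when the pod is
// asked to shut down, so the dump can be inspected afterwards.
// v8.writeHeapSnapshot() blocks the event loop and, on a heap of several GB,
// can easily take longer than Kubernetes' default 30s grace period, at which
// point the container is SIGKILLed and the file is left incomplete.
if (process.env.heapdump_onstop === "true") {
    process.on("SIGTERM", () => {
        const filename = `/tmp/heap-${Date.now()}.heapsnapshot`;
        console.log(`writing heap snapshot to ${filename}`);
        v8.writeHeapSnapshot(filename); // synchronous, blocks until done
        process.exit(0);
    });
}
```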
Anyway, I’m sorry to anyone affected by this; I’m trying really hard to figure out what is going on.
I think I managed to get it solved now.
And I did not need to change any code. For some strange reason, MongoDB now requires almost 3 times as much memory as it used to.
It used to crash 2-3 times an hour … but I have only seen one crash in the last 24 hours.
God damn it, this has been driving me nuts.
But I finally found one more thing. I kept having a crash 3 times a day, always at the same “time” of day, so either someone external was running something on a schedule or my housekeeping job was having issues.
It turns out my housekeeping job was the problem. My “calculate db usage” job was making MongoDB run out of memory and crash.
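For anyone curious, a usage calculation like that boils down to an aggregation along these lines (a sketch with made-up collection and field names, not the exact job): grouping every document happens in memory unless you let the stage spill to disk, which is the kind of thing that keeps a job like this from driving mongod out of memory.

```typescript
import { Db } from "mongodb";

// Sketch of a per-user "db usage" style aggregation (collection and field
// names are illustrative). allowDiskUse lets the $group stage spill to disk
// instead of holding the whole grouping in memory.
async function calculateUsage(db: Db) {
    return db
        .collection("entities")
        .aggregate(
            [
                // $bsonSize (MongoDB 4.4+) gives the size of each document in bytes
                { $group: { _id: "$_createdbyid", bytes: { $sum: { $bsonSize: "$$ROOT" } } } },
                { $sort: { bytes: -1 } },
            ],
            { allowDiskUse: true }
        )
        .toArray();
}
```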
This has now been fixed. While working on that, I’m also preparing to add a crude interface for tracking slow queries.
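Nothing fancy; conceptually it is something like this (a sketch only, with a made-up threshold, not the actual interface): turn on MongoDB’s built-in profiler and read the slow operations back out of system.profile.

```typescript
import { MongoClient } from "mongodb";

// Sketch of crude slow-query tracking: enable MongoDB's built-in profiler
// for operations slower than slowms, then read the recorded operations back
// out of the system.profile collection.
async function listSlowQueries(uri: string, dbname: string) {
    const client = new MongoClient(uri);
    await client.connect();
    try {
        const db = client.db(dbname);
        // profiling level 1 = only record operations slower than slowms
        await db.command({ profile: 1, slowms: 250 });
        const slow = await db
            .collection("system.profile")
            .find({ millis: { $gt: 250 } })
            .sort({ ts: -1 })
            .limit(20)
            .toArray();
        for (const op of slow) {
            console.log(op.op, op.ns, `${op.millis}ms`);
        }
    } finally {
        await client.close();
    }
}
```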
I feel like giving up soon.
So I spent HOURS refining my sidecar to better handle MongoDB replica sets on Kubernetes, and I freaking forgot to set the connection string to use the full replica set, so when MongoDB did a failover, app.openiap.io was down for 30 minutes until I saw the email about the site being down … ARGH …
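For anyone else running MongoDB as a replica set on Kubernetes, the difference is just in the connection string (hostnames and set name below are only examples): point the driver at every member and name the set, so it can follow a failover instead of hanging on the old primary.

```typescript
// Single host: works until that node steps down, then the client has nowhere to go.
const singleHost = "mongodb://mongodb-0.mongodb:27017/openflow";

// Full replica set: list every member and name the set with replicaSet=,
// so the driver discovers the topology and reconnects to the new primary
// after a failover.
const fullReplicaSet =
    "mongodb://mongodb-0.mongodb:27017,mongodb-1.mongodb:27017,mongodb-2.mongodb:27017/openflow?replicaSet=rs0";
```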
That is hard to answer. What is the issue you are having? What are the symptoms, and do you see any errors? (My issue is inside Kubernetes: if something uses too much RAM, I don’t get an error, it just gets killed, and I have no way of “tracking” what is using all that RAM.) On Docker this is usually not a problem, since by default it does not enforce resource limits.
Mine is deployed on Docker. OpenFlow goes down and becomes inaccessible, and runs normally again after a restart of the container. Unfortunately, I didn’t capture the Docker logs.