Announcing beta for new Node.js, Python, .NET, and Rust SDKs

And then I will quickly make an update so the code will auto-detect port 50051 as being gRPC …

Ok, sorry for the spam, I just tested, and it will NOT work …
I'm quickly making a fix and publishing a new version.

gRPC seems to ignore the port:

2024-10-11T09:59:55.654301Z DEBUG openiap_clib: Tracing enabled
2024-10-11T09:59:55.654325Z  INFO openiap_clib: enable_tracing rust_log: "openiap=trace", tracing: "new"
2024-10-11T09:59:55.655218Z  INFO connect_async: openiap_clib: new
2024-10-11T09:59:55.655237Z DEBUG connect_async: openiap_clib: connect_async
2024-10-11T09:59:55.655244Z  INFO connect_async:c_char_to_str: openiap_clib::safe_wrappers: new
2024-10-11T09:59:55.655252Z DEBUG connect_async: openiap_clib: server_address = "grpc://api:50051"
2024-10-11T09:59:55.655257Z  INFO connect_async:safe_wrapper: openiap_clib::safe_wrappers: new
2024-10-11T09:59:55.655263Z TRACE connect_async: openiap_clib: Spawn the async task
2024-10-11T09:59:55.655268Z  INFO connect_async:connect: openiap_client: new
2024-10-11T09:59:55.655631Z  INFO connect_async:connect:connect_async: openiap_client: new
2024-10-11T09:59:55.655677Z  INFO connect_async:connect:connect_async:is_connect_called: openiap_client: new
2024-10-11T09:59:55.655722Z  INFO connect_async:connect:connect_async: openiap_client: Connecting to http://api
2024-10-11T09:59:55.656888Z ERROR connect_async:connect:connect_async: openiap_client: Failed to get config: Connection refused (os error 111)
2024-10-11T09:59:55.658140Z TRACE openiap_clib: Client::Calling callback with result
Unhandled exception. Client+ClientCreationError: Connection failed: ClientError("Failed to connect: transport error")
   at Program.InitializeOpeniapClient() in /app/package/package/main.cs:line 117
   at Program.Main(String[] args) in /app/package/package/main.cs:line 18
   at Program.<Main>(String[] args)

I have now published version 0.0.6, which assumes http with port 50051 is unsecured gRPC … and no longer removes the port.

rm -rf bin && dotnet nuget locals all --clear && dotnet add package openiap -s ../openiap/dotnet/packages && dotnet run 
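
For the curious, the new behaviour is roughly this (a minimal sketch with illustrative names, not the actual openiap-client code):

/// Illustrative only: treat grpc:// or an explicit :50051 as unsecured gRPC
/// and keep the port instead of stripping it from the URL.
fn normalize_server_address(address: &str) -> String {
    let (scheme, rest) = match address.split_once("://") {
        Some((scheme, rest)) => (scheme, rest),
        None => ("grpc", address),
    };
    let has_grpc_port = rest.ends_with(":50051");
    if scheme == "grpc" || (scheme == "http" && has_grpc_port) {
        // Unsecured gRPC: talk plain HTTP/2, but keep the explicit port.
        let host = if has_grpc_port { rest.to_string() } else { format!("{rest}:50051") };
        format!("http://{host}")
    } else {
        address.to_string()
    }
}

With that, grpc://api:50051 resolves to http://api:50051 instead of http://api, which is exactly what the log above showed failing.
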
1 Like

Thank you 🙂
Updated to 0.0.6; it still connects fine from outside, and now it also connects from within.

I'll let it run over the weekend and we'll see what happens 🙂

Sidenote: I've removed the manual reconnect code and changed the retry logic to wait 12000 ms at first (increasing on subsequent retries) to accommodate the 10 s reconnect delay.
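
Roughly this shape, as a sketch (generic code, not the actual implementation; the 60 s cap is just an arbitrary example):

use std::time::Duration;
use tokio::time::sleep;

// Sketch only: wait 12 s before the first retry so the server's 10 s
// reconnect delay has passed, then increase the wait on every failure.
async fn connect_with_retry<F, Fut, T, E>(mut connect: F) -> T
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut delay = Duration::from_millis(12_000);
    loop {
        match connect().await {
            Ok(client) => return client,
            Err(_) => {
                sleep(delay).await;
                // Back off further on each failed attempt (capped here at 60 s).
                delay = (delay * 2).min(Duration::from_secs(60));
            }
        }
    }
}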

1 Like

Hey @AndrewK, thank you for the feedback on Monday…
We talked about two issues, and I'm pretty sure both are now fixed.
I need to wrap up something, and then I will push a new version of the SDKs.

Rust APIs sometimes use 100% CPU after a disconnect

Man, this one was brutal. You suggested it was a Docker-only issue, and for a long time that actually seemed to be true (it was not, but it was much easier to provoke inside Docker).

  • I rewrote my .NET test project to run using devContainers and to simulate your test of just calling PopWorkitem over and over again. I then started disconnecting/reconnecting the network. After 5 to 10 tries, I could trigger the issue, but not debug it.
  • I added support for running my Rust repo in devContainers. I then added a test to my Rust CLI that loops over non-stop calls to PopWorkitem. I then started disconnecting/reconnecting the network. After 5 to 10 tries, I could trigger the issue, but debugging it was painful.

After a connection is made, I start two threads (not really threads, but tasks in the Tokio async framework, which made debugging this so much harder): one for handling incoming messages and one for handling outgoing messages. It turns out that when the 100% CPU usage bug happens, only one of those is still running. That seems to be fixed now (I cannot reproduce it anymore, even after doing a stupid amount of network disconnects/reconnects), but then I noticed something else. I was seeing the same issue you reported a while back…
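
For context, the shape of it is roughly this (a minimal sketch of the two-task pattern, not the actual openiap-client code):

use tokio::sync::mpsc;

// Minimal sketch of the two-task split described above (illustrative only):
// one task drains frames coming in from the server, one drains frames queued for sending.
async fn start_io_tasks(
    mut incoming: mpsc::Receiver<Vec<u8>>,
    mut outgoing: mpsc::Receiver<Vec<u8>>,
) {
    let reader = tokio::spawn(async move {
        // Handles incoming messages until the channel closes on disconnect.
        while let Some(frame) = incoming.recv().await {
            println!("received {} bytes", frame.len());
        }
    });

    let writer = tokio::spawn(async move {
        // Handles outgoing messages until the channel closes on disconnect.
        while let Some(frame) = outgoing.recv().await {
            println!("sending {} bytes", frame.len());
        }
    });

    // The 100% CPU bug showed up when only one of these survived a reconnect,
    // so both are awaited (and must be torn down) together.
    let _ = tokio::join!(reader, writer);
}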

gRPC disconnects after 60 seconds

The 60-second timeout on gRPC but not websockets… again, you were right. It WAS a Docker problem, not something local to you.
I'm constantly swapping between testing against a local Docker install (which either redirects traffic to my dev machine or to a test install inside Docker) and my Kubernetes cluster in Google Cloud. I noticed I was having the same issue toward Docker, so I tested using my Node.js API and the new Rust API, and BOTH had issues, but only toward Docker… After way too much Googling and too many ChatGPT conversations, I came up with the settings below. Add these under command on the traefik service in your Docker Compose file, and the problem goes away. I have also added them (commented out) in all the Docker files in the Docker repo.

      # Timeout settings for gRPC
      - "--entrypoints.websecure.transport.respondingTimeouts.readTimeout=90000s"
      - "--entrypoints.websecure.transport.respondingTimeouts.writeTimeout=90000s"
      - "--entrypoints.websecure.transport.lifeCycle.requestAcceptGraceTimeout=90000s"

Damn it … after writing this, I saw it only "bumped" the timeout up to 15 minutes … so it's not completely fixed yet, but it seems we are getting closer at least. After reading this, I am now testing with a value of 0 for all 3.
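
For anyone copy-pasting along, that is the same three flags with the value 0 (which, per the Traefik docs, should mean no timeout for the responding timeouts):

      # Timeout settings for gRPC, with the timeouts disabled
      - "--entrypoints.websecure.transport.respondingTimeouts.readTimeout=0"
      - "--entrypoints.websecure.transport.respondingTimeouts.writeTimeout=0"
      - "--entrypoints.websecure.transport.lifeCycle.requestAcceptGraceTimeout=0"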

I'm going to stop digging into this … I cannot for the life of me see where/why connections are being dropped after 15 minutes on Docker but not on Kubernetes.
This would suggest the issue is with Traefik, but when I look at the logs I see:

2024-10-19 13:11:59 2024-10-19T11:11:59Z DBG log/log.go:245 > 2024/10/19 11:11:59 reverseproxy.go:663: httputil: ReverseProxy read error during body copy: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug="max_age"
2024-10-19 13:11:59 2024-10-19T11:11:59Z DBG github.com/traefik/traefik/v3/pkg/middlewares/recovery/recovery.go:38 > Request has been aborted [192.168.65.1:47682 - /openiap.FlowService/SetupStream]: net/http: abort Handler middlewareName=traefik-internal-recovery middlewareType=Recovery

This suggests that the Node.js server is gracefully closing the connection and Traefik is just relaying that information to the client. But when I’m testing, I’m pointing my Docker Traefik at the SAME API on the exact same pod. So that would suggest the same server is behaving differently depending on what Traefik instance it’s talking to, which makes no sense to me.

I guess we just have to “live with that” for now and use Kubernetes or WebSockets as an alternative.

I need to remove all my debug code from Rust, clean up some functions, and then I will push version 0.0.7.

I have now pushed version 0.0.7.

  • I have cut down memory usage by 3/4 (it was crazy high before).
  • I have improved the retry logic a lot (and made it work the same across websockets and gRPC).
  • The Rust client now has ALL protocol functions added.
  • More functions have been mapped to Node.js/Python. .NET is a little behind, but hopefully I will have mapped all functions from the Rust client to Node.js, Python, and .NET by 0.0.8.

I had to “kill” the network-usage monitoring (this was the thing using a ton of RAM). I'm a little undecided on whether I should keep chasing a solution for this or just accept only monitoring the number of function calls, packages, CPU, and memory usage.

1 Like

And I found the “solution” … Traefik “broke” gRPC in newer versions, so simply using an older image fixes this.
I don't know exactly when it broke, but I just tested this version, and it seems to work just fine with gRPC without disconnecting every 15 minutes.
Simply update image: traefik to image: traefik:v2.10.6 in your Docker Compose file.
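
In compose terms (assuming your service is simply called traefik), that is just:

  traefik:
    # pin the image instead of tracking the latest tag
    image: traefik:v2.10.6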

Edit: I have had a client connected for 6 hours now, without any disconnects, after downgrading the Traefik image to 2.10.6 … so that is definitely the solution right now. Maybe someone should create an issue on their GitHub, or buy a support subscription to raise awareness around this.

Ok, I've been testing a ton of different tools for tracking memory in Rust (that is one thing .NET got right), and I managed to fix the memory leak and decrease the memory usage massively. Here are a Rust client, a Node.js client, a .NET client, and a Python client, all spamming pop workitem requests over 2 days.

I have also managed to get all the missing functions from the API added to all 4 languages. For .NET, I also made a little workaround for serializing results. If you just want the JSON string and can parse that faster yourself, use:

string results = await client.Query<string>("entities", "{}", "{\"name\": 1}");
client.info("Query results: " + results);

But you can also let the client do it for you. So, if you have your own class (in this case Base), you can do something like this:

var results = await client.Query<List<Base>>("entities", "{}", "{\"name\": 1}");
for (int i = 0; i < results.Count; i++)
{
    client.info(results[i]._id + " " + results[i].name);
}

I hope this makes things a little easier to work with for .NET devs. (I’m open to suggestions on something similar for Python and Node.js)

Rust: crates.io / example project / documentation
Node.js: npmjs.com / example project
.NET: nuget.org / example project
Python: pypi.org / example project

1 Like

FYI - running on 0.0.13 now.
The memory usage does look better.
The query change now requires specifying return types, but that was just a small adjustment.
CPU usage so far looks to be a tiny bit higher than on the earlier version (a really slight difference), but keeping the memory usage down definitely seems worth it.
I'll keep it running for a couple of days and report back around the end of the work week, or sooner if something pops up.

1 Like

(screenshot: mapping usage)

Christ, these past few weeks have been incredibly stressful but also full of learning as I delved into many new tools for tracking memory allocation and usage. I’ve explored dotnet-counters, dotnet-dump, heaptrack, jemalloc, valgrind, and tons of new tricks in gdb.

I personally believe I’m reasonably skilled at tracking and fixing memory leaks in .NET and Node.js, but this was quite a journey.

Rust finally rock stable

Here is finally the result we were looking for:
500,000 calls to pop workitem, with NO memory increase.
This is from a unit test calling the clib functions, so we're right at the boundary between Rust and the other languages.

I made a few mistakes here and there in the code, but that’s no excuse. Memory issues accumulate, and we need this to work consistently, even when running for months or even years.

  • In my Rust client (openiap-client), I was using a oneshot channel to handle results coming back from OpenFlow (similar to how we use TaskCompletionSource in .NET), but I wasn’t releasing it when complete.
  • In my C-exposing layer (openiap-clib), I wasn't releasing memory properly. free_pop_workitem_response wasn't freeing all strings and wasn't handling the release of the actual workitem object (see the sketch below this list). Since our tests weren't returning a workitem, tracking this issue down took much longer.
  • In the .NET client, I had a significant bug where I wasn’t calling free_pop_workitem_response if pop workitem didn’t return a workitem. However, even after fixing that, the above issues were still creating leaks.
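
To make the second bullet concrete, here is a minimal sketch of what a complete free function has to do; the struct layout and field names are illustrative, not the real openiap-clib types:

use std::ffi::CString;
use std::os::raw::c_char;

// Illustrative layout only; the real openiap-clib structs have more fields.
#[repr(C)]
pub struct WorkitemWrapper {
    pub id: *mut c_char,
    pub name: *mut c_char,
}

#[repr(C)]
pub struct PopWorkitemResponseWrapper {
    pub success: bool,
    pub error: *mut c_char,
    pub workitem: *mut WorkitemWrapper,
}

#[no_mangle]
pub unsafe extern "C" fn free_pop_workitem_response(resp: *mut PopWorkitemResponseWrapper) {
    if resp.is_null() {
        return;
    }
    // Take ownership back from the C side so Rust drops it at the end.
    let resp = Box::from_raw(resp);
    if !resp.error.is_null() {
        // Every string handed across the boundary has to be reclaimed.
        drop(CString::from_raw(resp.error));
    }
    if !resp.workitem.is_null() {
        // The nested workitem object (and its strings) must be freed too;
        // forgetting this is exactly the kind of leak described above.
        let wi = Box::from_raw(resp.workitem);
        if !wi.id.is_null() {
            drop(CString::from_raw(wi.id));
        }
        if !wi.name.is_null() {
            drop(CString::from_raw(wi.name));
        }
    }
}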

dotnet

So when firing up my dotnet test code, I was kind of surprised to see this after 1,500,000 pop_workitem requests from dotnet.

I mean, it's not a classic memory leak, but I'm not convinced it's fine either.
Here is the shared+private memory usage:

After sooooo many hours of back and forth with ChatGPT, and testing different things, I tracked it down to an issue around how to release memory for some strings, plus .NET not liking it when you create a lot of delegates. It now looks more like this.
After calling pop_workitem_async 450 times a second for more than 2.5 hours, we see this:


That looks nice and flat, but …

*insert rage-quitting smiley* … for fuck's sake!!!

Well, I have a ton of good learnings I now need to apply to all the functions in both Rust and dotnet.
And I need to keep looking for whatever is still making the dotnet VM keep allocating more RAM than it's using.

2 Likes