4 weeks ago, I decided to learn Rust, and as a way to learn it I set out to recreate all the SDKs for OpenIAP with Rust as the backend.
The reason was to enforce a consistent experience across all programming languages. I had hoped using protobuf “was enough,” but limits on time, skill, and the languages themselves made it far from a perfect experience (Python does not support streaming, so upload/download was not working; .NET was horribly behind since I was never really touching it; Node.js and the browser version are always “out of sync”; etc.).
I can normally “wing it” when learning a new programming language, but after 2 days with Rust, I was ready to kill anyone who talked to me. Goddamn, Rust was giving me a hard time, but after watching and reproducing every example in The Rust Book, I restarted my journey with Rust.
I was sure it would only take 1-2 weeks to get a “basic” version up and running using gRPC, and then I could start adding WebSocket, REST, TCP, and named pipes.
I was wrong… It has been an extremely fun journey, and I learned a lot along the way, but I’m almost done creating a Rust SDK that also works as a “backend” for a Node.js, Python, and .NET SDK (source code here).
I’m going away for a few weeks so I cannot work on it for a while, but I would love it if someone would have a look at it and test it out.
I have finally managed to add all the protobuf functions to Rust.
I have added unit tests for all of them, and have slowly started adding documentation.
The “core”/most-used functions are also added to Node.js/Python and .NET, but I still need to add all of them.
I have pushed an update for all 4 languages as version 0.0.4.
Once I have added all functions to all languages, I will start work on adding support for WS/REST/TCP and named pipes … but for now, this is still gRPC only.
So I finally had some time to test the new API for .NET.
The first thing that jumps out is that unless I’m doing something weird, the API seems to live in the top-level global namespace, which makes it awkward for IntelliSense. It would probably be nice to wrap it in a namespace to avoid ambiguity errors (especially since some of the names are by necessity very simple and common, like Client).
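In the meantime, a using-alias can paper over the collision on my side; a small sketch (the alias name is just something I picked, and I'm assuming the parameterless constructor here):

// Workaround sketch: alias the SDK's global-namespace Client type so it
// does not clash with other Client types pulled in by other packages.
// "OpeniapClient" is just a name I picked for this example.
using OpeniapClient = global::Client;

var client = new OpeniapClient(); // assumption: parameterless constructor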
After reworking the older project to use the new one, when running locally the channel gets closed after 60 seconds, same as with the old one. Not sure if it is some setting in our OpenFlow, but it’s very consistent about it.
And I haven’t found the OnDisconnect equivalent, so that makes it a bit harder to handle this time.
Checking queues...
Retry attempt 1 after delay of 250ms after exception: Client+ClientError: Pop workitem failed: ClientError("Custom Error Client Error channel closed")
I’ll try to test with a published agent, and later on against app.openiap.io (probably the best test to see if it’s just a config issue), and will update here.
Aside from that, it looks nice so far and the functionality as-is seems to work. After I get the disconnection sorted out, I will get it up and running as a background agent for a couple of days to check stability.
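For reference, the retry handling producing the message above is roughly this shape (a simplified sketch of my own wrapper, not part of the SDK; the delays and attempt counts are illustrative):

using System;
using System.Threading.Tasks;

static class RetryHelper
{
    // Simplified sketch of the retry wrapper around SDK calls; not part of the
    // openiap SDK. Waits, then retries with exponential backoff until it either
    // succeeds or runs out of attempts.
    public static async Task<T> WithRetry<T>(Func<Task<T>> action, int maxAttempts = 5)
    {
        var delay = TimeSpan.FromMilliseconds(250);
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await action();
            }
            catch (Exception ex) when (attempt < maxAttempts)
            {
                Console.WriteLine($"Retry attempt {attempt} after delay of {delay.TotalMilliseconds}ms after exception: {ex.Message}");
                await Task.Delay(delay);
                delay *= 2; // back off before the next attempt
            }
        }
    }
}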
I find it very weird that you keep having those 10-minute disconnects.
We should investigate that a little more.
I like your idea of testing against app.openiap.io … if you also have the problem there, the problem is most likely “on your end” and not Traefik or OpenFlow.
Ok, so I checked some more things and:
When running locally against app.openiap.io, it does not disconnect after the 60s, so it must be some configuration that differs.
If I publish the same agent to OpenFlow, it doesn’t initialize the client:
// my openflow
2024-09-10T09:13:02.182932Z INFO connect: openiap_client: Connecting to http://api
Unhandled exception. Client+ClientCreationError: Connection failed: ClientError("Failed to connect: transport error")
// app.openiap.io
Unhandled exception. Client+ClientCreationError: Connection failed: ClientError("Failed to connect: transport error")
2024-09-10T10:51:04.222362Z INFO connect: openiap_client: Connecting to http://api-grpc
Looks like a routing issue when running from Docker?
Also, it’s a bit weird that the API endpoint is different.
Also, as a side note: it looks like when supplying the JWT for authorization, the Signin call is not needed? It seems to be “working fine” with just the .Connect() call, without the .Signin() call.
Wow, thank you for putting time into it.
Sorry, I have not gotten to testing in Docker yet; I’m working on adding telemetry now.
Your logs look correct. On a docker-compose install it should be connecting to host “api”, but on port 50051.
On Kubernetes it should be connecting to host api-grpc and port 80 … both “insecure”, since the TLS layer is added using Traefik.
The plan was to start working on adding WebSocket support after telemetry, but I guess looking into Docker is better?
Ok, that’s useful to know - I can work with that
I worked around it by hardcoding the connection address to http://api:50051 (I can parameterize it for local runs), so feel free to set the priorities as you see fit.
It’s not pretty, but it works.
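Roughly what that looks like (a sketch; the environment-variable name is just something I picked, only http://api:50051 comes from your note above):

// Workaround sketch: hardcode the in-cluster gRPC address, with an optional
// override for local runs. "AGENT_API_URL" is a name made up for this example.
var apiUrl = Environment.GetEnvironmentVariable("AGENT_API_URL");
if (string.IsNullOrEmpty(apiUrl))
{
    apiUrl = "http://api:50051"; // docker-compose: host "api", gRPC port 50051
}
await openflowClient.connect(apiUrl);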
So far, when running as a published agent within my OpenFlow, the connection seems stable (still polling the queues every 1s). I’ll let it run for some time on mocked functionality and see if it stays stable.
EDIT:
The agent has been running for ~20h now, with 2 queue checks every second (~72k check iterations x 2 queues). I don’t see any hiccups in the logs, but we have the reconnection handling, so even if there were some small ones, they didn’t interrupt anything.
So while not the most thorough test ever, it looks to be working better than the old API.
And it just stayed like that; it looks like the build didn’t trigger. And there’s no info on why it restarted.
Previous restart-on-uncaught-error went fine:
2024-09-11T13:11:15.783706931Z Iteration: 84222, 2024-09-11T13:11:15.7836148+00:00
2024-09-11T13:11:15.783754110Z Checking queues...
2024-09-11T13:11:15.789584242Z Nothing, waiting...
2024-09-11T13:11:16.789899922Z Iteration: 84223, 2024-09-11T13:11:16.7897166+00:00
2024-09-11T13:11:16.790210597Z Checking queues...
2024-09-11T13:11:16.808555109Z Process terminated. A callback was made on a garbage collected delegate of type 'openiap!Client+PopWorkitemCallback::Invoke'.
2024-09-11T13:11:17.529973655Z create client
2024-09-11T13:11:17.594138717Z Setting up server stream api:50051
2024-09-11T13:11:17.646151780Z connected!
2024-09-11T13:11:17.997916981Z package myagent-newapi
[build info follows and things start]
This might be more related to the dotnet6 image than the API itself, but it’s a bit hard to distinguish.
Which brings a question - how to monitor agents “automatically”?
For the first one, I need to know where it’s “hanging” to have an idea of what might be wrong and how to fix it.
If you don’t mind A LOT of debug information, you can call
client.enabletracing("openiap=trace", "new");
There is a “bug” (?) with this function: Rust will not allow me to update the logging settings once they are set, so you can only call this function once; every call after that will have no effect.
Then, if you also add a lot of verbose logging in your own code, we can see whether it stops in your code or mine. You can use my client.verbose and client.trace if you want an easy way to enable/disable log messages.
This is an error in my code. I need to use variables for the callbacks so they do not get garbage collected. Luckily it also crashes the process, so the agent will restart when this happens, but I will of course get this fixed.
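For context, the fix is the usual interop pattern of keeping the delegate alive in a field for as long as the native side can still call it; a rough sketch, not the actual binding code:

using System;

// Rough sketch of the pattern (not the actual SDK internals): keep a managed
// reference to the delegate so the GC cannot collect it while native code
// still holds the corresponding function pointer.
static class CallbackHolder
{
    public delegate void PopWorkitemCallback(IntPtr result);

    // Without this field, a delegate created inline would become eligible for
    // garbage collection as soon as the registering call returns.
    private static PopWorkitemCallback? _popCallback;

    public static void Register(PopWorkitemCallback callback)
    {
        _popCallback = callback; // keep it alive
        // ...then hand Marshal.GetFunctionPointerForDelegate(_popCallback) to the native library
    }
}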
A little info about enabletracing: this sets up a tracing filter (it can also be set using the environment variable RUST_LOG).
All “my” code lives within the openiap namespace, so by using openiap=LEVEL you only get tracing information for that; if you want everything, you can simply use LEVEL. The level can be info, warn, debug, or trace. The second parameter is for spans; it defines whether to log every time a function is entered/exited. It can be empty to log nothing; valid values are new, enter, exit, close, active, or full.
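So a couple of example calls based on the above (and remember the note earlier: only the first call takes effect):

// Trace everything in the openiap crates and log every time a span is created:
client.enabletracing("openiap=trace", "new");

// Or debug level for everything, with no span logging:
// client.enabletracing("debug", "");

// The filter can also be set with the RUST_LOG environment variable instead.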
This is going to be a bit tough. Right now the agent has been running since Friday (~5 days) without a hiccup, so I’m blaming cosmic rays at this point.
If this occurs again, I’ll try setting up a randomly failing agent to force more frequent restarts (the issue occurred on container restart, just before the build step).
To summarize a bit:
For now this seems much more stable than the old api package, and the few hiccups it had could be attributed to anything in between on the call stack (and I’m definitely not saying that the way I’m using the API is optimal, or that my part of the code is perfect in any way).
The small API-call hiccups can be addressed with retry logic (which is standard for remote calls anyway), and the build hang has happened exactly once so far, so it’s not a big deal.
So from my side, I’m going to focus more on the maintainability of agents now → monitoring, logging, start/stop signals etc., as mentioned in the other topic.
Ok, the hang is something we need to check on our side. It has happened again on rebuild, and it looks like it’s a container restart. The timing is also almost exactly the same as last time (off by a couple of seconds). So it may be unrelated to OpenFlow itself, and more about how it’s set up.
Will post here if we find something useful.
Ok, the end result is that it’s a bit of a combination of both.
The hang happens when the whole Docker setup restarts. It looks like the agent comes online earlier than OpenFlow, and it doesn’t time out somewhere; I’m not exactly sure where.
So I think, without modifying the dotnet6 Docker image, this will need to remain a known issue of (at least this type of) agents: they can hang on startup if OpenFlow isn’t fully up yet when they try to fetch the package.
Found the cause of the restarts, and it is related to the agent. But I’m unsure if it’s related to the API library, the way I’m using it, or both.
What happens is that after this error:
Attempting reconnect after ClientError: Pop workitem failed: ClientError("Custom Error Client Error channel closed")
// handling code:
if (ex is Client.ClientError)
{
    Console.WriteLine("Attempting reconnect after ClientError: " + ex.Message);
    await openflowClient.connect(apiUrl);
}
and the reconnection, CPU usage goes up massively (it looks like it fully occupies 1 vCPU every time). After this happens a couple of times and all the vCPUs get hogged down, Docker dies, and things go back to normal after the restart.
So, I’m not sure what to do about it now. It looks like an orphaned-object problem, but without the GC, since it’s on the native side.
I think explicitly disposing of the client could help, but it’s hard to test since I can’t “force” a channel closing for that purpose.
Since the client is passed inwards, it could be passed by ref instead, so that it can be “replaced” within the retry helper if needed, but I’m not sure if that’s a good idea or whether it would just lead to more problems.
I’m open to suggestions and/or debugging this further, but for now all I can conclude is that trying to reconnect as shown above seems to be a bad idea.
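For reference, the “dispose and recreate” idea would look roughly like this; I haven’t verified that the client actually exposes Dispose, or that this is the right way to construct a fresh one, so treat those parts as assumptions:

// Sketch only: instead of calling connect() again on the same instance,
// drop the old client and build a fresh one. Whether Client implements
// IDisposable (and frees the native side) is an assumption here.
if (ex is Client.ClientError)
{
    Console.WriteLine("Recreating client after ClientError: " + ex.Message);
    (openflowClient as IDisposable)?.Dispose(); // assumption: releases native resources
    openflowClient = new Client();              // assumption: parameterless constructor
    await openflowClient.connect(apiUrl);
}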
It’s in the lib, not you. And I’m sure it has been fixed. The version here is VERY stable, and it also handles reconnecting now; that was not a thing before.
I will see if I can push it tonight or tomorrow.
I have now pushed version 0.0.5 to
The client now supports WebSockets (using ws:// or wss:// as the apiurl).
The clients now support subscribing to “connected”/“disconnected” events. The client will now auto-reconnect if it loses the connection (when using gRPC it can take a LONG time to detect, so you might get an error on the first call; it will then instantly try to reconnect).
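A rough sketch of how an agent can use this (the event-hookup syntax may differ between the language wrappers; treat the member names below as placeholders, only the ws:// / wss:// scheme is the confirmed part):

// Sketch: connect over WebSockets just by changing the URL scheme.
await openflowClient.connect("wss://your.openflow.host");

// Hypothetical event hookup; the exact member names in the .NET wrapper may
// differ, this only illustrates the connected/disconnected events.
openflowClient.on_connected += () => Console.WriteLine("connected");
openflowClient.on_disconnected += (reason) => Console.WriteLine("disconnected: " + reason);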
Ok, it works from outside, but when publishing to OpenFlow (Docker), I can’t get it to run.
This is with the override to port 50051:
Logs1 - http://api:50051
Starting XYZ Agent, rust-api edition...
2024-10-11T08:32:08.438667Z DEBUG openiap_clib: Tracing enabled
2024-10-11T08:32:08.438691Z INFO openiap_clib: enable_tracing rust_log: "openiap=trace", tracing: "new"
2024-10-11T08:32:08.439558Z INFO connect_async: openiap_clib: new
2024-10-11T08:32:08.439579Z DEBUG connect_async: openiap_clib: connect_async
2024-10-11T08:32:08.439586Z INFO connect_async:c_char_to_str: openiap_clib::safe_wrappers: new
2024-10-11T08:32:08.439593Z DEBUG connect_async: openiap_clib: server_address = "http://api:50051"
2024-10-11T08:32:08.439598Z INFO connect_async:safe_wrapper: openiap_clib::safe_wrappers: new
2024-10-11T08:32:08.439605Z TRACE connect_async: openiap_clib: Spawn the async task
2024-10-11T08:32:08.439616Z INFO connect_async:connect: openiap_client: new
2024-10-11T08:32:08.439954Z INFO connect_async:connect:connect_async: openiap_client: new
2024-10-11T08:32:08.439997Z INFO connect_async:connect:connect_async:is_connect_called: openiap_client: new
2024-10-11T08:32:08.440024Z INFO connect_async:connect:connect_async: openiap_client: Connecting to http://api:50051
2024-10-11T08:32:08.441267Z ERROR connect_async:connect:connect_async: openiap_client: Failed to get config: Connection refused (os error 111)
2024-10-11T08:32:08.443119Z TRACE openiap_clib: Client::Calling callback with result
Client+ClientCreationError: Connection failed: ClientError("Failed to connect to WS: URL error: URL scheme not supported")
at Program.InitializeOpeniapClient() in /app/package/package/main.cs:line 119
2024-10-11T08:32:08.465242Z INFO connect_async: openiap_clib: new
2024-10-11T08:32:08.465251Z DEBUG connect_async: openiap_clib: connect_async
2024-10-11T08:32:08.465281Z INFO connect_async:c_char_to_str: openiap_clib::safe_wrappers: new
2024-10-11T08:32:08.465288Z DEBUG connect_async: openiap_clib: server_address = ""
2024-10-11T08:32:08.465293Z INFO connect_async:safe_wrapper: openiap_clib::safe_wrappers: new
2024-10-11T08:32:08.465315Z TRACE connect_async: openiap_clib: Spawn the async task
2024-10-11T08:32:08.465321Z INFO connect_async:connect: openiap_client: new
thread 'tokio-runtime-worker' panicked at /home/allan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.2/src/runtime/blocking/shutdown.rs:51:21:
Cannot drop a runtime in a context where blocking is not allowed. This happens when a runtime is dropped from within an asynchronous context.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'tokio-runtime-worker' panicked at library/core/src/panicking.rs:221:5:
panic in a function that cannot unwind
stack backtrace:
0: 0x75f97f747835 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h1b9dad2a88e955ff
1: 0x75f97f76f66b - core::fmt::write::h4b5a1270214bc4a7
2: 0x75f97f7441bf - std::io::Write::write_fmt::hd04af345a50c312d
3: 0x75f97f748b31 - std::panicking::default_hook::{{closure}}::h96ab15e9936be7ed
4: 0x75f97f74880c - std::panicking::default_hook::h3cacb9c27561ad33
5: 0x75f97f749191 - std::panicking::rust_panic_with_hook::hfe205f6954b2c97b
6: 0x75f97f748fc3 - std::panicking::begin_panic_handler::{{closure}}::h6cb44b3a50f28c44
7: 0x75f97f747cf9 - std::sys::backtrace::__rust_end_short_backtrace::hf1c1f2a92799bb0e
8: 0x75f97f748c84 - rust_begin_unwind
9: 0x75f97f0301e5 - core::panicking::panic_nounwind_fmt::h4c4dc67d0bbc166c
10: 0x75f97f030272 - core::panicking::panic_nounwind::hb98133c151c787e4
11: 0x75f97f0303b6 - core::panicking::panic_cannot_unwind::he9511e6e72319a3e
12: 0x75f97f03a25a - connect_async
13: 0x75f90ac54ac9 - <unknown>
thread caused non-unwinding panic. aborting.
Looks like this is the key info:
Client+ClientCreationError: Connection failed: ClientError("Failed to connect to WS: URL error: URL scheme not supported")
This is with a standard connect call (parameterless):
Logs2 - parameterless
Starting XYZ Agent, rust-api edition...
2024-10-11T08:37:17.898405Z DEBUG openiap_clib: Tracing enabled
2024-10-11T08:37:17.898427Z INFO openiap_clib: enable_tracing rust_log: "openiap=trace", tracing: "new"
2024-10-11T08:37:17.899326Z INFO connect_async: openiap_clib: new
2024-10-11T08:37:17.899345Z DEBUG connect_async: openiap_clib: connect_async
2024-10-11T08:37:17.899352Z INFO connect_async:c_char_to_str: openiap_clib::safe_wrappers: new
2024-10-11T08:37:17.899360Z DEBUG connect_async: openiap_clib: server_address = ""
2024-10-11T08:37:17.899365Z INFO connect_async:safe_wrapper: openiap_clib::safe_wrappers: new
2024-10-11T08:37:17.899371Z TRACE connect_async: openiap_clib: Spawn the async task
2024-10-11T08:37:17.899376Z INFO connect_async:connect: openiap_client: new
2024-10-11T08:37:17.899695Z INFO connect_async:connect:connect_async: openiap_client: new
2024-10-11T08:37:17.899804Z INFO connect_async:connect:connect_async:is_connect_called: openiap_client: new
2024-10-11T08:37:17.899847Z INFO connect_async:connect:connect_async: openiap_client: Connecting to http://api
2024-10-11T08:37:17.901002Z ERROR connect_async:connect:connect_async: openiap_client: Failed to get config: Connection refused (os error 111)
2024-10-11T08:37:17.902116Z TRACE openiap_clib: Client::Calling callback with result
Unhandled exception. Client+ClientCreationError: Connection failed: ClientError("Failed to connect: transport error")
at Program.InitializeOpeniapClient() in /app/package/package/main.cs:line 125
at Program.Main(String[] args) in /app/package/package/main.cs:line 18
at Program.<Main>(String[] args)
Here it looks like the same wrong-port issue as before.
Help me out: did we “override” the API URL? … if so, can you change it to
ws:// or grpc://
I think the problem is that it says http, so it does not know which protocol to use (WebSockets or gRPC).
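So if you keep the override, something like this should be unambiguous (host and port are the docker-compose defaults from earlier; the ws variant is untested here):

// Make the protocol explicit so the client knows what to speak:
await openflowClient.connect("grpc://api:50051"); // gRPC, docker-compose defaults
// or, for WebSockets (port/path for the docker-compose setup not confirmed in this thread):
// await openflowClient.connect("ws://api");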