Hi @Allan_Zimmermann
I have an automation that needs to scrape data from a website for 50,000 records or more. The initial scraping works fine, but after that the automation gets stuck. The system memory and CPU consumed by OpenRPA climb to 95% or above, and, importantly, no other process is running in the background except OpenRPA and Chrome. On top of that, OpenRPA frequently demands an authentication token even though I am already logged in to OpenFlow.
As you can see in the image below, it opens multiple tabs. Can you please suggest how we can handle this, or what the reason might be?
You need to figure out why OpenRPA is getting disconnected. Check the OpenFlow and/or OpenRPA logs.
It’s impossible to say why your code is using a lot of RAM … are the 50,000 items on a single page, or are you paging? Are you remembering to save the data before moving on to the next page, or are you trying to keep it all in RAM?
The process opens a single page URL, saves the extracted data to a SQL DB, and after completing extraction from that page it opens the next page in the same tab. This runs in a loop; we are getting data from 50,000 different pages. The process runs fine for the first 2-3 hours, then it starts slowing down, and within 10 hours it throws an error or OpenRPA hangs.
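For reference, the per-page pattern is roughly the following (a minimal sketch in puppeteer terms, with a hypothetical saveToDb() helper and a placeholder selector; the actual workflow is built from OpenRPA activities, not this code):

```typescript
import puppeteer from "puppeteer";

// Hypothetical persistence helper standing in for the SQL insert,
// e.g. INSERT INTO pages (url, data) VALUES (?, ?)
async function saveToDb(url: string, record: string): Promise<void> {}

async function scrapeAll(urls: string[]) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage(); // one tab, reused for every URL

  for (const url of urls) {
    await page.goto(url, { waitUntil: "networkidle2" });
    // "#data" is a placeholder selector for the content being scraped
    const record = await page.evaluate(
      () => document.querySelector("#data")?.textContent ?? ""
    );
    await saveToDb(url, record); // persist immediately; keep nothing in RAM
  }
  await browser.close();
}
```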
I checked the logs; the logging shows: [Warning] Message a9047c79-df9b-4739-b34f-44db6c409cfc (insertorupdateone) state loaded timed out, retrying state:
Overall, the process runs for 24 hours and scrapes data from 50,000 pages/transactions.
A couple of things come to mind without seeing the code:
1. Have you checked for memory clean-up? It looks like you’re slowly accumulating garbage in memory that is not being GC’d (so either it’s the runner itself, or you’re keeping references to some of the data without realizing it).
a. I’d start by looking at the workflow split. While GC with Workflow Foundation can be finicky, workflow boundaries do often help with not keeping more in memory than you need.
2. Have you tried restarting the browser every N pages? It’s a workaround, but it should help with Chrome accumulating memory over time (see the sketch below).
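Not OpenRPA-specific, but the recycling pattern in puppeteer terms would look roughly like this (a minimal sketch; RESTART_EVERY and the extraction step are placeholders):

```typescript
import puppeteer, { Browser, Page } from "puppeteer";

const RESTART_EVERY = 500; // N pages per browser instance; tune as needed

async function scrapeWithRestarts(urls: string[]) {
  let browser: Browser | null = null;
  let page: Page | null = null;

  for (let i = 0; i < urls.length; i++) {
    // Recycle the whole browser process every N pages to release
    // whatever memory Chrome has accumulated so far.
    if (i % RESTART_EVERY === 0) {
      if (browser) await browser.close();
      browser = await puppeteer.launch();
      page = await browser.newPage();
    }
    await page!.goto(urls[i], { waitUntil: "networkidle2" });
    // ... extract and persist the record here, as in the loop above ...
  }
  if (browser) await browser.close();
}
```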
All that aside, depending on the specifics of the use case (whether pages need a login, whether there’s UI navigation, etc.), running a headless browser (or just issuing simple GETs against the addresses, or using puppeteer) might be a much more performant solution; see the sketch below.
RPA is great when you need finer control, but for mass website scraping it’s much slower than “traditional” web crawlers.
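For pages that don’t need a real browser at all, plain HTTP GETs are even cheaper. A minimal sketch, assuming the data is present in the raw HTML (no JavaScript rendering needed) and a hypothetical extractRecord() parser:

```typescript
// Crawl via plain GETs, no browser at all (requires Node 18+ for fetch).
async function crawl(urls: string[]) {
  for (const url of urls) {
    const res = await fetch(url);
    if (!res.ok) continue; // skip (or retry) failed pages as needed
    const html = await res.text();
    const record = extractRecord(html);
    // persist the record immediately, as in the loops above
  }
}

// Placeholder parser; real code would use an HTML parser such as cheerio.
function extractRecord(html: string): string {
  return html;
}
```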
We have checked for the memory issue. Additionally, we restart the browser periodically during scraping and clear the cache as well as the browsing history.
There is always a cause, and while it is possible it could be OpenRPA itself, it’s really hard to say without details.
I don’t suppose you could share your implementation details?
Other than that, the remaining option is analyzing memory dumps to see what is holding onto the memory.