Asynchronous scraping of RSS feeds


#1

Hi, I’m using the scrape-parser package by bobbigmac on atmospherejs to get scraped and parsed data from an RSS feed. The scraping is obviously done on the server side, and clients have the option to trigger it to get the results. So basically I have this one line in my server-side code which does all the scraping.

ScrapeParser.get('http://www.scrapethiswebsite.com');

However, sometimes there are a bunch of pages to scrape, so this gets called a few times in a row and the calls get queued up, each request waiting for the previous one to scrape and return its results. I figured out that placing this line in front of the above-mentioned call allows the scraper functions to start at the time they are called, without waiting for each other to finish:

this.unblock();
ScrapeParser.get('http://www.scrapethiswebsite.com');
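For context, the two lines above only make sense inside a server-side Meteor method, since `this.unblock()` is bound to the method invocation. A minimal sketch of that shape (the method name `scrapeFeed` is a placeholder, and the stubs below stand in for the real `Meteor` and `ScrapeParser` globals so the shape can be run outside a Meteor app):

```javascript
// Stubs so this sketch runs outside Meteor; in a real server file,
// Meteor and ScrapeParser (from the package) are already available.
const ScrapeParser = { get: (url) => ({ url, items: [] }) };
const Meteor = {
  methods(defs) { Meteor._methods = defs; } // stub of Meteor.methods
};

Meteor.methods({
  scrapeFeed(url) {
    // In a real method, this.unblock() lets the next call from the same
    // client start immediately instead of waiting for this one to return.
    this.unblock();
    return ScrapeParser.get(url);
  }
});

// Simulated invocation: Meteor binds `this` to an invocation object
// that provides unblock(), which we fake here.
const result = Meteor._methods.scrapeFeed.call(
  { unblock() {} },
  'http://www.scrapethiswebsite.com'
);
```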

So far so good: they all start at independent times and do their work. But when the data comes back, the responses are again processed in queue fashion. I used Kadira to monitor this and saw that the calls go out pretty much asynchronously, but the results are processed synchronously, causing a bottleneck: the results appear one after another instead of all together, quickly. What is worse, while the server is busy processing one client’s scrape, any other client who issues a scrape request in the meantime has to wait for someone else’s scraping to finish.
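The difference I’m seeing can be illustrated with a standalone Node sketch (this is not scrape-parser’s API, just a simulation with a fake delay): when each result is awaited before the next request proceeds, total time grows with the number of requests, whereas fully concurrent handling takes roughly the time of a single request.

```javascript
// Stand-in for a network scrape; resolves after `ms` milliseconds.
function fakeScrape(url, ms) {
  return new Promise((resolve) => setTimeout(() => resolve(`data from ${url}`), ms));
}

// One-by-one: each await blocks the next request, like a queued method.
async function sequential(urls) {
  const results = [];
  for (const url of urls) {
    results.push(await fakeScrape(url, 100));
  }
  return results;
}

// Concurrent: all requests start immediately, results collected together.
async function concurrent(urls) {
  return Promise.all(urls.map((url) => fakeScrape(url, 100)));
}

const urls = ['a', 'b', 'c'];
(async () => {
  let t = Date.now();
  await sequential(urls);
  console.log(`sequential: ~${Date.now() - t}ms`); // roughly 300ms
  t = Date.now();
  await concurrent(urls);
  console.log(`concurrent: ~${Date.now() - t}ms`); // roughly 100ms
})();
```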

I still don’t understand what is going on under the hood, so if anyone could enlighten me I would be very grateful, and if you could also propose a solution, even better. I would hate to need multiple servers to process these requests; hopefully it is possible with only one.