Galaxy runs out of memory on large aggregations

I’m running an aggregation that returns a large amount of data (around 30,000 records). This causes the Galaxy server to crash with an out-of-memory error.
How can I send these large chunks of data to the client without overloading the server’s memory?
I would like to serve the data as an Excel/CSV file to the client. It currently works well for smaller aggregation volumes. The data is updated regularly, so it cannot be a static file.

Write it to a local storage file.

So pull back 1,000 records at a time, then APPEND them to a file. On Galaxy, you get a 500 MB partition in /tmp.

So, pull the ID of your customer, 1209348ffasd.xls > foreach(row){ write:"1,2,3,4,5 …

Then show the user a circle spinner (Generating Report)

Once it’s done, have it download.

If you go over 500 MB, you’ll need an Amazon S3 account to write to, in a similar but definitely more complicated manner.
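
Something like this rough sketch of the batch-and-append idea (a sketch only: I’m assuming rawCollection() with $skip/$limit since I don’t know which aggregate helper you use, and YourCollection, the field names and the file name are all placeholders):

import { Meteor } from 'meteor/meteor';
import { Mongo } from 'meteor/mongo';
import fs from 'fs';

const BATCH = 1000;
const YourCollection = new Mongo.Collection('yourCollection'); // placeholder for whatever you aggregate over

Meteor.methods({
    async generateCsvReport (pipeline) {
        // Galaxy gives you roughly 500 MB of scratch space under /tmp
        const filePath = '/tmp/report-' + Date.now() + '.csv';
        fs.writeFileSync(filePath, 'fieldA,fieldB,fieldC\n'); // header row, placeholder field names

        let skip = 0;
        while (true) {
            // pull back only BATCH records per pass instead of the whole result
            const rows = await YourCollection.rawCollection()
                .aggregate([...pipeline, { $skip: skip }, { $limit: BATCH }])
                .toArray();
            if (rows.length === 0) break;

            // append this chunk, then let it fall out of scope before the next pass
            const lines = rows.map((r) => [r.fieldA, r.fieldB, r.fieldC].join(',')).join('\n');
            fs.appendFileSync(filePath, lines + '\n');
            skip += BATCH;
        }
        return filePath; // hand this to whatever serves the download
    }
});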

Ok, sounds like an option, but how do I pull back only 1,000 records at a time from the aggregation?
I could use a cursor over the result and effectively ‘page’ through the aggregation output, but that means running the aggregation 30 times, plus additional logic for the paging algorithm, which can get tricky since we run all sorts of aggregations.
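
Roughly, what I mean by paging is wrapping every pipeline we run with $skip/$limit stages, something like this (sketch only):

const PAGE_SIZE = 1000;

// wrap an arbitrary aggregation pipeline so it returns page N only
const pagedPipeline = (pipeline, page) => [
    ...pipeline,
    { $skip: page * PAGE_SIZE },
    { $limit: PAGE_SIZE },
];

// e.g. ClientData.aggregate(pagedPipeline(agQuery.pipeline, 0)), then page 1, 2, …
// every page re-runs the whole pipeline, which is exactly the cost I'm worried about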

So am I right in thinking that the problem is that one variable holding the aggregation result is too large (and thus takes too much memory)? If so, I could actually save a lot by looking at how variables are assigned during processing, since I might in fact be assigning the result more than once.

Here is my algorithm thus far, in a server Meteor method:

import XLSX from 'xlsx'; // SheetJS, needed for aoa_to_sheet / write below

const jsonToExcel = (jsonData) => {
    function sheet_to_workbook (sheet, opts) {
        let n = opts && opts.sheet ? opts.sheet : 'Sheet1';
        let sheets = {}; sheets[n] = sheet;
        return { SheetNames: [n], Sheets: sheets };
    }

    function aoa_to_workbook (data, opts) {
        return sheet_to_workbook(XLSX.utils.aoa_to_sheet(data, opts), opts);
    }

    // Flatten the JSON documents into an array of arrays: a header row of
    // field names followed by one row per document.
    function data_to_arrayOfArrays (jsonD) {
        if (!jsonD[0])
            return [];
        let fieldnames = [];

        let data = [];
        jsonD.forEach((obj) => {
            for (let key in obj) {
                // only handle flat primitive data
                if (typeof obj[key] === 'boolean' || typeof obj[key] === 'string' || typeof obj[key] === 'number' || obj[key] instanceof Date) {
                    fieldnames.push(key);
                }
            }
        });
        fieldnames = [...new Set(fieldnames)]; // dedupe once after scanning all documents
        data.push(fieldnames);
        jsonD.forEach((obj) => {
            let row = [];
            fieldnames.forEach((key) => {
                // check for undefined/null rather than truthiness so that
                // legitimate values like 0 and false are not dropped
                if (obj[key] !== undefined && obj[key] !== null) {
                    row.push(obj[key]);
                } else {
                    row.push(null);
                }
            });
            data.push(row);
        });
        return data;
    }

    let flatdata = data_to_arrayOfArrays(jsonData);
    let wb = aoa_to_workbook(flatdata); // wb will be a workbook with one sheet aligning with the data
    let wopts = { bookType: 'xlsx', bookSST: false, type: 'binary' };
    let wbout = XLSX.write(wb, wopts);
    return wbout;
};

Meteor.methods({
    runAggregationQuery (agQuery) {
        let data = ClientData.aggregate(agQuery.pipeline);
        let filedata = jsonToExcel(data);
        return { collection: queryid, filedata: filedata };
    }
});
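
(For context, on the client I then build the download from the returned binary string, roughly like this; it's a sketch, saveAs comes from the file-saver package and 'report.xlsx' is a placeholder name:)

import { Meteor } from 'meteor/meteor';
import { saveAs } from 'file-saver';

const downloadReport = (agQuery) => {
    Meteor.call('runAggregationQuery', agQuery, (err, res) => {
        if (err) return console.error(err);
        // XLSX.write(..., type: 'binary') returns a binary string,
        // so convert it to an ArrayBuffer before building the Blob
        const buf = new ArrayBuffer(res.filedata.length);
        const view = new Uint8Array(buf);
        for (let i = 0; i < res.filedata.length; i++) {
            view[i] = res.filedata.charCodeAt(i) & 0xff;
        }
        saveAs(new Blob([buf], { type: 'application/octet-stream' }), 'report.xlsx');
    });
};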

So I’m thinking that, to optimise, I can for a start write the Excel file to S3 and send the download link to the client.
I do not have a lot of experience with memory optimisation, but in this code, what would be more memory-efficient approaches to consider?

For instance, if I do data = jsonToExcel(data); instead of let filedata = jsonToExcel(data);, will that use less memory?
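
The S3 version would, I think, look something like this (sketch; it assumes the aws-sdk v2 package with credentials already configured, 'my-report-bucket' is a placeholder, and I'd switch XLSX.write to type: 'buffer' so the output can be handed straight to S3):

import AWS from 'aws-sdk';
import XLSX from 'xlsx';

const s3 = new AWS.S3();

const uploadReportToS3 = async (wb, key) => {
    // write the workbook to a Buffer instead of a binary string
    const buf = XLSX.write(wb, { bookType: 'xlsx', bookSST: false, type: 'buffer' });

    await s3.putObject({
        Bucket: 'my-report-bucket', // placeholder bucket name
        Key: key,
        Body: buf,
        ContentType: 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
    }).promise();

    // return a temporary download link to send to the client instead of the file itself
    return s3.getSignedUrl('getObject', { Bucket: 'my-report-bucket', Key: key, Expires: 3600 });
};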

Looks complicated and scary, you must feel super lucky you get to write it.

What I’m thinking is that, since this is JS and not a server-side system like PHP, you should be able to use setTimeout() to slow your script down and work in 1k-record segments…
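
Rough sketch of what I mean (records, processChunk and done are placeholders; the point is just to yield between 1k-record segments):

import { Meteor } from 'meteor/meteor';

const CHUNK = 1000;

const processInChunks = (records, processChunk, done) => {
    let i = 0;
    const step = () => {
        processChunk(records.slice(i, i + CHUNK)); // handle one 1k segment
        i += CHUNK;
        if (i < records.length) {
            Meteor.setTimeout(step, 50); // give the event loop room to breathe between chunks
        } else {
            done();
        }
    };
    step();
};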

So looks like data_to_arrayOfArrays() is your primary entry point. jsonD must mean jsonData then.

I then imagine you have a subscription model in place to retrieve that data from the server.

Hrm. I wonder if you can do this SERVER SIDE in that case, and because you’re not sending data, maybe you get lucky and it doesn’t break your site. But maybe this is already server side and doesn’t matter anyway…

In my professional opinion, and what I would do, is as I mentioned earlier: you’re getting too much data and locking up the system. Slow it down, do it in chunks.