High-scale parallel processing, looking for a better solution

I have a use case to process big SQL tables / XLS / TXT files in batches of N rows to generate their nodes and relations.

Planned logic in Go:

Input is an array of rows like this:

       ColumnA     ColumnB          ColumnC          ColumnD
        C1           C2               C31               C41
        C1           C2               C31               C42
        C1           C2               C32               C43

Logic: for each row, generate nodes and relations in the form of RDF triples. Each cell will be a node. Which columns have relations between them is decided by an input JSON.

For example, for the first row these could be the triples:

id/c1   <name>        "C1" 
id/c2   <name>        "C2"
id/c31  <name>        "C31" 
id/c41  <name>        "C41" 
id/c1   <Relation1>    id/c2
id/c2   <Relation2>    id/c41
id/c31  <Relation3>    id/c41   
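
To make the planned logic concrete, here is a minimal, untested sketch of how I imagine the config-driven triple generation. The RelationDef shape, its field names, and the way node IDs are derived from cell values are only placeholders, not a finished design:

    package main

    import (
        "fmt"
        "strings"
    )

    // RelationDef is one entry of the input JSON: it names the
    // relation to create between two columns. The shape is a
    // placeholder, not a finished design.
    type RelationDef struct {
        FromColumn string `json:"fromColumn"`
        ToColumn   string `json:"toColumn"`
        Relation   string `json:"relation"`
    }

    // rowToTriples emits a <name> triple per cell plus one triple
    // per configured relation. Node IDs are derived from cell
    // values here purely for illustration.
    func rowToTriples(row map[string]string, defs []RelationDef) []string {
        id := func(v string) string { return "id/" + strings.ToLower(v) }
        var triples []string
        for _, v := range row {
            triples = append(triples, fmt.Sprintf("%s <name> %q", id(v), v))
        }
        for _, d := range defs {
            triples = append(triples, fmt.Sprintf("%s <%s> %s",
                id(row[d.FromColumn]), d.Relation, id(row[d.ToColumn])))
        }
        return triples
    }

    func main() {
        row := map[string]string{
            "ColumnA": "C1", "ColumnB": "C2", "ColumnC": "C31", "ColumnD": "C41",
        }
        defs := []RelationDef{
            {FromColumn: "ColumnA", ToColumn: "ColumnB", Relation: "Relation1"},
            {FromColumn: "ColumnB", ToColumn: "ColumnD", Relation: "Relation2"},
            {FromColumn: "ColumnC", ToColumn: "ColumnD", Relation: "Relation3"},
        }
        for _, t := range rowToTriples(row, defs) {
            fmt.Println(t)
        }
    }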

For speed, I will fire K goroutines to process rows in parallel.

I want K and N to be configurable per the hardware the process runs on (10k, 50k, or more). The hardware will generally be a 32 GB to 64 GB machine.

Problem:

When processing a huge table, where rows can number in the few millions, the run will generate a very large output file, because many duplicate triples repeat across rows due to the nature of the data.

The simplest option to avoid duplicates would be to maintain a global map[string]struct{} across batches, with the triple string as the key, and skip a triple whose key already exists. But with multiple goroutines, synchronizing that map will become a bottleneck, and the global map may not fit in memory at this volume.
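
For reference, a sharded variant of this map (hash the triple to pick a shard, each shard with its own mutex) would ease the lock contention, but not the memory problem. An untested sketch of what I mean:

    package main

    import (
        "fmt"
        "hash/fnv"
        "sync"
    )

    const shardCount = 256

    // shardedSet spreads keys over independently locked shards so
    // goroutines rarely contend on the same mutex. It does NOT
    // solve the memory problem: every key still lives in RAM.
    type shardedSet struct {
        shards [shardCount]struct {
            mu   sync.Mutex
            seen map[string]struct{}
        }
    }

    func newShardedSet() *shardedSet {
        s := &shardedSet{}
        for i := range s.shards {
            s.shards[i].seen = make(map[string]struct{})
        }
        return s
    }

    // add reports true if key was not seen before.
    func (s *shardedSet) add(key string) bool {
        h := fnv.New32a()
        h.Write([]byte(key))
        sh := &s.shards[h.Sum32()%shardCount]
        sh.mu.Lock()
        defer sh.mu.Unlock()
        if _, ok := sh.seen[key]; ok {
            return false
        }
        sh.seen[key] = struct{}{}
        return true
    }

    func main() {
        s := newShardedSet()
        fmt.Println(s.add(`id/c1 <name> "C1"`)) // true: first time
        fmt.Println(s.add(`id/c1 <name> "C1"`)) // false: duplicate
    }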

I am new to Go; please help me with a better solution for this issue. In version 1, I want to stick to a single-machine setup for processing as far as possible, limiting the supported input size if necessary.
Any GitHub reference to code for a matching use case would help.

Thanks,
Naresh

Hi, Naresh,

Can you clarify what you mean by “RDF triples”?

Can you also clarify what the <RelationN> predicates are? I don’t see how to deduce that id/c1 relates to id/c2, but then id/c2 relates to id/c41 and not to id/c31, just from seeing ColumnA through ColumnD. Are these relationships bidirectional or unidirectional?

When you say:

    I want K and N to be configurable per the hardware the process runs on (10k, 50k, or more)

do you mean your batches will be about 10-50k rows, or that you plan on running 10-50k goroutines?

I don’t quite understand yet what you’re trying to do, but here are some ideas that may or may not help:

  • The problem doesn’t sound IO-bound (I might be wrong; I’ll know more once I have more info), so I don’t recommend running a large multiple of goroutines per hardware core until you’ve measured an actual performance improvement.

  • Can you sort the data before your program gets it? You might not need to keep all of these triples in memory if you can sort by ColumnA: as you iterate through the rows and reach a new ColumnA value, you can discard the state for the previous value. Sorting by ColumnA and then ColumnB might shrink it even further. (See the first sketch after this list.)

  • You might be able to fan the batches out to workers that build their own maps and, after finishing their batches, fan in to a consumer that’s responsible for joining the batches up. (See the second sketch after this list.)
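
To make the sorting idea concrete, a minimal sketch: once rows arrive sorted by whatever key you deduplicate on (I’m guessing at the key here), duplicates are adjacent, so you only need to remember the previous key instead of a global map:

    package main

    import "fmt"

    // emitUnique assumes keys arrive already sorted, so duplicates
    // are adjacent and only the previous key needs remembering.
    func emitUnique(sorted []string, emit func(string)) {
        for i, k := range sorted {
            if i == 0 || k != sorted[i-1] {
                emit(k)
            }
        }
    }

    func main() {
        keys := []string{"C1", "C1", "C2", "C31", "C31"}
        emitUnique(keys, func(k string) { fmt.Println(k) }) // C1 C2 C31
    }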
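And a sketch of the fan-out/fan-in idea, with plain strings standing in for your triples: workers deduplicate their own batches with private maps (no locking), and a single merger goroutine owns the cross-batch map alone, so it needs no locking either. Note the merger’s map still holds every unique triple; this removes the locking bottleneck, not the memory one:

    package main

    import (
        "fmt"
        "sync"
    )

    // processBatches fans batches out to nWorkers goroutines, each
    // of which deduplicates its own batch with a private map, then
    // fans the results in to one merger goroutine that owns the
    // cross-batch map alone.
    func processBatches(batches <-chan []string, nWorkers int) <-chan string {
        local := make(chan []string)
        out := make(chan string)

        var wg sync.WaitGroup
        for i := 0; i < nWorkers; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for batch := range batches {
                    seen := make(map[string]struct{}, len(batch))
                    uniq := make([]string, 0, len(batch))
                    for _, t := range batch {
                        if _, ok := seen[t]; !ok {
                            seen[t] = struct{}{}
                            uniq = append(uniq, t)
                        }
                    }
                    local <- uniq
                }
            }()
        }
        go func() { wg.Wait(); close(local) }()

        go func() {
            defer close(out)
            global := make(map[string]struct{})
            for batch := range local {
                for _, t := range batch {
                    if _, ok := global[t]; !ok {
                        global[t] = struct{}{}
                        out <- t
                    }
                }
            }
        }()
        return out
    }

    func main() {
        batches := make(chan []string)
        go func() {
            batches <- []string{"a", "b", "a"}
            batches <- []string{"b", "c"}
            close(batches)
        }()
        for t := range processBatches(batches, 4) {
            fmt.Println(t) // a, b, c in some order
        }
    }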

Hope that helps. If not, can you provide more information about what you’re doing and what you’ve tried?

@skillian, thanks for giving your time to my problem.

By "RDF triple" I meant:

    startnodeid relationname endnodeid

Relations will be unidirectional. I will give the logic explicit input on whether to build a relation between two columns or not, including the exact name of the relation between them. The logic need not deduce any relation; it will just follow the given definitions.

The K variable will control the number of goroutines created, so I can tune it if I face issues on the deployed hardware.

The N variable will control the number of rows (the batch size) given to the processing logic in one go. I will plan the batch size per the hardware and the source table size. The source table can be any size; a few million rows will be the standard case. For example, say the source table has 1 million rows: I will read them in batches of N = 50K or 100K and hand only those rows to the logic. The thinking is that reading all rows in one shot would choke memory, since the processing logic also needs memory.

Since each batch of rows is processed in isolation, sort or map-reduce logic will work within a batch, but duplication across batches is still the major concern. How can I avoid creating a node/relation triple if it has already been created in any batch so far? I am not looking for zero duplication, but I want to minimize it as far as possible while also keeping the process stable, since memory will be the bottleneck in this case.
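
To make "minimize, but not necessarily eliminate, duplicates under a memory cap" concrete, here is an untested sketch of the trade-off I have in mind (the type and its names are just placeholders): an exact, fixed-capacity seen-set with FIFO eviction. A hit is always a true duplicate, so skipping is safe; a miss may be a duplicate whose key was evicted, which only costs a repeated triple in the output, never a lost one.

    package main

    import "fmt"

    // boundedSeen is an exact, fixed-capacity "recently seen" set.
    // A hit is always a true duplicate; a miss may be a duplicate
    // whose key was evicted, producing a repeated triple at worst.
    type boundedSeen struct {
        max   int
        order []string // FIFO eviction queue
        seen  map[string]struct{}
    }

    func newBoundedSeen(max int) *boundedSeen {
        return &boundedSeen{max: max, seen: make(map[string]struct{}, max)}
    }

    // add reports whether key should be emitted (not recently seen).
    func (b *boundedSeen) add(key string) bool {
        if _, ok := b.seen[key]; ok {
            return false
        }
        if len(b.order) >= b.max { // evict the oldest entry
            old := b.order[0]
            b.order = b.order[1:]
            delete(b.seen, old)
        }
        b.seen[key] = struct{}{}
        b.order = append(b.order, key)
        return true
    }

    func main() {
        c := newBoundedSeen(2)
        for _, k := range []string{"t1", "t2", "t1", "t3", "t1"} {
            fmt.Println(k, c.add(k))
        }
        // t1 true, t2 true, t1 false, t3 true (evicts t1),
        // t1 true again: one duplicate slips through.
    }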

I will be happy to answer any further questions. I am looking for ideas on this, as I lack Go experience.

Thanks,
Naresh
