Thursday, November 07, 2013
Why Entire partition is used in LOOKUP stage ?
Entire partition has all data across the nodes So while matching(in lookup) the records all data should be present across all nodes.
For lookup sorting is not required.so when we are not using entire partition then reference data splits into all nodes. Then each primary record need check with all nodes for matched reference record.Then we face performance issue.If we use entire in lookup then one primary record needs to look into 1 node is enough.if match found then that record goes to target otherwise it move to reject,drop etc(based on requirement)no need check in another node.In this case if we are running job in 4 nodes then at a time 4 records should process.
Note:Please remember we go for lookup only we have small reference data.If we go for big data it is performance issue(I/O work will increase here) and also some times job will abort.