Thursday, January 31, 2008

Table/Collection lookup with PowerShell

If you have to look up data in a collection repeatedly, that can be quite slow.

Let us say you have two collections and want to update one collection based on a matching key in the other. You can do it like this -

$collA | % { $itemA=$_; $collB | ? { $_.key -eq $itemA.key} | % { $_.value=$itemA.value } } 


But in this scenario, you have to iterate over $collB for each value in $collA, which is roughly n x m comparisons. This is not efficient, especially not when you have more than 50,000 elements in both collections.



As a way to speed things up, you can build a lookup table. Note that this only works if the key is unique in $collB (a sketch for non-unique keys follows further down). The best choice for this is a hash table. You create it like this -



$collB | % { $hashB=@{} } { $key=$_.key; $hashB.$key=$_ }


Now you can use the hash in your code -



$collA | % { $itemA=$_; $key=$itemA.key; $objB=$hashB.$key; $objB.value=$itemA.value }
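
If $collA can contain keys that have no match in $collB, then $hashB.$key returns $null and the assignment above throws an error. A minimal sketch of the same loop with a guard, assuming you simply want to skip the missing keys (same variable names as above):

$collA | % {
    $itemA = $_
    $key = $itemA.key
    $objB = $hashB.$key
    # skip keys from $collA that have no entry in $hashB
    if ($objB) { $objB.value = $itemA.value }
}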


The traversal of $collB is replaced by a much faster hash table lookup.
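
If the key is not unique in $collB, the same trick still works; you just let each entry in the lookup table hold an array of the matching objects. A rough sketch, using $groupB as a name for this variant:

# build a hash of arrays, one entry per distinct key
$collB | % { $groupB=@{} } {
    $key = $_.key
    if (-not $groupB.ContainsKey($key)) { $groupB[$key] = @() }
    $groupB[$key] += $_
}

# update every object in $collB that shares the key with $itemA
$collA | % {
    $itemA = $_
    $groupB[$itemA.key] | % { if ($_) { $_.value = $itemA.value } }
}

Group-Object key -AsHashTable builds a similar key-to-group table in a single call, if your PowerShell version supports it.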



If you only need to know whether a matching key exists in $collB, just do it like this -





$collA | % { $itemA=$_; $key=$itemA.key; if ($hashB.$key) { "found" } }
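
The if ($hashB.$key) test relies on the stored entry being 'truthy'; if an entry could legitimately be $null, 0 or $false, the hash table's ContainsKey method is a safer membership test. A small sketch:

$collA | % { $itemA=$_; if ($hashB.ContainsKey($itemA.key)) { "found" } }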




I did a performance comparison (with only 500 items) so you can see the difference -



# Build two collections


1..500 | % { $collA=@(); $collB=@() } {
    $c=$_
    $collA += "" | select @{n="Key";e={$c}},@{n="Blah";e={$c+100000}}
    $collB += "" | select @{n="Key";e={$c}},@{n="Blah2";e={$c+200000}}
}


# Run 'normal' lookup


measure-command {
    $collA | % { $itemA=$_; $collB | ? { $_.key -eq $itemA.key} | % { $_.blah2=$itemA.blah+100000 } }
}
# It took 27.3 seconds

# Build hash and run hash lookup


measure-command {
    $collB | % { $hashB=@{} } { $key=$_.key; $hashB.$key=$_ }
    $collA | % { $itemA=$_; $key=$itemA.key; $objB=$hashB.$key; $objB.blah2=$itemA.blah+200000 }
}

# It took 0.1 seconds


Do I need to say more?
