Friday, December 12, 2008

Matching multi-line text and converting it into objects

Sometimes your input is simply text and you have to work with it. If it is line-based, i.e. each object is on a separate line, that is not so hard. But what if the output is very dynamic and the information you need to combine spans multiple lines? In that case, you could build some kind of state machine: when the starting point is reached, set a variable, save data as you go, and when the ending point or the next starting point is reached, construct the object and emit it to the pipeline so it can be used.
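To make the state-machine idea concrete, here is a minimal sketch of that approach. The input format (a "Name:" line starting each record, followed somewhere below by a "Value:" line) and the variable names are made up for illustration only; they have nothing to do with repadmin.

```powershell
# Hypothetical sample input: each record starts with a "Name:" line
$lines = 'Name: first', 'noise', 'Value: 1', 'Name: second', 'Value: 2'

$current = $null
$records = @(
    foreach ($line in $lines) {
        if ($line -match '^Name:\s*(.+)$') {
            # Starting point reached: emit the previous record, if there is one
            if ($null -ne $current) { $current }
            $current = New-Object PSObject
            Add-Member -InputObject $current NoteProperty Name $matches[1]
        }
        elseif (($null -ne $current) -and ($line -match '^Value:\s*(.+)$')) {
            # Save data as we go
            Add-Member -InputObject $current NoteProperty Value $matches[1]
        }
    }
    # End of input reached: emit the last record
    if ($null -ne $current) { $current }
)

$records
```

Even for this tiny format, the bookkeeping (tracking the current record, remembering to emit the last one) is already noticeable, and it grows with every extra field.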

Well, there are better ways, and in this example I'll combine the following techniques to make it much more straightforward and generic -

  • Use regular expressions (regex) for matching the information you want
  • Show how to construct a regex in a readable way spanning multiple lines and containing comments
  • Use a dynamic approach, where the named captures in the regex are automatically converted into properties on the output object. This means that you only have to specify each name once, and the loop processing the matches is totally generic and can be reused.

The example uses output from repadmin /showrepl, but this is just a sample I picked more or less at random. The idea here is to show the use of regex and the conversion of the result into objects; it is not to create a bullet-proof parser for the output from repadmin /showrepl.

Here's the example -

# Define some text, in this case the text is stored in $text, but
# it could come from anywhere

$text=@'
DC Options: IS_GC
Site Options: (none)
DC object GUID: aff429b7-5694-4bda-ae4a-daa7d371ce0f
DC invocationID: b578e349-5846-46c5-8e01-b4a81d609e27
==== INBOUND NEIGHBORS ======================================
DC=company,DC=org
BLL\045ADDC001 via RPC
DC object GUID: 26446473-3433-4c73-942d-c750f0e476ec
Last attempt @ 2007-08-21 13:38:53 was successful.
CN=Configuration,DC=company,DC=org
BLL\045ADDC001 via RPC
DC object GUID: 26446473-3433-4c73-942d-c750f0e476ec
Last attempt @ 2007-08-21 13:38:53 was successful.
CN=Schema,CN=Configuration,DC=company,DC=org
BLL\045ADDC001 via RPC
DC object GUID: 26446473-3433-4c73-942d-c750f0e476ec
Last attempt @ 2007-08-21 13:38:54 was successful.
DC=DomainDnsZones,DC=company,DC=org
BLL\045ADDC001 via RPC
DC object GUID: 26446473-3433-4c73-942d-c750f0e476ec
Last attempt @ 2007-08-21 13:38:54 was successful.
DC=ForestDnsZones,DC=company,DC=org
BLL\045ADDC001 via RPC
DC object GUID: 26446473-3433-4c73-942d-c750f0e476ec
Last attempt @ 2007-08-21 13:38:54 was successful.
'@

$regex=[regex] "(?msx)
# Option m = multi-line e.g. ^=start of line and $=end of line
# Option s = single-line e.g. . includes end-of-line
# Option x = spaces and comments are allowed in the pattern making this
# line possible

# Start of line (^), match partition, eat until end of line ($)
^ (?<partition> (CN|DC)=[^`$]+? )`$

# any chars - ? means lazy i.e. match as few characters as possible
.+?

# match site before \ using a series of wordchar
(?<Site> \w+) \\

# match domain controller afterwards
(?<DC> \w+)

# any chars
.+?

# match the date last attempted, note the spaces are escaped as option x is used
Last\ attempt\D+ (?<date> [\d\-]+\ [\d\:\.apm]+ )
"


# Search for pattern matches in $text
$regex.matches($text) | Foreach-Object {
    # Save the current pipeline object, so it is available inside the next Foreach-Object
    $match=$_
    # Construct a new, empty object. Always return objects as output whenever possible. It
    # makes using the output much easier
    $obj=new-object object
    # Get all the group names defined in the pattern - ignore the numeric, auto ones
    $regex.GetGroupNames() | Where-Object {$_ -notmatch '^\d+$'} | Foreach-Object {
        # And add each match as a property. When multiple results are returned, the
        # value must be picked up using an index number hence the GroupNumberFromName call
        add-member -inputobject $obj NoteProperty $_ $match.groups[$regex.GroupNumberFromName($_)].value
    }
    # emit the object to the pipeline
    $obj
}



This is the output -




partition                     Site                          DC                            date
---------                     ----                          --                            ----
DC=company,DC=org...          BLL                           045ADDC001                    2007-08-21 13:38:53
CN=Configuration,DC=compan... BLL                           045ADDC001                    2007-08-21 13:38:53
CN=Schema,CN=Configuration... BLL                           045ADDC001                    2007-08-21 13:38:54
DC=DomainDnsZones,DC=compa... BLL                           045ADDC001                    2007-08-21 13:38:54
DC=ForestDnsZones,DC=compa... BLL                           045ADDC001                    2007-08-21 13:38:54
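Since the loop does not depend on the particular pattern, it can be wrapped in a function and reused. The following is only a sketch: the function name ConvertFrom-NamedCapture, its parameter names, and the key=value sample input are all made up for this illustration.

```powershell
# Hypothetical helper: wraps the generic loop so it can be reused with any
# regex that contains named captures
function ConvertFrom-NamedCapture {
    param(
        [regex]  $Pattern,
        [string] $Text
    )

    # Named groups only; the automatically numbered groups are skipped
    $names = $Pattern.GetGroupNames() | Where-Object { $_ -notmatch '^\d+$' }

    foreach ($match in $Pattern.Matches($Text)) {
        $obj = New-Object PSObject
        foreach ($name in $names) {
            # Same lookup as in the example above: group name -> group number -> value
            $value = $match.Groups[$Pattern.GroupNumberFromName($name)].Value
            Add-Member -InputObject $obj NoteProperty $name $value
        }
        # Emit one object per match
        $obj
    }
}

# Usage with a small, made-up input
$sample = "alpha=1`nbeta=2"
$result = @(ConvertFrom-NamedCapture -Pattern '(?m)^(?<key>\w+)=(?<value>\d+)$' -Text $sample)
$result
```

Adding a new property to the output then only requires adding a new named capture to the pattern; the function itself stays unchanged.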



A final note: If your input is an array of strings, you have to convert it into a single text string before calling $regex.matches. This can easily be done by joining the elements using newline as a delimiter -




[string]::join("`n",$array)
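
As a small illustration of the join, here is a sketch with a made-up two-element array standing in for line-based input such as the result of Get-Content. The -join operator shown at the end assumes PowerShell 2.0 or later.

```powershell
# Made-up array standing in for line-based input, e.g. from Get-Content
$array = 'DC=company,DC=org', 'BLL\045ADDC001 via RPC'

# Join the elements into one string, separated by newlines
$joined = [string]::Join("`n", $array)

# The -join operator (assumes PowerShell 2.0 or later) gives the same result
$joinedWithOperator = $array -join "`n"
```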

1 comment:

motardgeek said...

Thanks..
This was what I was looking for...

Created a REGEX to convert an SRT file to an object.