This HOWTO covers a real world example of how to use Convert-FromString which was introduced in PowerShell 5. As a reminder, this is the powerful new cmdlet that allows you to parse any kind of text data and convert the resulting data into structured PowerShell objects by defining “templates” for how the data is laid out and what information you want to extract.
The largest mall in my city includes literally hundreds of stores. I needed to shop for a particular kind of thing and wanted to know what stores I might want to check out. I started by visiting the website for the mall. This ended up giving me output that looked like this:
It’s not bad certainly but I would like to apply some filters to the dataset. Now in reality, I could have certainly figured out what I wanted from this website but I realized this would be a great opportunity to see if I could make the ConvertFrom-String and FlashExtract do something useful. Could I make a PowerShell object out of this data?
The first thing I did was download the webpage in PowerShell using the Invoke-WebRequest cmdlet. Powershell includes a “parseddata” object that tries to break down any webpage into its component parts and return the results as nested objects. I looked at the output and discovered that all of the data I was interested in was stored in parseddata.documentElement.outertext. This gave me the following results:
That is what I was looking for! You’ll note that once the data starts, it follows the following format:
[Storename]
[Category] Telephone [Phone Number]
This is exactly the kind of thing FlashExtract was designed for. The problem though is those alphabet headers (ie “A A”, “B B”, “C C”, etc) as well as the word “Close” in each entry. I needed to remove those. The “Close” was pretty simple as that was just a find and replace. But the headers were more interesting. Ultimately I solved this by using regular expressions, specifically “^[A-Z] [A-Z]” which says the start of the string must be capital letter space capital letter which worked great.
The next challenge I had was the “Telephone” keyword that was included in each line. FlashExtract was able to handle this fine… until some of the categories started including the word “Telephone” as well. I fought with this for a while but ultimately realized that the best way to work with Flash Extract was to ensure that the source data you give it is as clean as possible. I noted that the “Telephone” label always had two spaces after it. I used this as a means of identifying the label as opposed to the category name and removing it entirely.
At this point I was now ready to define my template. This was a lot of trial and error as I had to define a template entry for each type of dataset that would be returned. The complications came from the following kinds of entries:
BEBE-(Coming Soon) (Includes special characters such as dashes, brackets, commas, etc)
Weekend MaxMara (Includes camel case spelling)
In order to figure this out, I stumbled upon an absolutely amazing script/tool called Convert-FromString Buddy. That’s available here:
http://dougfinke.com/blog/powershell-v5-0-convertfrom-string-buddy/
What this basically does is provide you a real-time IDE to test your templates against your actual dataset. The killer feature here is that as you make changes to the template or the source data, the results are updated instantly make it incredibly fast to iterate on your template designs and identify issues. Check out a screenshot below to see how that works in this example:
The end result is that you end up with a standard powershell object including storenames, categories and phone numbers that you can now filter to your hearts content.
For example, below is a filter that shows any categories that include the word ‘apparel’ (Ie to include both Mens, Women’s and Unisex) and are not marked as ‘Coming Soon’
ConvertFrom-String is going to prove to be an invaluable tool in my toolkit going forward. Thanks PowerShell team!
# Connects to the Chinook Mall web site and downloads a list of all stores, categories and phone numbers available at the mall # Home; Chinook; FlashExtract; ConvertFrom-String; Regex; # $Define the webpage that contains the list of all of the stores located at the mall $MallStoresURL = 'https://www.cfshops.com/chinook-centre/stores.html' # Connect to the website and extract the plain text from the parsedhtml document element that contains strings for all of the storename, categories and phone numbers $html = ((invoke-webrequest -uri $MallStoresURL).parsedhtml.documentElement.outertext) -split "`r" # The returned text includes a bunch of crap preceeding our content we don't care about. The actual relevant text starts immediately after the '# #' line $StartLine = (($html | Select-String '# #').LineNumber) # Likewise there is a bunch of stuff after the text we care about that is not needed. The last line after our content is always 'Search by Category' so we'll ignore it and everything after $EndLine = (($html | Select-String 'Search by Category').LineNumber)-2 # Extract out all the text between the start and end lines, also excluding the letter headers in the format of "A A", "B B", "C C" etc. # Each entry also for whatever reason includes the word "Close" on its own line so we'll filter this out as well $RawStores = ($html[$StartLine..$EndLine]).trim() | where {$_ -notmatch "^[A-Z] [A-Z]" -and $_ -ne 'Close'} <# The extracted content looks like this: Wilfred Ladies Apparel Telephone 403-640-1287 Williams-Sonoma Housewares, Home Furnishings & Decor Telephone 403-410-9191 Wind Mobile Electronics, Computers and Telephones Telephone 403-536-0901 Note that every other line also includes the word "Telephone". However, "Telephone" is also included in some category names. This confuses FlashExtract Note that the Telephone label is always followed by two spaces. In order not to confuse FlashExtract, we'll remove this label entirely #> $RawStores = $RawStores -replace ' Telephone ', ' ' # In order to include the last line of text, there needs to be an additional blank line so FlashExtract knows that we are at End of File $RawStores += "`r`n" <# This is where we define our template of the kinds of entries that FlashExtract will encounter These include: - Entries where no phone number is included at all - Entries that include non-alpha characters in their name (dashes, brackets, commas etc) - Entries that include camel case inside a name (ie MaxMara) #> $Template = @' {StoreInfo*:{Name:BEBE-(Coming Soon)} {Category: Ladies Apparel}} {StoreInfo*:{Name:7 For All Mankind} {Category:Unisex Apparel} {Phone:403-262-7326}} {StoreInfo*:{Name:A & W} {Category:Fast Food} {Phone:403-640-4127}} {StoreInfo*:{Name:Addition-Elle} {Category:Ladies Apparel} {Phone:403-252-5155}} {StoreInfo*:{Name:WATCH IT!} {Category:Fashion Accessories, Jewellery} {Phone:403-255-8060}} {StoreInfo*:{Name:Weekend MaxMara} {Category:Ladies Apparel} {Phone:587-293-1800}} {StoreInfo*:{Name:Wind Mobile} {Category:Electronics, Computers and Telephones} {Phone:403-536-0901}} '@ # Clear the $Stores variable so it will be empty for each execution of the script $Stores = $null # Take the raw text from the website and pass it to ConvertFrom-String including the template we defined above and returning the storeinfo object $Stores = ($RawStores | ConvertFrom-String -TemplateContent $Template).storeinfo # Some of the entries do no include phone numbers so we'd like to replace those with the text 'NOT AVAILABLE' # We do that by looping through each entry to look for those that do not have a phone number but that DO have a name # This way we don't add the "NOT AVAIALBLE" to that final blank line we added above # Since the objects without a phone number won't include that property, we have to add it manually using Add-Member $Stores | ForEach-Object {if (!$_.Phone -and $_.name) { $_ | Add-Member -MemberType NoteProperty -Name 'Phone' -Value 'NOT AVAILABLE'} } $Stores
Alternatively if you’d like a short version of the code without any comments, here you go:
$html = ((invoke-webrequest -uri 'https://www.cfshops.com/chinook-centre/stores.html').parsedhtml.documentElement.outertext) -split "`r" $StartLine = (($html | Select-String '# #').LineNumber) $EndLine = (($html | Select-String 'Search by Category').LineNumber)-2 $RawStores = ($html[$StartLine..$EndLine]).trim() | where {$_ -notmatch "^[A-Z] [A-Z]" -and $_ -ne 'Close'} $RawStores = $RawStores -replace ' Telephone ', ' ' $RawStores += "`r`n" $Template = @' {StoreInfo*:{Name:BEBE-(Coming Soon)} {Category: Ladies Apparel}} {StoreInfo*:{Name:7 For All Mankind} {Category:Unisex Apparel} {Phone:403-262-7326}} {StoreInfo*:{Name:A & W} {Category:Fast Food} {Phone:403-640-4127}} {StoreInfo*:{Name:Addition-Elle} {Category:Ladies Apparel} {Phone:403-252-5155}} {StoreInfo*:{Name:WATCH IT!} {Category:Fashion Accessories, Jewellery} {Phone:403-255-8060}} {StoreInfo*:{Name:Weekend MaxMara} {Category:Ladies Apparel} {Phone:587-293-1800}} {StoreInfo*:{Name:Wind Mobile} {Category:Electronics, Computers and Telephones} {Phone:403-536-0901}} '@ $Stores = ($RawStores | ConvertFrom-String -TemplateContent $Template).storeinfo $Stores | ForEach-Object {if (!$_.Phone -and $_.name) { $_ | Add-Member -MemberType NoteProperty -Name 'Phone' -Value 'NOT AVAILABLE'} } $Stores