HOWTO: Parse HTML using PowerShell

Unimportant Backstory

Today I had the misfortune of discovering that one of the drives in my FreeNAS box had failed. I replaced the drive and wanted to watch the progress of the rebuild. The FreeNAS web management console has a section that shows the number of sectors synchronized and the percent complete, but that’s only useful if you stare at it. I want to know if the rebuild has locked up, which means grabbing this value periodically and sending an email alert if it doesn’t change after a certain period.

But before I can do any of that, I need to start with the basics: pulling the actual HTML from the website so I can parse it and build the interesting parts from there.

Important Part

The code below has the following capabilities:

  • Authenticates programmatically against PHP-based (and possibly other) authentication mechanisms that use a login form
  • Connects to a specific URL and pulls down all of the raw HTML for that page into a variable for further manipulation

This is certainly a handy snippet to keep in your back pocket!

 

# This is the URL that when visited with a web browser contains the username and password fields to fill in
$LoginURL = "http://yourwebsite.com/login.php"

# This is the URL of the page you actually want to pull content from but if accessed directly will normally just redirect you to the login page above
$ContentURL = "http://yourwebsite.com/someothercontentthatfirstrequiresauthentication.php"

# The username and password used to authenticate with the site above
$Username = "hero"
$Password = "superman"

# Request the login page; the response object includes the login form with its username and password fields
$website = Invoke-WebRequest -Uri $LoginURL

# Note: the "username" and "password" field names used here may be different on your site.
# Verify by checking the contents of $website.Forms[0].Fields
$website.Forms[0].Fields.username = $Username
$website.Forms[0].Fields.password = $Password

# Connect to the login URL, send the filled-in form fields as a POST, and save the resulting session
Invoke-WebRequest -Uri $LoginURL -SessionVariable WebSession -Body $website.Forms[0].Fields -Method Post | Out-Null

# Now that we're authenticated, connect to the actual URL you want and pass in the session object you created above
$data = Invoke-WebRequest -Uri $ContentURL -WebSession $WebSession

# The response carries a ton of other metadata that you most likely don't care about.
# If you just want the raw HTML so you can pull specific content out of it, use the "outerHTML" property as shown below
$HTMLOutput = $data | Select-Object -ExpandProperty ParsedHtml | Select-Object -ExpandProperty IHTMLDocument3_documentElement | Select-Object -ExpandProperty outerHTML

# Display the results to the screen.  This will be the raw HTML returned by the site.  You can now do whatever you'd like with it.
$HTMLOutput
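
As a rough sketch of the "do whatever you'd like with it" step, the snippet below pulls a percent-complete figure out of an HTML string and compares it with the value from an earlier poll, which is the stall check described in the backstory. Everything specific in it is an assumption made for illustration: the sample markup is made up (the real FreeNAS page will look different), and the Get-PercentComplete and Test-RebuildStalled helpers are hypothetical names, not part of PowerShell or FreeNAS. Adjust the regular expression to match whatever your page actually returns.

```powershell
# Hypothetical sample standing in for $HTMLOutput; the real FreeNAS markup will differ
$SampleHtml = '<html><body><td id="sync">Synchronized: 1048576 sectors (42%)</td></body></html>'

# Hypothetical helper: returns the first "NN%" figure found in the HTML, or $null if there is none
function Get-PercentComplete {
    param([string]$Html)
    if ($Html -match '(\d{1,3})\s*%') {
        # $Matches is populated automatically by the -match operator
        return [int]$Matches[1]
    }
    return $null
}

# Hypothetical helper: treat the rebuild as stalled when the value has not advanced between two polls
function Test-RebuildStalled {
    param([int]$Previous, [int]$Current)
    return ($Current -le $Previous)
}

# Value from the current poll
$Percent = Get-PercentComplete -Html $SampleHtml

# Pretend the previous poll also reported 42, so no progress was made
$Stalled = Test-RebuildStalled -Previous 42 -Current $Percent
```

In a real monitor you would pass $HTMLOutput instead of $SampleHtml, keep the previous value between polls, and send the email alert when $Stalled is true. The polling loop and the alerting step are left out here.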
