Recently I needed to extract all URLs from several files. I thought this was a fun little challenge and a chance to improve my limited skills in PowerShell and regular expressions.

Solution

After working on the problem for a while, I ended up with this little piece of code.

Console
Get-ChildItem *.txt -Recurse `
  | Get-Content `
  | Select-String -Pattern 'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)' -AllMatches `
  | % { $_.Matches } `
  | % { $_.Value } `
  | Sort-Object `
  | Get-Unique
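
To get a feeling for what the pattern actually matches, you can try it against a throwaway string before running it across your files. The sample text below is made up purely for illustration.

Console
$pattern = 'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)'
$sample  = 'See https://example.com/docs and http://www.example.org?q=1 for details.'

# Matches returns every occurrence of the pattern in the string
[regex]::Matches($sample, $pattern) | % { $_.Value }
# https://example.com/docs
# http://www.example.org?q=1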

In short, the script does this (a spelled-out version follows the list):

  • Finds all files with the .txt file extension, including in subdirectories.
  • Reads the content of each file.
  • Gets all strings matching the regular expression pattern, which I found in this StackOverflow thread.
  • Loops through all of the Matches collections.
  • Selects the Value property of each match.
  • Sorts the output.
  • Gets all unique values.
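
If the % aliases feel terse, the same pipeline can be spelled out with the full cmdlet names. Breaking each line after the pipe instead of with a backtick also makes room for a comment per step; this version is equivalent to the one-liner above.

Console
$pattern = 'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)'

Get-ChildItem *.txt -Recurse |                   # find all .txt files, including in subdirectories
  Get-Content |                                  # read the lines of every file
  Select-String -Pattern $pattern -AllMatches |  # keep every match of the URL pattern
  ForEach-Object { $_.Matches } |                # expand the Matches collection of each result
  ForEach-Object { $_.Value } |                  # select the matched text itself
  Sort-Object |                                  # sort, since Get-Unique expects sorted input
  Get-Unique                                     # drop the duplicates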

If you want to know how many instances there are of each URL, you can use Group-Object instead.

Console
Get-ChildItem *.txt -Recurse `
  | Get-Content `
  | Select-String -Pattern 'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)' -AllMatches `
  | % { $_.Matches } `
  | % { $_.Value } `
  | Group-Object `
  | Sort-Object -Property Count -Descending `
  | Select-Object Count, Name
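
If you would rather save the table than just print it, you could tack Export-Csv onto the end of the pipeline; the file name here is just a placeholder.

Console
Get-ChildItem *.txt -Recurse `
  | Get-Content `
  | Select-String -Pattern 'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)' -AllMatches `
  | % { $_.Matches } `
  | % { $_.Value } `
  | Group-Object `
  | Sort-Object -Property Count -Descending `
  | Select-Object Count, Name `
  | Export-Csv url-counts.csv -NoTypeInformation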

Summary

I am really not a PowerShell expert, so I learned a bit while doing this. Solving these tiny little problems is always fun, and I find a solution especially pleasing when it fits in a single line of code.