Extracting URLs from files with PowerShell
Recently I needed to extract all URLs from several files. I thought this was a fun little challenge where I could improve my limited skills in PowerShell and regular expressions.
Solution
After working on the problem for a while, I ended up with this little piece of code.
Console
Get-ChildItem *.txt -Recurse `
| Get-Content `
| Select-String -Pattern 'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)' -AllMatches `
| ForEach-Object { $_.Matches } `
| ForEach-Object { $_.Value } `
| Sort-Object `
| Get-Unique
In short, the script does this:
- Finds all files with the .txt file extension, including those in subdirectories.
- Reads the content of each file.
- Gets all strings matching the regular expression pattern, which I found in this StackOverflow thread.
- Loops through all Matches from the expression.
- Selects the Value property of each match.
- Sorts the output.
- Gets all unique values.
If you want to know how many instances there are of each URL, you can use Group-Object instead.
Console
Get-ChildItem *.txt -Recurse `
| Get-Content `
| Select-String -Pattern 'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)' -AllMatches `
| ForEach-Object { $_.Matches } `
| ForEach-Object { $_.Value } `
| Group-Object `
| Sort-Object -Property Count -Descending `
| Select-Object Count, Name
Summary
I am really no expert on PowerShell, so I learned a bit while doing this. Solving these tiny problems is always fun, and I find a solution especially pleasing when it fits in a single line of code.