@ashlynnwood
Last active February 22, 2023 19:24

Regex: Matching a URL

Regular expressions (regex) are a helpful tool for text processing tasks such as searching, replacing, and extracting information from text. A regex is a sequence of ordinary and special characters that defines a search pattern. Regular expressions are supported in most programming languages and tools, and they can greatly simplify tasks that would otherwise require a lot of manual effort. Imagine trying to extract every instance of a URL from a 100-page document by using "Command-F" to find each one individually. It sounds terrible and very time-consuming, right? That's why regexes are here to make your life easier.

Summary

One common application of regex is matching URLs. URLs follow a recognizable format, so a regular expression can describe them. A regular expression for matching URLs typically includes the protocol, the domain name, and optionally a path or query parameters. By using regex to match URLs, you can extract information from them or validate that they meet a specific format. The pattern below is a simplified one: it covers the protocol, domain, and path, but not query strings (?key=value) or port numbers. Let's take a look at the URL-matching regex:

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

I know this looks like gibberish and can be intimidating at first glance, so let's break it down and go over its components together.
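
Before the breakdown, it can help to see the pattern run. The snippet below is a minimal sketch, assuming the pattern is used as a JavaScript regex literal; the name urlPattern and the sample strings are made up for illustration.

```javascript
// The URL pattern from this article, written as a JavaScript regex literal.
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// Hypothetical sample inputs chosen for illustration.
const samples = [
  "https://example.com",      // protocol + domain
  "http://example.com/docs/", // path with a trailing slash
  "example.org/about",        // no protocol at all
  "not a url",                // plain text with no dot-separated domain
];

for (const sample of samples) {
  // test() returns true when the whole string fits the pattern.
  console.log(sample, "->", urlPattern.test(sample));
}
// The first three samples print true, the last one prints false.
```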

Regex Components

Anchors

Anchors are special characters that match a position in the string, such as its beginning or end, rather than an actual character.

Regex anchors can be compared to real anchors that are used to fix a boat to a specific position in the water. Just as a boat anchor keeps the boat from drifting away, regex anchors keep the pattern from matching anywhere else in the string.

The ^ anchor matches at the beginning of the string, so whatever follows it in the pattern must appear at the very start. The $ anchor matches at the end of the string, so whatever comes before it in the pattern must appear at the very end. In this regex, ^ means a URL must begin with either the optional http:// or https:// protocol or, if there is none, the domain name. The $ means it must end right after the path and the optional trailing forward slash. Together they require the entire string to be a URL, not just part of it.
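
To see what the anchors change, here is a small sketch in JavaScript. The unanchored pattern is a hypothetical variant with ^ and $ removed, written only for comparison; the variable names and the sample sentence are illustrative.

```javascript
// The article's pattern, with both anchors in place.
const anchored = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;
// Hypothetical variant without ^ and $, for comparison only.
const unanchored = /(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?/;

const text = "visit https://example.com!";

// With anchors, the whole string must be a URL, so this sentence fails.
console.log(anchored.test(text)); // false

// Without anchors, the pattern may match anywhere inside the string.
console.log(unanchored.test(text));     // true
console.log(text.match(unanchored)[0]); // "https://example.com"

// A string that is nothing but a URL passes the anchored pattern.
console.log(anchored.test("https://example.com")); // true
```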

Quantifiers

Quantifiers state how many times a character or group of characters should match. In this way, quantifiers allow for more flexible matching and make it easier to match a wide range of text patterns. This regex has many quantifiers, so let's take a closer look at all of them.

  • ? - This specifies that the preceding character or group is optional. In https?, the ? makes the "s" optional, so URLs that begin with either http or https match the search pattern. The ? at the end of (https?:\/\/)? makes the entire group optional, so the regex also matches URLs written without the protocol at all. Likewise, the ? quantifier at the end of the regex, \/?, makes the trailing forward slash optional.

  • * - This is the "zero or more" quantifier, which allows the preceding character or group to appear any number of times, including zero. This regex contains two * quantifiers. The first one follows the bracket expression in [\/\w \.-]*, which defines the characters allowed in the URL path, and lets those characters appear any number of times. The second * is applied to the whole group, ([\/\w \.-]*)*, so the group itself can repeat zero or more times. Since the forward slash is already part of the bracket expression, a single repetition can cover a path with several segments. As a result, the regex matches URLs with or without a path.

  • + - This is the "one or more" quantifier. It is used in ([\da-z\.-]+), which matches any sequence of one or more characters that are digits, lowercase letters, periods, or hyphens.

  • {} - Curly brackets specify how many times the preceding character or group must match. In ([a-z\.]{2,6}), the {2,6} means that the preceding bracket expression, [a-z\.], must match at least 2 and at most 6 times. In other words, the regular expression is looking for a sequence of 2 to 6 lowercase letters or periods.
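
Each quantifier can be tried on its own. The sketch below pulls pieces out of the URL regex and wraps them in ^ and $ so that each piece is tested in isolation; these mini-patterns and their names are assumptions for illustration, not part of the original regex.

```javascript
// ? : the "s" may appear zero or one time.
const optionalS = /^https?$/;
console.log(optionalS.test("http"));   // true
console.log(optionalS.test("https"));  // true
console.log(optionalS.test("httpss")); // false (two "s" characters)

// * : zero or more path characters, so even an empty string passes.
const pathChars = /^[\/\w \.-]*$/;
console.log(pathChars.test(""));             // true
console.log(pathChars.test("/docs/page-1")); // true

// + : at least one domain character is required.
const domainChars = /^[\da-z\.-]+$/;
console.log(domainChars.test(""));        // false
console.log(domainChars.test("my-site")); // true

// {2,6} : between 2 and 6 characters, inclusive.
const tldChars = /^[a-z\.]{2,6}$/;
console.log(tldChars.test("c"));       // false (too short)
console.log(tldChars.test("com"));     // true
console.log(tldChars.test("example")); // false (7 characters)
```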

Grouping Constructs

A grouping construct combines one or more characters into a single unit within a larger pattern. This allows quantifiers or other operations to be applied to the group as a whole instead of to just one character, again making our lives easier. Grouping constructs are created by enclosing part of the pattern in parentheses. There are two main categories: capturing groups, written (...), which remember the text they matched, and non-capturing groups, written (?:...), which only group.

There are four grouping constructs in the URL-matching regex, all of them capturing:

  • (https?:\/\/) - the optional http:// or https:// at the beginning of the URL
  • ([\da-z\.-]+) - the domain name, which can consist of lowercase letters, digits, dots, and hyphens
  • ([a-z\.]{2,6}) - the top-level domain, such as com or org. The dot in front of it is matched by the \. that sits between this group and the previous one.
  • ([\/\w \.-]*) - the path, which can consist of forward slashes, word characters (letters, digits, and underscores), dots, hyphens, and spaces. This group is repeated zero or more times to allow for multiple path segments.
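
Because the groups are in parentheses, the text each one matched can be read back after a match. This is a rough sketch of how that might look in JavaScript; the variable names and sample URLs are illustrative assumptions.

```javascript
const urlPattern = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/;

// match() returns an array: index 0 is the whole match, 1..4 are the groups.
const match = "https://example.com/docs/page".match(urlPattern);
const [whole, protocol, domain, tld, path] = match;

console.log(protocol); // "https://"
console.log(domain);   // "example"
console.log(tld);      // "com" (the dot before it belongs to no group)
console.log(path);     // "/docs/page"

// Groups that took no part in the match come back as undefined in JavaScript.
const bare = "example.org".match(urlPattern);
console.log(bare[1]); // undefined (no protocol)
console.log(bare[4]); // undefined (no path)
```

Other regex engines may report unused or repeated groups differently, so treat the undefined values as JavaScript behavior rather than a general rule.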

Bracket Expressions

Bracket expressions [] define a set of characters, and the expression matches any single character from that set. Combined with a quantifier, a bracket expression can match a sequence made up of any mix of those characters. There are three bracket expressions here:

  • [\da-z\.-] - Matches a single digit 0-9 (\d), lowercase letter (a-z), dot (.), or hyphen (-). It's important to note that this is case sensitive, so it will only match lowercase letters.
  • [a-z\.] - Matches a single lowercase letter (a-z) or dot.
  • [\/\w \.-] - Matches a single word character, forward slash, space, dot, or hyphen. This set defines which characters are allowed in the URL path.
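
A quick way to check what a bracket expression accepts is to test it against one character at a time. In this sketch the ^ and $ anchors and the names domainChar and pathChar are added for illustration only.

```javascript
// One character from the domain set: digit, lowercase letter, dot, or hyphen.
const domainChar = /^[\da-z\.-]$/;
console.log(domainChar.test("a"));  // true
console.log(domainChar.test("7"));  // true
console.log(domainChar.test("-"));  // true
console.log(domainChar.test("A"));  // false (uppercase is not in the set)
console.log(domainChar.test("ab")); // false (the brackets match one character)

// One character from the path set: word character, slash, space, dot, or hyphen.
const pathChar = /^[\/\w \.-]$/;
console.log(pathChar.test("_")); // true (underscore is a word character)
console.log(pathChar.test(" ")); // true
console.log(pathChar.test("?")); // false (not in the set)
```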

Character Classes

A character class is shorthand for a predefined set of characters. This regex uses two:

  • \d - Matches any digit (0-9)
  • \w - Matches any word character (letters, digits, or underscores)

For example, the \d in ([\da-z\.-]+) specifies that any digit can be included in the group.
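
The two classes can be tried out on their own. This is a small sketch in JavaScript with made-up sample strings; the g flag is added here only so that match() returns every hit instead of just the first.

```javascript
// \d picks out the digits.
const digits = "a1b22_c".match(/\d/g);
console.log(digits); // ["1", "2", "2"]

// \w picks out letters, digits, and underscores, but not "-" or spaces.
const wordChars = "a-b_c 9".match(/\w/g);
console.log(wordChars); // ["a", "b", "_", "c", "9"]
```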

Character Escapes

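The \ is the escape character. When it comes before a character that has a special meaning in regex syntax, that character is interpreted literally instead. For example, \. matches an actual dot rather than "any character". The backslashes in (https?:\/\/) escape the forward slashes, which would otherwise be read as the delimiters that mark the start and end of the regex. The colon has no special meaning here, so it needs no escape. A backslash in front of certain letters does the opposite and gives them a special meaning, as in the character classes \d and \w.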
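
The difference an escape makes is easiest to see side by side. The patterns below are illustrative and not taken from the URL regex; the last one assumes JavaScript's RegExp constructor, where the pattern is a string and the forward slash is not a delimiter.

```javascript
// An unescaped dot matches any character; an escaped dot matches only ".".
const anyChar = /^a.c$/;
const literalDot = /^a\.c$/;
console.log(anyChar.test("abc"));    // true
console.log(literalDot.test("abc")); // false
console.log(literalDot.test("a.c")); // true

// In a regex literal the forward slash must be escaped, because / ends the literal.
const literalForm = /^https?:\/\//;
// Built from a string, the same pattern needs no escape for the slashes.
const constructorForm = new RegExp("^https?://");
console.log(literalForm.test("https://example.com"));     // true
console.log(constructorForm.test("https://example.com")); // true
```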

Author

My name is Ashlynn Wood and I'm a junior software developer. I'm an active contributor on GitHub, where you can find my work under @ashlynnwood.
