In this post, we'll explore some common operations on regular expressions in Python, using examples from the world of astronomy.
Regular expressions are a powerful tool for pattern matching and text processing. Python's re module provides several functions for working with regular expressions, including search(), match(), findall(), and sub().
The search() function scans a string for a pattern and returns a match object for the first place the pattern is found, or None if it is not found. The match() function is similar to search(), but only matches at the beginning of the string. The findall() function returns a list of all non-overlapping matches of a pattern in a string. The sub() function returns a new string in which every occurrence of a pattern is replaced with a specified replacement string.
Here are some examples of using these functions:
import re
text = "The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth."
# Search for a pattern
match = re.search(pattern="spiral", string=text)
if match:
    print(f"Found: {match.group()}")
# Output: Found: spiral
# Match at the beginning of the string
match = re.match(pattern=r"The", string=text)
if match:
    print(f"Found: {match.group()}")
# Output: Found: The
# Find all occurrences of a pattern
matches = re.findall(pattern=r"\b\w{5}\b", string=text)
print(matches)
# Output: ['light', 'years', 'Earth']
# (the hyphen in "light-years" counts as a word boundary, so both halves match)
# Replace all occurrences of a pattern
new_text = re.sub(pattern=r"\d", repl="#", string=text)
print(new_text)
# Output: The Andromeda Galaxy is a spiral galaxy approximately #.# million light-years away from Earth.
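The difference between search() and match() is easiest to see with a pattern that occurs only in the middle of the string. The following is a minimal sketch that reuses the same sample text; the variable names found_anywhere and found_at_start are chosen here for illustration.

```python
import re

text = "The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth."

# "spiral" occurs mid-string, so search() finds it but match() does not,
# because match() only tries the pattern at position 0.
found_anywhere = re.search(r"spiral", text)
found_at_start = re.match(r"spiral", text)

print(found_anywhere.group())  # spiral
print(found_at_start)          # None
```

Because both functions return None on failure, checking the result with an if statement before calling group() avoids an AttributeError.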
Regular expressions can also be used to extract specific information from a text. Here are some examples:
text = "The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth."
# Extract the first two words from a text
match = re.search(pattern=r"^(\w+)\s+(\w+)", string=text)
if match:
    print(f"First word: {match.group(1)}")  # Output: First word: The
    print(f"Second word: {match.group(2)}")  # Output: Second word: Andromeda
# Match a number at the start of the string, but only if it has at least 10 digits
match = re.search(pattern=r"^\d{10}", string=text)
if match:
    print(f"Found: {match.group()}")
# Output: (nothing is printed, because the text does not start with ten digits)
# Separate a number into units and decimals
match = re.search(pattern=r"(\d+)\.(\d+)", string=text)
if match:
    print(f"Units: {match.group(1)}")  # Output: Units: 2
    print(f"Decimals: {match.group(2)}")  # Output: Decimals: 5
# Separate text into words using space characters as reference
words = re.split(pattern=r"\s+", string=text)
print(words)
# Output: ['The', 'Andromeda', 'Galaxy', 'is', 'a', 'spiral', 'galaxy', 'approximately', '2.5', 'million', 'light-years', 'away', 'from', 'Earth.']
# Use regex similar to the strip() function
stripped_text = re.sub(pattern=r"^\s+|\s+$", repl="", string=text)
print(stripped_text)
# Output (unchanged, since this text has no leading or trailing whitespace):
# The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth.
# Remove symbols from a filename except the dot character
filename = "image-of-the-andromeda-galaxy.jpg"
new_filename = re.sub(pattern=r"[^\w\.]", repl="", string=filename)
print(new_filename)
# Output: imageoftheandromedagalaxy.jpg
# Use regex to split a text into a list of words and get the frequency for the list of words
words = re.findall(pattern=r"\b\w+\b", string=text)
word_counts = {}
for word in words:
    word_counts[word] = word_counts.get(word, 0) + 1
print(word_counts)
# Output: {'The': 1, 'Andromeda': 1, 'Galaxy': 1, 'is': 1, 'a': 1, 'spiral': 1, 'galaxy': 1, 'approximately': 1, '2': 1, '5': 1, 'million': 1, 'light': 1, 'years': 1, 'away': 1, 'from': 1, 'Earth': 1}
# Use regex to split the text into sentences and get the frequency for each sentence
sentences = re.split(pattern=r"\.\s+", string=text)
sentence_counts = {}
for sentence in sentences:
    sentence_counts[sentence] = sentence_counts.get(sentence, 0) + 1
print(sentence_counts)
# Output: {'The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth.': 1}
# (the final period stays, because no whitespace follows it for the pattern to match)
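The sample text contains only one sentence, so splitting it into sentences returns it whole. As a minimal sketch of how the split behaves on longer input, the example below uses a hypothetical two-sentence sample (the wording of the sample is an assumption added for illustration) and a lookbehind, so that each period stays attached to its sentence and the decimal point in 2.5 is left alone.

```python
import re

# Hypothetical two-sentence sample, added for illustration only.
sample = (
    "The Andromeda Galaxy is a spiral galaxy. "
    "It is approximately 2.5 million light-years away from Earth."
)

# Split on whitespace that is preceded by a period. The lookbehind (?<=\.)
# checks for the period without consuming it, so it is not removed.
sentences = re.split(r"(?<=\.)\s+", sample)

for sentence in sentences:
    print(sentence)
# The Andromeda Galaxy is a spiral galaxy.
# It is approximately 2.5 million light-years away from Earth.
```

This simple pattern only handles periods followed by whitespace; abbreviations such as "approx. 2.5" would be split too, so real-world text usually needs a more careful rule.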
These are just a few examples of the many powerful ways that regular expressions can be used to process and manipulate text in Python. With a little practice, you'll be able to use regular expressions to solve a wide variety of text-processing problems.