Regular expressions are incredibly powerful, but they're sometimes described as "looking like cartoon characters swearing", and the syntax can be difficult to remember.
I only find myself needing to write code to "capture" and replace values from a string a couple of times a year and always have to re-learn the syntax, so I'm blogging this for my reference. I hope it will be helpful to you also.
I'm including examples in JavaScript and C#...
Here's the scenario for the following examples: We want to log XML payment account setup requests/responses, but the request contains a bank account number:
<Name>Sparky's Bank Account</Name>
<BankRoutingNumber>123456789</BankRoutingNumber>
<BankAccountNumber>987654321</BankAccountNumber>
Bank account numbers are sensitive information, so we don't want to log them as-is.
Here's the basic Regex to find BankAccountNumber XML elements. The "\d+" (one or more digits) pattern defines the account number:
JavaScript
const regex =
  new RegExp('<BankAccountNumber>\\d+<\/BankAccountNumber>', 'g');
C#
var regex = new Regex(
    @"<BankAccountNumber>\d+<\/BankAccountNumber>",
    RegexOptions.Compiled);
...and code using the Regex to find the bank account XML elements:
JavaScript
const matches = requestXml.matchAll(regex);
console.log([...matches]);
C#
foreach (Match match in regex.Matches(requestXml))
{
    Console.WriteLine(match.Value);
}
Results:
<BankAccountNumber>987654321</BankAccountNumber>
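Here's a minimal self-contained JavaScript sketch of the find step. The requestXml value is an assumption (the three sample elements joined into one string), and the regex is written as a literal, which avoids the doubled backslash that a string pattern needs:

```javascript
// Assumed sample request: just the three elements shown earlier, as one string.
const requestXml = [
  "<Name>Sparky's Bank Account</Name>",
  '<BankRoutingNumber>123456789</BankRoutingNumber>',
  '<BankAccountNumber>987654321</BankAccountNumber>',
].join('\n');

// Regex literal form: \d needs only one backslash here.
const regex = /<BankAccountNumber>\d+<\/BankAccountNumber>/g;

// matchAll requires the 'g' flag; each result is a match array whose
// element 0 is the full matched text.
const found = [...requestXml.matchAll(regex)].map(m => m[0]);
console.log(found);
```

Mapping each match to element 0 keeps only the matched text, which is what the results above show.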
Replacing
Let's replace the bank account numbers with asterisks:
JavaScript
const censored = requestXml.replace(
  regex,
  '<BankAccountNumber>*********</BankAccountNumber>');
console.log(censored);
C#
string censored = regex.Replace(
    requestXml,
    "<BankAccountNumber>*********</BankAccountNumber>");
Console.WriteLine(censored);
Results:
<Name>Sparky's Bank Account</Name>
<BankRoutingNumber>123456789</BankRoutingNumber>
<BankAccountNumber>*********</BankAccountNumber>
Capturing
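A capture group is defined by enclosing part of the regex ("\d+" in this example) in parentheses. The pattern is shown here as it appears inside the JavaScript string literal, where the backslash is doubled (the C# verbatim string uses a single "\d"):
<BankAccountNumber>(\\d+)<\/BankAccountNumber>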
The entire match (the XML element in this example) is an automatic capture group, so this will result in two capture groups:
JavaScript
const matches = requestXml.matchAll(regex);
for (const match of matches) {
  const captures = [...match];
  for (let i = 0; i < captures.length; i++) {
    console.log(`[${i}] ${captures[i]}`);
  }
}
C#
foreach (Match match in regex.Matches(requestXml))
{
    for (int i = 0; i < match.Groups.Count; i++)
    {
        Console.WriteLine($"[{i}] {match.Groups[i]}");
    }
}
Results:
[0] <BankAccountNumber>987654321</BankAccountNumber>
[1] 987654321
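Numbered groups can also be referenced from a replacement string as $1, $2, and so on. As a minimal sketch (the extra groups around the tags are a variation for illustration, not the pattern used above, and sampleXml is an assumed input), this reuses the captured tags instead of retyping them:

```javascript
// Assumed sample input; only the account number element matters here.
const sampleXml = '<BankAccountNumber>987654321</BankAccountNumber>';

// Group 1 captures the opening tag, group 2 the closing tag.
const tagRegex = /(<BankAccountNumber>)\d+(<\/BankAccountNumber>)/g;

// $1 and $2 in the replacement string insert the captured tags.
const starred = sampleXml.replace(tagRegex, '$1*********$2');
console.log(starred); // <BankAccountNumber>*********</BankAccountNumber>
```

C#'s Regex.Replace accepts the same $1 style of substitution in its replacement string.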
Named captures
You can name a capture group with the syntax "(?<name>pattern)". In this example, I'm using the name "acctNum":
<BankAccountNumber>(?<acctNum>\\d+)<\/BankAccountNumber>
(The angle bracket syntax for group naming is a bit confusing for this example because it looks like XML.)
Using a named capture:
JavaScript
const matches = requestXml.matchAll(regex);
for (const match of matches) {
  console.log(match.groups.acctNum);
}
C#
foreach (Match match in regex.Matches(requestXml))
{
    Console.WriteLine(match.Groups["acctNum"].Value);
}
Results:
987654321
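A named group is still numbered, so the same value is also available by index. Here's a minimal self-contained sketch, assuming a one-element sample string (xml) in place of the full request:

```javascript
// Assumed sample input standing in for the request XML.
const xml = '<BankAccountNumber>987654321</BankAccountNumber>';

// (?<acctNum>...) names the group; the literal form needs no doubled backslash.
const namedRegex = /<BankAccountNumber>(?<acctNum>\d+)<\/BankAccountNumber>/g;

// Take the first match from the iterator that matchAll returns.
const [first] = xml.matchAll(namedRegex);

console.log(first.groups.acctNum);              // 987654321
console.log(first[1] === first.groups.acctNum); // true
```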
Replacing with a function
We shouldn't log account numbers, but let's say our security policy allows logging them "masked". Let's try replacing with a "callback" function that replaces all but the last four digits with asterisks:
JavaScript
const censored = requestXml.replaceAll(regex, (match, acctNum) => {
  // The callback receives the full match, then each capture group in order,
  // so the second argument is the acctNum group.
  const len = acctNum.length;
  const masked = (len > 4)
    ? '*'.repeat(len - 4) + acctNum.slice(len - 4)
    : '*'.repeat(len);
  return `<BankAccountNumber>${masked}</BankAccountNumber>`;
});
console.log(censored);
C#
string censored = regex.Replace(requestXml, match =>
{
    string accountNumber = match.Groups["acctNum"].Value;
    int len = accountNumber.Length;
    string masked = (len > 4)
        ? new string('*', len - 4) + accountNumber.Substring(len - 4)
        : new string('*', len);
    return $"<BankAccountNumber>{masked}</BankAccountNumber>";
});
Console.WriteLine(censored);
Results:
<Name>Sparky's Bank Account</Name>
<BankRoutingNumber>123456789</BankRoutingNumber>
<BankAccountNumber>*****4321</BankAccountNumber>
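If the masking rule is needed in more than one place, it could be pulled out into a small helper. This is a minimal sketch: maskAccountNumber and maskXml are hypothetical names, and keeping four visible digits is the policy assumed in this example:

```javascript
// Hypothetical helper: keep the last `visible` digits and star out the rest.
// Numbers no longer than `visible` are masked entirely.
function maskAccountNumber(acctNum, visible = 4) {
  const len = acctNum.length;
  return len > visible
    ? '*'.repeat(len - visible) + acctNum.slice(len - visible)
    : '*'.repeat(len);
}

// Hypothetical wrapper: applies the helper to every BankAccountNumber element.
function maskXml(xml) {
  const regex = /<BankAccountNumber>(?<acctNum>\d+)<\/BankAccountNumber>/g;
  // The callback receives the full match, then capture group 1 (acctNum).
  return xml.replace(regex, (_match, acctNum) =>
    `<BankAccountNumber>${maskAccountNumber(acctNum)}</BankAccountNumber>`);
}

console.log(maskXml('<BankAccountNumber>987654321</BankAccountNumber>'));
// <BankAccountNumber>*****4321</BankAccountNumber>
console.log(maskAccountNumber('123')); // ***
```

Keeping the masking logic separate from the regex makes the short-number case easy to check on its own.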