Capturing all matches of a string value from an array of regex patterns, while prioritizing closest matches

Question

Let's say I have an array of names, along with a regex union of them:

match_array = [/Dan/i, /Danny/i, /Daniel/i]
match_values = Regexp.union(match_array)

I'm using a regex union because the actual data set I'm working with contains strings that often have extraneous characters, whitespace, and varied capitalization.

I want to iterate over a series of strings to see if they match any of the values in this array. If I use .scan, only the first matching element is returned:

'dan'.scan(match_values) # => ["dan"]
'danny'.scan(match_values) # => ["dan"]
'daniel'.scan(match_values) # => ["dan"]
'dannnniel'.scan(match_values) # => ["dan"]
'dannyel'.scan(match_values) # => ["dan"]
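
For context on why every string above yields only "dan": alternation tries its branches left to right and stops at the first one that matches, and scan then resumes after the consumed text. A minimal sketch, assuming "closest" means "longest pattern source", shows that building the union with the longest patterns first changes which single match is returned. The constant names are made up for this illustration.

```ruby
PATTERNS = [/Dan/i, /Danny/i, /Daniel/i]

# Alternation is tried left to right, so place the longest sources first.
LONGEST_FIRST = Regexp.union(PATTERNS.sort_by { |re| -re.source.length })

%w[dan danny daniel dannnniel dannyel].each do |s|
  puts "#{s}: #{s.scan(LONGEST_FIRST).inspect}"
end
# dan: ["dan"]
# danny: ["danny"]
# daniel: ["daniel"]
# dannnniel: ["dan"]
# dannyel: ["danny"]
```

This still returns one match per position, so a single union cannot produce both "danny" and "dan" for the same text.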

I want to be able to capture all of the matches (which is why I thought to use .scan instead of .match), but I want to prioritize the closest/most exact matches first. If none are found, then I'd want to default to the partial matches. So the results would look like this:

'dan'.scan(match_values) # => ["dan"]
'danny'.scan(match_values) # => ["danny","dan"]
'daniel'.scan(match_values) # => ["daniel","dan"]
'dannnniel'.scan(match_values) # => ["dan"]
'dannyel'.scan(match_values) # => ["danny","dan"]

Is this possible?

What would you expect with "dannnniel"=~/.*/ which is a 'closer' match than "dannnniel"=~/Dan/i? — dawg, Commented Jul 9 at 15:47
What would be the desired return value if the string were "dan and daniel"? — Cary Swoveland, Commented Jul 9 at 19:26

Rajagopalan · Accepted Answer · 2024-07-09 03:03:26Z

match_array = [/Daniel/i, /Danny/i, /Dan/i]

def prioritized_scan(string, match_array)
  matches = []
  match_array.each do |pattern|
    string.scan(pattern) do |match|
      matches << match unless matches.include?(match)
    end
  end
  matches
end

p prioritized_scan('dan', match_array)
p prioritized_scan('danny', match_array)
p prioritized_scan('daniel', match_array)
p prioritized_scan('dannnniel', match_array)
p prioritized_scan('dannyel', match_array)

Output

["dan"]
["danny", "dan"]
["daniel", "dan"]
["dan"]
["danny", "dan"]
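
Note that this relies on match_array already being ordered from most to least specific. A possible variant, assuming that "most specific" can be approximated by the length of each pattern's source, sorts the patterns itself. The method name prioritized_scan_sorted is hypothetical.

```ruby
NAME_PATTERNS = [/Dan/i, /Danny/i, /Daniel/i]

def prioritized_scan_sorted(string, patterns)
  patterns
    .sort_by { |re| -re.source.length } # longest source first
    .flat_map { |re| string.scan(re) }  # every occurrence of each pattern
    .uniq                               # keep only the first appearance
end

p prioritized_scan_sorted('dannyel', NAME_PATTERNS)        # => ["danny", "dan"]
p prioritized_scan_sorted('dan and daniel', NAME_PATTERNS) # => ["daniel", "dan"]
```

Because scan returns the matched text as it appears in the string, "Dan" and "dan" would count as distinct entries; downcase the matches first if that is not wanted.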

GProst · Accepted Answer · 2024-07-09 03:22:08Z

I think you could do the following:

Sort your array of regexes by the length of each regex's source, longest first (unless you want to order it manually):

match_array = [/Dan/i, /Danny/i, /Daniel/i]
sorted_regexes = match_array.sort_by{|x| -x.source.length}

p sorted_regexes

Output:

[/Daniel/i, /Danny/i, /Dan/i]

Iterate over it to find matches (it will find the best match first as it will check the longest regexes first):

def find_matches(string, sorted_regexes)
  sorted_regexes.reduce([]) do |acc, regex|
    match = string.match(regex)
    acc.push(match[0]) if match
    acc
  end
end

p find_matches('dan', sorted_regexes)
p find_matches('danny', sorted_regexes)
p find_matches('daniel', sorted_regexes)
p find_matches('dannnniel', sorted_regexes)
p find_matches('dannyel', sorted_regexes)

Output:

["dan"]
["danny", "dan"]
["daniel", "dan"]
["dan"]
["danny", "dan"]
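
The same idea can be written more compactly. This is only a sketch, assuming Ruby 2.7 or later for filter_map; find_matches_compact is a hypothetical name. String#[] with a regex returns the first matched substring, or nil when there is no match, so filter_map drops the misses.

```ruby
SORTABLE_PATTERNS = [/Dan/i, /Danny/i, /Daniel/i]

def find_matches_compact(string, regexes)
  regexes
    .sort_by { |re| -re.source.length } # longest source first
    .filter_map { |re| string[re] }     # first match per regex, nils dropped
end

p find_matches_compact('Dannyel', SORTABLE_PATTERNS) # => ["Danny", "Dan"]
p find_matches_compact('bob', SORTABLE_PATTERNS)     # => []
```

Like match, this reports at most one hit per regex and keeps the capitalization found in the input.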

engineersmnky · Accepted Answer · 2024-07-09 19:40:39Z

While this does not use your list or a union, I thought I would provide another option that uses a subexpression call (\g<1>) to reuse the "root" pattern "dan": /(dan)?(\g<1>(?:iel|ny)?)/i

This assumes that each derivative should appear only once. For instance:

"dandan" will only show ["dan"] rather than ["dan","dan"]; and
"dandannydaniel" will be ["dan","danny","daniel"] rather than ["dan","dan","danny","dan","daniel"]

Example:

a = %w[dan
danny
daniel
dannnniel
dannyel
dandan
dandannydaniel]

a.map {|s| {s => s.scan(/(dan)?(\g<1>(?:iel|ny)?)/i).flatten.uniq} }
#=> [{"dan"=>["dan"]}, 
#    {"danny"=>["dan", "danny"]}, 
#    {"daniel"=>["dan", "daniel"]}, 
#    {"dannnniel"=>["dan"]}, 
#    {"dannyel"=>["dan", "danny"]}, 
#    {"dandan"=>["dan"]}, 
#    {"dandannydaniel"=>["dan", "danny", "daniel"]}]
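
A brief aside on \g<1>: it re-runs the pattern of group 1, whereas a backreference such as \1 must match the same text that group 1 captured. The small sketch below uses digits rather than names to make the difference visible; the constant names are illustrative only.

```ruby
BACKREF = /\A(\d)\1\z/    # second digit must repeat the captured digit
SUBCALL = /\A(\d)\g<1>\z/ # second digit only has to match \d again

p '11'.match?(BACKREF) # => true
p '12'.match?(BACKREF) # => false
p '11'.match?(SUBCALL) # => true
p '12'.match?(SUBCALL) # => true
```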

Thanks for the edit. I see I was inconsistent with the word boundary. I've now removed it. — Cary Swoveland, Commented Jul 10 at 19:47

Cary Swoveland · Accepted Answer · 2024-07-10 21:03:30Z

You could write

rgx = /^(?=(dan))(?=(daniel|danny))?/i

Then

["dan", "danny", "daniel", "dannnniel", "dannyel", "dannyboy", "dandan"].each do |str|
  puts "#{str}: #{str.scan(rgx)}"
end

displays

dan: [["dan", nil]]
danny: [["dan", "danny"]]
daniel: [["dan", "daniel"]]
dannnniel: [["dan", nil]]
dannyel: [["dan", "danny"]]
dannyboy: [["dan", "danny"]]
dandan: [["dan", nil]]
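
If a flat list with the longest match first is wanted, the nested arrays shown above can be post-processed. This is a sketch that takes the scan results as literal input; captures_longest_first is a hypothetical helper name.

```ruby
# Flatten the capture groups, drop unmatched (nil) groups, longest first.
def captures_longest_first(scan_result)
  scan_result.flatten.compact.uniq.sort_by { |m| -m.length }
end

p captures_longest_first([["dan", nil]])      # => ["dan"]
p captures_longest_first([["dan", "danny"]])  # => ["danny", "dan"]
p captures_longest_first([["dan", "daniel"]]) # => ["daniel", "dan"]
```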

Ruby demo | Regex demo

Note that, to make it self-documenting, I've expressed the regular expression at the "Regex demo" link in free-spacing mode.

edited Jul 10 at 21:03

answered Jul 9 at 20:57


dawg · Accepted Answer · 2024-07-09 22:37:35Z

You can do something like this:

match_array = [/Dan/i, /Danny/i, /Daniel/i]

strings=['dan','danny','daniel','dannnniel','dannyel']

p strings.
    map{|s| [s, match_array.filter{|m| s=~m}]}.to_h

Prints:

{"dan"=>[/Dan/i], 
 "danny"=>[/Dan/i, /Danny/i], 
 "daniel"=>[/Dan/i, /Daniel/i], 
 "dannnniel"=>[/Dan/i], 
 "dannyel"=>[/Dan/i, /Danny/i]}

And you can convert the regexes to strings of any case if desired:

p strings.
    map{|s| [s, match_array.filter{|m| s=~m}.
       map{|r| r.source.downcase}]}.to_h

Prints:

{"dan"=>["dan"], 
 "danny"=>["dan", "danny"], 
 "daniel"=>["dan", "daniel"], 
 "dannnniel"=>["dan"], 
 "dannyel"=>["dan", "danny"]}

Then, if 'closest' is equivalent to 'longest', just sort by the length of the regex source (i.e., Dan in the regex /Dan/i):

p strings.
    map{|s| [s, match_array.filter{|m| s=~m}.
        map{|r| r.source.downcase}.
            sort_by(&:length).reverse]}.to_h

Prints:

{"dan"=>["dan"], 
 "danny"=>["danny", "dan"], 
 "daniel"=>["daniel", "dan"], 
 "dannnniel"=>["dan"], 
 "dannyel"=>["danny", "dan"]}

But that only works with literal string matches. What would you expect with "dannnniel"=~/.*/ which is a 'closer' match than "dannnniel"=~/Dan/i?

Suppose instead that by 'closest' you mean the longest substring returned by the regex match, so that something like /.*/, which matches the entire string, is at least as long as any other match. You can do:

match_array = [/Dan/i, /Danny/i, /Daniel/i, /.{3}/, /.*/]

strings=['dan','danny','daniel','dannnniel','dannyel']

p strings.
    map{|s| [s, match_array.filter{|m| s=~m}.
        sort_by{|m| s[m].length}.reverse]}.to_h

Which now sorts on the length of the match vs the length of the regex:

{"dan"=>[/.*/, /.{3}/, /Dan/i], 
 "danny"=>[/.*/, /Danny/i, /.{3}/, /Dan/i],
 "daniel"=>[/.*/, /Daniel/i, /.{3}/, /Dan/i], 
 "dannnniel"=>[/.*/, /.{3}/, /Dan/i],
 "dannyel"=>[/.*/, /Danny/i, /.{3}/, /Dan/i]}

Note that if 'mundane' is appended to strings, the key-value pair "mundane"=>[/Dan/i] would be added to the hash. This primarily reflects the vagueness of the question. — Cary Swoveland, Commented Jul 11 at 0:46
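
One way to avoid the "mundane" case, if names should only be recognized at the start of a word, is to rebuild each pattern with a leading word boundary. This is a sketch under that assumption, not something the question specifies; at_word_start and names_in are hypothetical helpers.

```ruby
WORD_PATTERNS = [/Dan/i, /Danny/i, /Daniel/i]

# Hypothetical helper: same pattern and flags, but it may only begin at a word boundary.
def at_word_start(re)
  Regexp.new("\\b(?:#{re.source})", re.options)
end

# Return the original patterns whose anchored form matches the string.
def names_in(string, patterns)
  patterns.select { |re| at_word_start(re).match?(string) }
end

p names_in('mundane', WORD_PATTERNS)        # => []
p names_in('dan and daniel', WORD_PATTERNS) # => [/Dan/i, /Daniel/i]
```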
