Shortest simple regex matching a binary word

Question

Task

Define a simple regex as a non-empty regular expression consisting only of

characters 0 and 1,
grouping parentheses ( and ),
one-or-more repetition quantifier +.

Given a non-empty string of 0s and 1s, your program should find the shortest simple regex matching the full input string. (That is, when matching a simple regex, pretend it’s bookended by ^ and $.) If there are multiple shortest regexes, print any or all of them.)

code-golf, so the shortest submission (in bytes) wins.

Test cases

1 -> 1
00 -> 00 or 0+
010 -> 010
1110 -> 1+0
01010 -> 01010
0101010 -> 0(10)+ or (01)+0
011111 -> 01+
10110110 -> (1+0)+
01100110 -> (0110)+ or (01+0)+
010010010 -> (010)+
111100111 -> 1+001+ or 1+0+1+
00000101010 -> 0+(10)+ or (0+1)+0
1010110001 -> 1(0+1+)+ or (1+0+)+1

You should clarify that you want us to write a program that writes the regex, not write the regex ourself. But this looks interesting. — gcampbell, Commented Jun 7, 2016 at 18:57
In my testing, 01100110 is an interesting case... a naïve algorithm would write 01+0+1+0 or (0+1+)+0 which aren't optimal. — Neil, Commented Jun 7, 2016 at 21:07

Neil · Accepted Answer · 2016-06-08 10:05:22Z

JavaScript (ES6), 488 341 bytes

s=>[s.replace(/(.)\1+/g,'$1+'),...[...Array(60)].map((_,i)=>`(${(i+4).toString(2).slice(1)})+`),...[...Array(1536)].map((_,i)=>`${i>>10?(i>>8&1)+(i&2?'+':''):''}(${i&1}${i&4?i>>4&1:i&16?'+':''}${i&8?''+(i>>7&1)+(i&64?i>>5&1:i&32?'+':''):''})+${i&512?(i>>8&1)+(i&2?'+':''):''}`)].filter(r=>s.match(`^${r}$`)).sort((a,b)=>a.length-b.length)[0]

Explanation: Since six regexes can express all possible binary words, and the longest two are nine characters long, it suffices to check those and all shorter regexes. One candidate is obviously the string with "run length encoding" (i.e. all digit runs replaced with appropriate +s), but also strings with one set of ()s need to be checked. I generate 1596 such regexes (this includes duplicates and useless regexes but they'll just be eliminated) and test all 1597 to see which is the shortest match. The generated regexes fall into two types: $\d{2,5}$\+ (60 regexes) and (\d\+?)?$\d[\d+]?(\d[\d+]?)?$(\d\+?)? (1536 regexes as I avoid generating regexes with both leading and trailing digit).

@LeakyNun Originally I thought there were 4 regexes of length 9 but this is obviously incorrect so I've clarified my explanation. — Neil, Commented Jun 8, 2016 at 7:53

isaacg · Accepted Answer · 2016-06-09 03:38:19Z

Pyth, 20 bytes

hf.x}z:zT1Zy*4"()01+

This takes approximately 30 seconds to run, so it needs to be run offline.

Explanation:

hf.x}z:zT1Zy*4"()01+
                        Implicit: z is the input string.
              "()01+    "()01+"
            *4          Repeated 4 times
           y            All subsequences in length order
hf                      Output the first one such that
      :zT1              Form all regex matches of z with the candidate string
    }z                  Check if the input is one of the strings
  .x      Z             Discard errors

I'm not completely sure that every shortest string is a subsequence of "()01+" * 4, but 4 can be increased to 9 at no byte cost if needed.

Community · Accepted Answer · 2020-06-17 09:04:33Z

2

Pyth - 31 30 29 bytes

Brute force! Can probably golf the iterator a little.

 f=+Yf.x:zjY"^$")Z^"10+()"T1Y

Test Suite.

edited Jun 17, 2020 at 9:04

CommunityBot

1

answered Jun 8, 2016 at 2:17

Maltysen

25.7k4 gold badges54 silver badges127 bronze badges

Add a comment |

ezrast · Accepted Answer · 2016-06-08 02:35:50Z

Ruby, 109 bytes

It's the boring brute force approach. Works because no regex ever need be longer than 9 characters (as Neil notes) and no individual character needs to be repeated more than 4 times (trying it with '01()+'.chars*9 made my CPU unhappy).

10.times{|i|('01()+'.chars*4).combination(i).map{|s|begin
/^#{s*''}$/=~$*[0]&&[puts(s*''),exit]
rescue
end}}

$ for word in `grep -Po '^\S+' test_cases.txt`; do nice -n20 ruby sre.rb $word; done
1
0+
010
1+0
01010
0(10)+
01+
(1+0)+
(01+0)+
(010)+
1+0+1+
0+(10)+
1(0+1+)+

Sherlock9 · Accepted Answer · 2016-07-13 19:07:40Z

1

Python 3, 186 bytes

I'm investigating whether there is an approach to this problem besides brute-forcing, but here is a Python brute-force solution for now.

import re,itertools
def a(b):
 for z in range(10):
  for i in itertools.combinations("01()+"*4,z):
   j=''.join(i)
   try:
    if re.fullmatch(j,b)and len(j)<=len(b):return j
   except:1

answered Jul 13, 2016 at 19:07

Sherlock9

12.3k1 gold badge29 silver badges68 bronze badges

Add a comment |

Stack Exchange Network

Shortest simple regex matching a binary word

Task

Test cases

5 Answers 5

JavaScript (ES6), 488 341 bytes

Pyth, 20 bytes

Pyth - 31 30 29 bytes

Ruby, 109 bytes

Python 3, 186 bytes

Not the answer you're looking for? Browse other questions tagged
code-golf
regular-expression
metagolf
search
or ask your own question.

Linked

Hot Network Questions

Shortest simple regex matching a binary word

Task

Test cases

5 Answers 5

JavaScript (ES6), 488 341 bytes

Pyth, 20 bytes

Pyth - 31 30 29 bytes

Ruby, 109 bytes

Python 3, 186 bytes

Not the answer you're looking for? Browse other questions tagged code-golfregular-expressionmetagolfsearch or ask your own question.

Linked

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
code-golf
regular-expression
metagolf
search
or ask your own question.