20
\$\begingroup\$

Task

Define a simple regex as a non-empty regular expression consisting only of

  • characters 0 and 1,
  • grouping parentheses ( and ),
  • one-or-more repetition quantifier +.

Given a non-empty string of 0s and 1s, your program should find the shortest simple regex matching the full input string. (That is, when matching a simple regex, pretend it’s bookended by ^ and $.) If there are multiple shortest regexes, print any or all of them.)

, so the shortest submission (in bytes) wins.

Test cases

1 -> 1
00 -> 00 or 0+
010 -> 010
1110 -> 1+0
01010 -> 01010
0101010 -> 0(10)+ or (01)+0
011111 -> 01+
10110110 -> (1+0)+
01100110 -> (0110)+ or (01+0)+
010010010 -> (010)+
111100111 -> 1+001+ or 1+0+1+
00000101010 -> 0+(10)+ or (0+1)+0
1010110001 -> 1(0+1+)+ or (1+0+)+1
\$\endgroup\$
3
  • 3
    \$\begingroup\$ You should clarify that you want us to write a program that writes the regex, not write the regex ourself. But this looks interesting. \$\endgroup\$
    – gcampbell
    Commented Jun 7, 2016 at 18:57
  • 1
    \$\begingroup\$ In my testing, 01100110 is an interesting case... a naïve algorithm would write 01+0+1+0 or (0+1+)+0 which aren't optimal. \$\endgroup\$
    – Neil
    Commented Jun 7, 2016 at 21:07
  • \$\begingroup\$ Related. \$\endgroup\$
    – Leaky Nun
    Commented Jun 8, 2016 at 0:46

5 Answers 5

9
\$\begingroup\$

JavaScript (ES6), 488 341 bytes

s=>[s.replace(/(.)\1+/g,'$1+'),...[...Array(60)].map((_,i)=>`(${(i+4).toString(2).slice(1)})+`),...[...Array(1536)].map((_,i)=>`${i>>10?(i>>8&1)+(i&2?'+':''):''}(${i&1}${i&4?i>>4&1:i&16?'+':''}${i&8?''+(i>>7&1)+(i&64?i>>5&1:i&32?'+':''):''})+${i&512?(i>>8&1)+(i&2?'+':''):''}`)].filter(r=>s.match(`^${r}$`)).sort((a,b)=>a.length-b.length)[0]

Explanation: Since six regexes can express all possible binary words, and the longest two are nine characters long, it suffices to check those and all shorter regexes. One candidate is obviously the string with "run length encoding" (i.e. all digit runs replaced with appropriate +s), but also strings with one set of ()s need to be checked. I generate 1596 such regexes (this includes duplicates and useless regexes but they'll just be eliminated) and test all 1597 to see which is the shortest match. The generated regexes fall into two types: \(\d{2,5}\)\+ (60 regexes) and (\d\+?)?\(\d[\d+]?(\d[\d+]?)?\)(\d\+?)? (1536 regexes as I avoid generating regexes with both leading and trailing digit).

\$\endgroup\$
1
  • \$\begingroup\$ @LeakyNun Originally I thought there were 4 regexes of length 9 but this is obviously incorrect so I've clarified my explanation. \$\endgroup\$
    – Neil
    Commented Jun 8, 2016 at 7:53
2
\$\begingroup\$

Pyth, 20 bytes

hf.x}z:zT1Zy*4"()01+

This takes approximately 30 seconds to run, so it needs to be run offline.

Explanation:

hf.x}z:zT1Zy*4"()01+
                        Implicit: z is the input string.
              "()01+    "()01+"
            *4          Repeated 4 times
           y            All subsequences in length order
hf                      Output the first one such that
      :zT1              Form all regex matches of z with the candidate string
    }z                  Check if the input is one of the strings
  .x      Z             Discard errors

I'm not completely sure that every shortest string is a subsequence of "()01+" * 4, but 4 can be increased to 9 at no byte cost if needed.

\$\endgroup\$
2
\$\begingroup\$

Pyth - 31 30 29 bytes

Brute force! Can probably golf the iterator a little.

 f=+Yf.x:zjY"^$")Z^"10+()"T1Y

Test Suite.

\$\endgroup\$
1
\$\begingroup\$

Ruby, 109 bytes

It's the boring brute force approach. Works because no regex ever need be longer than 9 characters (as Neil notes) and no individual character needs to be repeated more than 4 times (trying it with '01()+'.chars*9 made my CPU unhappy).

10.times{|i|('01()+'.chars*4).combination(i).map{|s|begin
/^#{s*''}$/=~$*[0]&&[puts(s*''),exit]
rescue
end}}
$ for word in `grep -Po '^\S+' test_cases.txt`; do nice -n20 ruby sre.rb $word; done
1
0+
010
1+0
01010
0(10)+
01+
(1+0)+
(01+0)+
(010)+
1+0+1+
0+(10)+
1(0+1+)+
\$\endgroup\$
1
\$\begingroup\$

Python 3, 186 bytes

I'm investigating whether there is an approach to this problem besides brute-forcing, but here is a Python brute-force solution for now.

import re,itertools
def a(b):
 for z in range(10):
  for i in itertools.combinations("01()+"*4,z):
   j=''.join(i)
   try:
    if re.fullmatch(j,b)and len(j)<=len(b):return j
   except:1
\$\endgroup\$

Not the answer you're looking for? Browse other questions tagged or ask your own question.