Archive | July, 2011

Message Board #5 – regexp

5 Jul

Not much to report again. In the little bit of time I’ve managed to grab, I’ve been looking at the “make links automatically out of URLs in messages” task.

I’d already done this in PHP, and planned to do it in Java. After implementing the code in a getter on my Message class, I realised the serialization done by com.google.gson.Gson.toJson wasn’t using the getter (Gson reads the fields directly by reflection rather than calling getters). After stopping to consider, I decided it made more sense to do it in the code doing the actual display anyway, i.e. the JavaScript.

So, looking into JavaScript regular expressions and regular expressions for URLs, I found this: http://daringfireball.net/2010/07/improved_regex_for_matching_urls which looked pretty promising. But I had trouble converting it into a JavaScript regexp (I think mainly due to some non-ASCII characters, and the escaping of certain characters in JavaScript), and I had a feeling that it would actually be more lenient than I really want, e.g. I don’t want to match ftp.
So I resorted to just converting what I was doing in PHP, which is a bit inelegant but seems to work. It uses three separate replaces: the first caters for URLs starting with “http://” (but not www’s), the second strips off any “http://” from www’s, and the third sorts out the remaining URLs starting with “www.”. One bug I know of is that it seems to fail to match the URL after a “www” on its own. Also, with a bit of regexp skill I should be able to combine the three separate passes into one regular expression.
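For what it's worth, here is a rough sketch of how the three passes might collapse into one regular expression with a replace callback. This is not the code I'm actually running: the function name linkify is made up for illustration, the character class is just the PHP-derived one with the range written out explicitly, and it assumes only “http://” and “www.” prefixes should be linked (so ftp is left alone).

```javascript
// Hypothetical helper, not part of the message board code.
function linkify(text) {
  // One pass: match either "http://..." or "www...." (note the escaped dot).
  var urlPattern = /(?:http:\/\/|www\.)[a-zA-Z0-9|_+\-\/?&=.%:,~#]+/g;
  return text.replace(urlPattern, function (match) {
    // www addresses need a scheme added for the href.
    var href = match.indexOf("http://") === 0 ? match : "http://" + match;
    // Mirror the three-pass version: show "http://www.x" as just "www.x".
    var label = match.replace(/^http:\/\/(?=www\.)/, "");
    return '<a href="' + href + '">' + label + '</a>';
  });
}

var linked = linkify("blah www.bbc.co.uk blah http://www.bbc.co.uk blah");
// linked is now:
// blah <a href="http://www.bbc.co.uk">www.bbc.co.uk</a> blah <a href="http://www.bbc.co.uk">www.bbc.co.uk</a> blah
```

It does no HTML escaping and will happily swallow trailing punctuation (a full stop straight after a URL ends up inside the link), so treat it as a starting point only.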

As a quick test I used this JavaScript (I definitely need to sort out WordPress displaying code):


var data = "blah http://google.co.uk blah www.bbc.co.uk www www.mrdw.co.uk http://www.bbc.co.uk blah www.bbc.co.uk blah http://google.co.uk /mb mike gahsj asghja";
// The anchor and <p> markup below is assumed: the HTML in these strings was lost when posting.
// Pass 1: deal with non-www addresses that have a http://
var reg1 = new RegExp("http://[^w][^w][^w][a-zA-Z0-9|_+-/?&=.%:,~#]*","g");
var data1 = data.replace(reg1, '<a href="$&">$&</a>');
// Pass 2: strip any http://www down to just www so the next rule will work
var reg2 = new RegExp("http://www.","g");
var data2 = data1.replace(reg2, ' www.');
// Pass 3: link the remaining addresses starting with www
var reg3 = new RegExp("www.[a-zA-Z0-9|_+-/?&=.%:,~#]*","gim");
var data3 = data2.replace(reg3, '<a href="http://$&">$&</a>');
// Show the result of each pass
document.write("<p>"+data1+"</p>");
document.write("<p>"+data2+"</p>");
document.write("<p>"+data3+"</p>");
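A possible explanation for the bare “www” bug, assuming it comes from the pattern rather than from anything else on the page: the “.” after “www” in the third pattern is unescaped, so it matches any character, including the space after a lone “www”. The match then carries straight on into the next address. The small check below compares the pattern as written with an escaped-dot variant; the variable names loose and strict are only for illustration.

```javascript
// Pattern as used in pass 3: the "." after "www" matches ANY character.
var loose = new RegExp("www.[a-zA-Z0-9|_+-/?&=.%:,~#]*", "g");
// Same pattern with the dot escaped, so it only matches a literal ".".
var strict = new RegExp("www\\.[a-zA-Z0-9|_+-/?&=.%:,~#]*", "g");

var sample = "www www.mrdw.co.uk";

var looseMatches = sample.match(loose);
// → ["www www.mrdw.co.uk"]  (the lone "www", the space and the URL as one match)

var strictMatches = sample.match(strict);
// → ["www.mrdw.co.uk"]  (just the real address)
```

The same unescaped dot is in the “http://www.” pattern of the second pass, so it would probably want the same treatment.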