I am analyzing a collection of large (>150 MB) fixed-width data files. I've been slowly reading them in with read.fwf() in 100-line chunks (each row is 7385 characters), then pushing them into a relational database for further manipulation. The problem is that the text files occasionally contain a wonky multibyte character. For example, often enough to be annoying, the data file has, instead of a "U", whatever the system assigns to Unicode U+F8FF. (In OS X that's an Apple symbol, but I'm not sure whether that is a cross-platform standard.) When that happens, I get an error like this:

invalid multibyte string at 'NTY <20> MAINE
000008 [...]

That should have been the latter part of the word "COUNTY", but the U was, as described above, wonky. (Happy to provide more detailed code & data if anyone thinks they would be useful.)

I'd like to do all the coding in R, and I'm just not sure how to coerce the text to single-byte. Hence the subject-line part of my question: is there some easy way to coerce single-byte ASCII out of a text file that has some erroneous multibyte characters in it?

Or maybe there's an even better way to deal with this (should I be calling grep at the system level from R to hunt out the erroneous multibyte characters)?

Any help much appreciated!

  • U+F8FF is in fact designed to be not cross-platform. It's in a range explicitly designated as Private Use.
    – MSalters
    Commented Aug 15, 2013 at 6:56
  • Yes! Do you know of any way of stripping out (or replacing) those kinds of characters from a text file? Maybe I should do a byte-by-byte search and replace?
    – Don
    Commented Aug 15, 2013 at 12:12
  • iconv -c -f UTF-8 -t ASCII should do the trick, assuming those multibyte sequences are in fact UTF-8. Otherwise, -f ISO-8859-1 might work.
    – MSalters
    Commented Aug 15, 2013 at 13:35
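
For the byte-by-byte search asked about in the comments above, here is a minimal sketch, assuming a POSIX shell with the standard tr, wc, grep and cut utilities. The sample file and the variable names are hypothetical stand-ins for the real data, and the stray byte 0xF0 is only an illustration of "some byte above 0x7F", not a claim about what the actual files contain.

```shell
# Work on raw bytes, independent of the user's locale.
export LC_ALL=C

# Hypothetical sample: two records; the second has the byte 0xF0 where "U" belongs.
sample=$(mktemp)
printf 'COUNTY  ME\n' > "$sample"
printf 'CO\360NTY  ME\n' >> "$sample"

# Count bytes outside the ASCII range by deleting every byte in 0x00-0x7F
# and counting what is left.
non_ascii_count=$(tr -d '\000-\177' < "$sample" | wc -c | tr -d ' ')

# List the line numbers of records containing a byte that is neither
# printable ASCII nor whitespace (this also flags stray control bytes).
bad_lines=$(grep -n '[^[:print:][:space:]]' "$sample" | cut -d: -f1)

echo "non-ASCII bytes: $non_ascii_count"
echo "affected lines: $bad_lines"
```

A count of zero from the first check would suggest the file really is plain ASCII; anything else tells you which rows to inspect before deciding on a replacement.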

1 Answer

What does the output of the file command say about your data file?

/tmp >file a.txt b.txt 
a.txt: UTF-8 Unicode text, with LF, NEL line terminators
b.txt: ASCII text, with LF, NEL line terminators

You can try to convert/transliterate the file's contents using iconv. For example, given a file that uses the Windows 1252 encoding:

# \x{93} and \x{94} are Windows 1252 quotes
/tmp >perl -E'say "He said, \x{93}hello!\x{94}"' > a.txt 
/tmp >file a.txt
a.txt: Non-ISO extended-ASCII text
/tmp >cat a.txt 
He said, ?hello!?

Now, with iconv you can try to convert it to ASCII:

/tmp >iconv -f windows-1252 -t ascii a.txt 
He said, 
iconv: a.txt:1:9: cannot convert

This fails because ASCII has no direct equivalent for those quote characters. Instead, you can tell iconv to do a transliteration:

/tmp >iconv -f windows-1252 -t ascii//TRANSLIT a.txt  > converted.txt
/tmp >file converted.txt
converted.txt: ASCII text
/tmp >cat converted.txt 
He said, "hello!"

There might be a way to do this using R's IO layer, but I don't know R.

Hope that helps.

  • Here's what I get with one of the files throwing an error: KCRETA1978.DAT: ASCII text, with very long lines, with CRLF line terminators
    – Don
    Commented Aug 15, 2013 at 12:01
  • And if I try to convert from ascii, iconv tells me: "conversion from acii unsupported." Sigh. Thanks for the suggestions, though!
    – Don
    Commented Aug 15, 2013 at 12:10
  • @Don: That's probably because "conversion from ASCII" only works on something which is in fact ASCII. And ASCII runs from U+0000 to U+007F. You want to convert to ASCII, dropping unrecognized characters. That's done with iconv -c.
    – MSalters
    Commented Aug 15, 2013 at 13:30
  • @MSalters Thanks! That almost works. The problem is that the data is fixed width, so dropping faulty characters wrecks subsequent fields for that row. I'd really like to substitute something for the bad characters. Do you know of any way to search byte-by-byte for characters beyond U+007F? If so, I could do that for all the files and add in code that substitutes the correct characters.
    – Don
    Commented Aug 15, 2013 at 14:32
  • @MSalters I'm marking this as solved. It looks like I can do what you have suggested and, by adding the --byte-subst=formatstring option, also include a substitution that preserves the width of each row. Many thanks!
    – Don
    Commented Aug 15, 2013 at 14:42
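
The width-preserving substitution discussed in these comments can also be sketched with tr instead of the iconv --byte-subst option, which may not be available in every iconv build. This is an illustration under assumptions, not the method actually used in the thread: the sample file, the variable names, and the choice of "?" as placeholder are all hypothetical. It replaces each byte above 0x7F with exactly one byte, so a three-byte UTF-8 sequence would become three placeholders.

```shell
# Treat input as raw bytes, independent of the user's locale.
export LC_ALL=C

# Hypothetical sample: two 10-byte records; the second has 0xF0 where "U" belongs.
in_file=$(mktemp)
out_file=$(mktemp)
printf 'COUNTY  ME\n' > "$in_file"
printf 'CO\360NTY  ME\n' >> "$in_file"

# Map every byte in 0x80-0xFF to "?". One byte in, one byte out,
# so the fixed-width layout of each record is unchanged.
tr '\200-\377' '[?*]' < "$in_file" > "$out_file"

# Sanity checks: total size is unchanged and all records share one width.
in_size=$(wc -c < "$in_file" | tr -d ' ')
out_size=$(wc -c < "$out_file" | tr -d ' ')
widths=$(awk '{ print length($0) }' "$out_file" | sort -u)
fixed_line=$(sed -n '2p' "$out_file")

echo "sizes: $in_size -> $out_size, record widths: $widths"
echo "$fixed_line"
```

Using an obvious placeholder rather than guessing "U" keeps the damaged positions easy to find later, for example once the rows are in the database.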
