Question
I am trying to import a CSV file that is in Japanese. This code:
url <- 'http://www.mof.go.jp/international_policy/reference/itn_transactions_in_securities/week.csv'
x <- read.csv(url, header=FALSE, stringsAsFactors=FALSE)
returns the following error:
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) :
invalid multibyte string at '<91>ΊO<8b>y<82>ёΓ<e0><8f>،<94><94><84><94><83><8c>_<96>̏@(<8f>T<8e><9f><81>E<8e>w<92><e8><95>@<8a>փx<81>[<83>X<81>j'
I tried changing the encoding (Encoding(url) <- 'UTF-8', and also 'latin1') and tried removing the read.csv parameters, but received the same "invalid multibyte string" message in each case. Is there a different encoding that should be used, or is there some other problem?
Answer
Encoding sets the declared encoding of a character string. It doesn't set the encoding of the file that the string points to, which is what you want.
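To illustrate the point, here is a minimal sketch using a hypothetical example URL (not the one from the question). It only inspects the string's own encoding mark; nothing here touches a file.

```r
# Hypothetical URL, used only as an example string.
url <- "http://example.com/data.csv"

Encoding(url)             # declared encoding of the string itself
Encoding(url) <- "UTF-8"  # no effect here: pure-ASCII strings are never marked
enc_after <- Encoding(url)
enc_after                 # "unknown"
```

Because the URL is plain ASCII, the assignment changes nothing at all, and in any case it would only relabel the string, not the contents that read.csv later downloads.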
This worked for me, after trying "UTF-8":
x <- read.csv(url, header=FALSE, stringsAsFactors=FALSE, fileEncoding="latin1")
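A caveat worth checking: latin1 assigns a character to every byte, so it cannot raise the multibyte error, but it also does not decode Japanese text. The bytes in the error message (for example <83>X and <81>[) look like Shift_JIS, which is an inference from the error text and not something confirmed here. The sketch below compares the two encodings on a small temporary file built from known Shift_JIS bytes. It assumes a UTF-8 session and an iconv that knows the name "CP932"; the file and variable names are made up for the example.

```r
# "日本,1" plus a newline, as raw Shift_JIS bytes (0x93FA = 日, 0x967B = 本).
sjis_bytes <- as.raw(c(0x93, 0xFA, 0x96, 0x7B, 0x2C, 0x31, 0x0A))
tmp <- tempfile(fileext = ".csv")  # stand-in for the downloaded file
writeBin(sjis_bytes, tmp)

# latin1: reads without error, but the text column is not Japanese.
as_latin1 <- read.csv(tmp, header = FALSE, stringsAsFactors = FALSE,
                      fileEncoding = "latin1")

# CP932 (Microsoft's Shift_JIS variant): decodes the text column.
as_sjis <- read.csv(tmp, header = FALSE, stringsAsFactors = FALSE,
                    fileEncoding = "CP932")

print(as_latin1)
print(as_sjis)
unlink(tmp)
```

If the real file behaves the same way, passing fileEncoding="CP932" to the read.csv call above may give readable labels and reduce the clean-up needed.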
You may also want to skip the first 16 lines and read in the headers separately. Either way, there's still quite a bit of cleaning up to do.
x <- read.csv(url, header=FALSE, stringsAsFactors=FALSE,
              fileEncoding="latin1", skip=16)
# get started with the clean-up
x[,1] <- gsub("\u0081|`", "", x[,1])  # get rid of odd characters
# strip the thousands separators, then convert the other columns to numbers
x[,-1] <- as.data.frame(lapply(x[,-1],
    function(d) type.convert(gsub(",", "", d, fixed=TRUE), as.is=TRUE)))
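The number conversion can be tried without downloading anything. This is a small sketch on made-up data (the data frame `raw` and its column names are hypothetical); it strips the thousands separators and lets type.convert pick the column type.

```r
# Made-up data: a label column followed by numbers stored as text with commas.
raw <- data.frame(label = c("a", "b"),
                  v1 = c("1,234", "-56"),
                  v2 = c("7,890,123", "0"),
                  stringsAsFactors = FALSE)

# Remove the commas, then convert each remaining column.
# as.is = TRUE keeps anything non-numeric as character instead of factor.
raw[, -1] <- lapply(raw[, -1], function(d) {
  type.convert(gsub(",", "", d, fixed = TRUE), as.is = TRUE)
})

str(raw)
```

Columns that still contain stray non-numeric characters would stay as character, which makes them easy to spot for further cleaning.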