SOLVED: Extracting nth position from space-separated string in dplyr


I have a dataframe that looks something like this:

data <- data.frame(label = c('S', 'SH', 'S', 'S', 'SH'),
               word = c('sip', 'shoe', 'plaster', 'reception', 'reception'),
               word.segs = c('S IH1 P', 'SH UW1', 'P L AE1 S T AH0', 'R AH0 S EH1 P SH AH0 N', 'R AH0 S EH1 P SH AH0 N'),
               seg.index = c(1, 1, 4, 3, 6))

'word.segs' contains a phonetic transcription of the words in the 'word' column, and the value in 'seg.index' refers to the segment of interest - the nth segment in that transcription. What I want to do is to create two new columns containing the two segments after this, i.e. seg.index+1 and seg.index+2.

I've tried it with the following loop, which works but takes absolutely ages (I have 100k rows, so an efficient solution matters here):

for (x in 1:nrow(data)){
  data[x, ]$fol.seg = unlist(data$word.segs[x])[data[x, ]$seg.index+1]
  data[x, ]$fol.seg2 = unlist(data$word.segs[x])[data[x, ]$seg.index+2]
}

(note that I've also tried only unlisting once, saving this to a separate object and then extracting the two values of interest, but this doesn't appear to be significantly faster)

I also tried an alternative in dplyr in the hope that it might be more efficient:

data <- data %>%
  mutate(fol.seg = word.segs %>%
  strsplit(split = " ") %>%
  unlist() %>%

But I get the following error message, and I have no idea why it's not working:

Error in mutate_impl(.data, dots) : Evaluation error: length(n) == 1 is not TRUE.

Any help would be greatly appreciated!
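
One possible approach, offered as a minimal sketch rather than a definitive answer. Inside mutate() an expression sees the whole column at once, so strsplit() followed by unlist() flattens the segments of every row into one long vector, and seg.index + 1 is a vector with one entry per row rather than a single position. The error message is consistent with some step requiring a single position n. The sketch below splits each transcription once and then pairs row i with its own index using base R's mapply(). The helper name pick is made up for illustration, and stringsAsFactors = FALSE is an assumption so that word.segs is character (on a factor column, wrap it in as.character() first).

```r
data <- data.frame(
  label = c("S", "SH", "S", "S", "SH"),
  word = c("sip", "shoe", "plaster", "reception", "reception"),
  word.segs = c("S IH1 P", "SH UW1", "P L AE1 S T AH0",
                "R AH0 S EH1 P SH AH0 N", "R AH0 S EH1 P SH AH0 N"),
  seg.index = c(1, 1, 4, 3, 6),
  stringsAsFactors = FALSE  # assumption: keep word.segs as character
)

# Split every transcription once; the result is a list with one
# character vector of segments per row.
segs <- strsplit(data$word.segs, " ", fixed = TRUE)

# Hypothetical helper: take element i of one row's segments.
# Indexing past the end of a character vector gives NA.
pick <- function(pieces, i) pieces[i]

# mapply() walks the list and the index vector in parallel,
# so each row is matched with its own position.
data$fol.seg  <- mapply(pick, segs, data$seg.index + 1, USE.NAMES = FALSE)
data$fol.seg2 <- mapply(pick, segs, data$seg.index + 2, USE.NAMES = FALSE)

print(data[, c("word", "seg.index", "fol.seg", "fol.seg2")])
```

On the example data this gives fol.seg = IH1, UW1, T, EH1, AH0 and fol.seg2 = P, NA, AH0, P, N (the NA is for 'shoe', which has no second following segment). The same mapply() calls should also work inside mutate() if you prefer to stay in a dplyr pipeline. I have not timed this on 100k rows, but it avoids the repeated row-wise data frame assignment that usually makes such loops slow.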

Posted in S.E.F
via StackOverflow & StackExchange Atomic Web Robots
