regex - R: stringdist or levenshtein.distance to replace strings
I have a large dataset (~1 million observations), keyed by a defined observation type. Within the dataset there are ~900,000 observations with malformed observation types: ~850 (incorrect) variations of 50 acceptable observation types.
keys <- c("day", "evening", "sunset", "dusk", "night", "midnight",
          "twilight", "dawn", "sunrise", "morning")

entries <- c("day", "day", "sunset/dusk", "days", "dayy", "even",
             "evening", "early dusk", "late day", "nite", "red dawn",
             "evening sunset", "mid-night", "midnight", "midnite",
             "day", "evening", "sunset", "dusk", "night", "midnight",
             "twilight", "dawn", "sunrise", "morning")
Using gsub is akin to digging a basement with a hand shovel, and in my own case a broken-handled shovel: I'm new to R and to the intricacies of regular expressions. The simple fallback (for me) would be to write one gsub statement for each of the accepted observation types, but that seems unnecessarily arduous since it needs 50 statements.
I'd like to use levenshtein.distance or stringdist to replace each offending entry with the accepted string at the shortest distance. Running

for (i in 1:length(y)) { z[i] <- levenshtein.distance(y[i], x) }

doesn't work because it tries to pass length(x) results into z[i] for each y[i].
How do I return only the result with the minimum distance? I've seen function(x) x[2], which returns the 2nd result in a series, but how do I get the lowest one?
You could try:
library(stringdist)
m <- stringdistmatrix(entries, keys, method = "lv")
a <- keys[apply(m, 1, which.min)]
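To sanity-check the mapping before trusting it at scale, you could line the originals up against their corrections (just a quick inspection sketch):

head(data.frame(original = entries, corrected = a), 10)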
If you want to experiment with a different algorithm, have a look at ?'stringdist-metrics'.
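For instance, a Jaro-Winkler version would only change the method argument (a sketch; the p penalty value here is just illustrative, and whether "jw" suits your data is worth testing):

m_jw <- stringdistmatrix(entries, keys, method = "jw", p = 0.1)
a_jw <- keys[apply(m_jw, 1, which.min)]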
Or, as mentioned by @rhertel in the comments:
b <- keys[apply(adist(entries, keys), 1, which.min)]
From the adist() documentation:
Compute the approximate string distance between character vectors. The distance is a generalized Levenshtein (edit) distance, giving the minimal possibly weighted number of insertions, deletions and substitutions needed to transform one string into another.
The two methods yield identical results:

> identical(a, b)
#[1] TRUE
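For your full table, it may be cheaper to compute the mapping once over the ~850 unique malformed values and then index into it, rather than building a distance matrix across all ~1 million rows. A sketch, where df and its obs_type column are hypothetical names for your data:

library(stringdist)
u <- unique(df$obs_type)                 # ~850 distinct variants
fix <- keys[apply(stringdistmatrix(u, keys, method = "lv"), 1, which.min)]
names(fix) <- u                          # named lookup: variant -> accepted type
df$obs_type <- unname(fix[df$obs_type])  # map every row through the lookup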