regex - R: stringdist or levenshtein.distance to replace strings


I have a large dataset of ~1 million observations, keyed by a defined observation type. Within the dataset, there are ~900,000 observations with malformed observation types, spread across ~850 (incorrect) variations of the 50 acceptable observation types.

keys <- c("day", "evening", "sunset", "dusk", "night", "midnight", "twilight", "dawn", "sunrise", "morning")
entries <- c("day", "day", "sunset/dusk", "days", "dayy", "even", "evening", "early dusk", "late day", "nite", "red dawn", "evening sunset", "mid-night", "midnight", "midnite", "day", "evening", "sunset", "dusk", "night", "midnight", "twilight", "dawn", "sunrise", "morning")

Using gsub is akin to digging a basement with a hand shovel, and, in my own case, a broken-handled shovel, as I'm new to R and the intricacies of regular expressions. The simple fallback (for me) is to write one gsub statement for each of the accepted observation types, but that seems unnecessarily arduous since it needs 50 statements.

I'd like to use levenshtein.distance or stringdist to replace each offending entry with the key at the shortest distance. Running z <- for (i in length(y)) { z[i] = levenshtein.distance(y[i], x)} doesn't work because it tries to assign length(x) results to each single z[i].

How do I return the result with the minimum distance? I've seen function(x) x[2], which returns the 2nd result in a series, but how do I get the lowest?

You could try:

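library(stringdist)
# Distance matrix: one row per entry, one column per key
m <- stringdistmatrix(entries, keys, method = "lv")
# For each entry, pick the key with the smallest distance
a <- keys[apply(m, 1, which.min)]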

If you want to experiment with a different algorithm, have a look at ?'stringdist-metrics'.


Or, as mentioned by @rhertel in the comments:

b <- keys[apply(adist(entries, keys), 1, which.min)] 

From the adist() documentation:

Compute the approximate string distance between character vectors. The distance is a generalized Levenshtein (edit) distance, giving the minimal possibly weighted number of insertions, deletions and substitutions needed to transform one string into another.
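To see how far each entry actually is from its chosen key, the minimum distance can be kept alongside the match. The following is only a sketch using base R's adist(); the shortened entries vector and the max_dist cutoff are assumptions made for illustration and are not part of the original question.

```r
# Base R only: adist() lives in the utils package, which is attached by default.
keys <- c("day", "evening", "sunset", "dusk", "night", "midnight",
          "twilight", "dawn", "sunrise", "morning")
entries <- c("dayy", "mid-night", "nite", "sunset/dusk")  # small illustrative sample

d <- adist(entries, keys)          # one row per entry, one column per key
nearest <- apply(d, 1, which.min)  # column index of the closest key for each entry

result <- data.frame(
  entry    = entries,
  key      = keys[nearest],
  distance = apply(d, 1, min),     # the minimum distance itself
  stringsAsFactors = FALSE
)

# Hypothetical cutoff: leave entries that are too far from every key unmatched
max_dist <- 2
result$key[result$distance > max_dist] <- NA
print(result)
```

Note that which.min() returns the first minimum, so when an entry is equally close to two keys, the key that appears earlier in keys wins. Looking at the distance column is one way to spot entries that probably need manual review.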

The stringdistmatrix() and adist() approaches yield identical results:

> identical(a, b)
#[1] TRUE
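Since the question describes about a million observations but only a few hundred distinct spellings, it may be cheaper to compute distances on the unique values only and then map the result back onto the full column. The sketch below assumes a character vector named obs as a stand-in for the real observation-type column; that name and its contents are hypothetical.

```r
keys <- c("day", "evening", "sunset", "dusk", "night", "midnight",
          "twilight", "dawn", "sunrise", "morning")

# Hypothetical stand-in for the real column of observation types
obs <- c("day", "dayy", "days", "midnite", "dayy", "mid-night")

u <- unique(obs)                                     # distinct spellings only
lookup <- keys[apply(adist(u, keys), 1, which.min)]  # nearest key per spelling
names(lookup) <- u                                   # named vector: spelling -> key

cleaned <- unname(lookup[obs])                       # map back to every observation
print(cleaned)
```

The distance matrix then has one row per distinct spelling rather than one row per observation, and the final lookup is a plain named-vector index.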
