Generating matches for URL strings in Elasticsearch


I've been trying to come up with the right combination of tokenizers, token filters, and analyzers to leverage Elasticsearch to match URLs.

Unfortunately, it seems every approach I've taken so far misses one or two edge cases. I'm hoping someone out there can perhaps shed some light on the following:

If I have the following values stored in Elasticsearch:

  1. http://www.example111.com
  2. http://www.example111.com/cats
  3. http://www.example111.com/cats?type=tabby
  4. http://www.example111.com/cats/dogs
  5. http://www.example111.com/dogs/cats
  6. http://www.example222.com/cats
  7. http://www.example222.com
  8. http://www.example222.com/cats/dogs
  9. http://www.example333.com/fish

I'm wondering what query I could use to generate the following search string and result set combinations (ordered by relevance score):

The general idea being expressed here is that results are ranked by how similar they are to the input, all the way down to the TLD and scheme. Results are discarded when the entire query string doesn't match, or when a segment doesn't match.

How about this:

1). When you store the URLs, the URL data object looks like:

    {
        "tld" : "http://www.example111.com",
        "path" : "/cats",
        "qs" : "?type=birman"
    }

I don't think you want these analyzed... though that might require a bit more thought.
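As a rough sketch of what "not analyzed" could look like, here is a mapping written as a Python dict. It assumes Elasticsearch 5.0 or later, where the `keyword` type stores a string as a single unanalyzed term (older versions used a `string` field with `"index": "not_analyzed"` instead). The `url` object name is taken from the field names in the query further down; `URL_MAPPING` is just an illustrative name.

```python
# Hypothetical mapping for the url object described above.
# Assumption: Elasticsearch 5.0+, where "keyword" means the value is
# indexed verbatim as one term (no tokenizing, no lowercasing).
URL_MAPPING = {
    "properties": {
        "url": {
            "properties": {
                "tld": {"type": "keyword"},
                "path": {"type": "keyword"},
                "qs": {"type": "keyword"},
            }
        }
    }
}
```

With fields mapped this way, a `match` on `url.tld` only hits documents whose stored value is exactly the same string.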

2). When you have to query these records, parse the URL in the query.
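One way to do that parsing, sketched with Python's standard `urllib.parse`. The `split_url` helper is hypothetical, and treating an empty path as "/" and a missing query string as "" are my assumptions, not something the requirements spell out.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split a URL into the tld / path / qs pieces used in the stored object."""
    parts = urlsplit(url)
    return {
        # "tld" here follows the stored object: scheme plus host.
        "tld": f"{parts.scheme}://{parts.netloc}",
        # Assumption: a URL with no path is treated as "/".
        "path": parts.path or "/",
        # Assumption: keep the leading "?" so it matches the stored form.
        "qs": f"?{parts.query}" if parts.query else "",
    }


pieces = split_url("http://www.example111.com/cats/dogs?type=birman")
```

The same helper could be applied at index time so that stored and queried values are split identically.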

3). Concoct a query that fits the requirements, so:

  • The tld must match completely.
  • Paths in the results must be a substring of the path in the query URL. You can use a query-time analyzer to give all possible prefix substrings of the path in the query URL (so, for example, given "/cats/dogs", you would want "/", "/c", "/ca", ..., "/cats/dogs"), although that seems inefficient. Perhaps you can get the pieces "/", "/cats", "/cats/dogs" beforehand when creating the query, and have these represent additional clauses in the query.
  • Match the query string exactly? I'm not sure of the full requirements here.

The query might look like this (where the query URL is http://www.example111.com/cats/dogs?type=birman):
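Getting the path pieces beforehand could be sketched like this in Python. `path_prefixes` is a hypothetical helper; it works on whole segments only, so it produces "/", "/cats", "/cats/dogs" rather than every character-level prefix.

```python
def path_prefixes(path):
    """Return segment-level prefixes of a URL path, shortest first."""
    # Drop empty strings caused by leading, trailing, or doubled slashes.
    segments = [segment for segment in path.split("/") if segment]
    prefixes = ["/"]
    current = ""
    for segment in segments:
        current += "/" + segment
        prefixes.append(current)
    return prefixes


prefixes = path_prefixes("/cats/dogs")
```

Because the list runs from shortest to longest, its order can double as a boost ranking, with deeper matches scoring higher.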

    {
        "query" : {
            "bool" : {
                "must" : [
                    {
                        "match" : {
                            "url.tld" : "http://www.example111.com"
                        }
                    },
                    {
                        "match" : {
                            "url.qs" : "?type=birman"
                        }
                    }
                ],
                "should" : [
                    {
                        "match" : {
                            "url.path" : {
                                "query" : "/",
                                "boost" : 1
                            }
                        }
                    },
                    {
                        "match" : {
                            "url.path" : {
                                "query" : "/cats",
                                "boost" : 2
                            }
                        }
                    },
                    {
                        "match" : {
                            "url.path" : {
                                "query" : "/cats/dogs",
                                "boost" : 3
                            }
                        }
                    }
                ]
            }
        }
    }
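Rather than writing that body by hand, it could be assembled in code. A minimal Python sketch follows; `build_query` is a hypothetical helper, the list of path prefixes is passed in ready-made, and leaving out the `url.qs` clause when the query URL has no query string is my assumption.

```python
def build_query(tld, qs, prefixes):
    """Build the bool query: tld (and qs) must match, path prefixes should."""
    must = [{"match": {"url.tld": tld}}]
    if qs:
        # Assumption: only require a query-string match when one was given.
        must.append({"match": {"url.qs": qs}})
    # Longer prefixes come later in the list and therefore get a higher boost.
    should = [
        {"match": {"url.path": {"query": prefix, "boost": boost}}}
        for boost, prefix in enumerate(prefixes, start=1)
    ]
    return {"query": {"bool": {"must": must, "should": should}}}


body = build_query(
    "http://www.example111.com",
    "?type=birman",
    ["/", "/cats", "/cats/dogs"],
)
```

The resulting dict has the same shape as the JSON above and can be serialized with `json.dumps` or handed to whatever client you use.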

If you have multiple URLs per record, use nested objects and nested queries.

Anyway, that's one possible idea... though it's not the single convenient quick query you might have been hoping for.
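For the nested case, the query would get wrapped in a `nested` clause. A sketch in Python, assuming the URLs live in a field called `urls` that is mapped with type `nested` (both the field name and the `nested_url_query` helper are hypothetical):

```python
def nested_url_query(tld, nested_field="urls"):
    """Wrap a tld match in a nested query over a list of url objects."""
    return {
        "query": {
            "nested": {
                # Path of the nested field that holds the url objects.
                "path": nested_field,
                "query": {
                    "bool": {
                        # Fields inside a nested query use the full path.
                        "must": [{"match": {f"{nested_field}.tld": tld}}]
                    }
                },
            }
        }
    }


nested_body = nested_url_query("http://www.example111.com")
```

The path and query-string clauses would go inside the same inner `bool`, so that all conditions have to hold for a single URL object rather than across different ones.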

