Generating matches for URL strings in Elasticsearch
I've been trying to come up with the right combination of tokenizers, token filters, and analyzers to leverage Elasticsearch to match URLs.
Unfortunately, it seems every approach I've taken so far misses one or two edge cases. I'm hoping someone out there can perhaps shed some light on the following:
If I have the following values stored in Elasticsearch:
- http://www.example111.com
- http://www.example111.com/cats
- http://www.example111.com/cats?type=tabby
- http://www.example111.com/cats/dogs
- http://www.example111.com/dogs/cats
- http://www.example222.com/cats
- http://www.example222.com
- http://www.example222.com/cats/dogs
- http://www.example333.com/fish
I'm wondering what query I would use to generate the following search string and result set combinations (results are numbered by their position in the list above and ordered by relevance score):
http://www.example111.com/cats/dogs
[4,2,3,1]
The general idea being expressed here is that results are ranked by how similar they are to the input, all the way down to the TLD and scheme. Results are discarded when the entire query string doesn't match, or when a segment doesn't match.
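To make the expected behavior concrete, here is a minimal reference sketch in plain Python, with no Elasticsearch involved. The scoring rule is an assumption inferred from the single example above: one point per matching path segment, plus a small penalty when the stored URL's query string differs from the one in the search input. The names `score` and `rank` and the penalty value are hypothetical.

```python
from urllib.parse import urlsplit

# The stored values from the list above, numbered 1..9 by position.
STORED = [
    "http://www.example111.com",
    "http://www.example111.com/cats",
    "http://www.example111.com/cats?type=tabby",
    "http://www.example111.com/cats/dogs",
    "http://www.example111.com/dogs/cats",
    "http://www.example222.com/cats",
    "http://www.example222.com",
    "http://www.example222.com/cats/dogs",
    "http://www.example333.com/fish",
]


def segments(path):
    """Split a URL path into its non-empty segments."""
    return [s for s in path.split("/") if s]


def score(stored_url, query_url):
    """Return a relevance score, or None if the stored URL is discarded."""
    s, q = urlsplit(stored_url), urlsplit(query_url)
    # Scheme and host must match completely.
    if (s.scheme, s.netloc) != (q.scheme, q.netloc):
        return None
    s_segs, q_segs = segments(s.path), segments(q.path)
    # Every stored segment must match the query path, in order, from the start.
    if s_segs != q_segs[:len(s_segs)]:
        return None
    # Assumption: a differing query string lowers the score but does not discard.
    penalty = 0.5 if s.query != q.query else 0.0
    return 1 + len(s_segs) - penalty


def rank(query_url, stored=STORED):
    """Return 1-based positions of matching stored URLs, best match first."""
    hits = []
    for position, url in enumerate(stored, start=1):
        value = score(url, query_url)
        if value is not None:
            hits.append((value, position))
    return [position for value, position in sorted(hits, key=lambda h: -h[0])]


print(rank("http://www.example111.com/cats/dogs"))
```

Any Elasticsearch query that meets the requirement should produce the same ordering for this input, so a function like this can serve as a test oracle while experimenting with analyzers.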
How about this:
1). When you store URLs, the URL data object looks like:
{ "tld" : "http://www.example111.com", "path" : "/cats", "qs" : "?type=birman" }
I don't think you want these fields analyzed... that might require a bit more thought.
2). When you have a query for these records, parse the URL in the query.
3). Concoct a query that fits the requirements - so:
- The tld must match completely.
- Paths in the results must be a substring of the path in the query URL. You can use a query-time analyzer to give all possible prefix substrings of the path in the query URL (for example, given "/cats/dogs", you want "/", "/c", "/ca", ..., "/cats/dogs"), although that seems inefficient. Perhaps you can build the pieces "/", "/cats", "/cats/dogs" beforehand when creating the query, and have these represent additional clauses in the query.
- Match the query string exactly? I'm not sure of the full requirements here.
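Parsing the query URL and building the path pieces beforehand could be sketched as below. This is only an illustration using Python's standard `urllib.parse`; the helper names `split_url` and `path_prefixes` are hypothetical, and storing an empty path as "/" is an assumption.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split a URL into the tld / path / qs shape of the stored data object."""
    parts = urlsplit(url)
    return {
        # "tld" here means scheme plus host, as in the stored object.
        "tld": f"{parts.scheme}://{parts.netloc}",
        # Assumption: a URL without a path is treated as "/".
        "path": parts.path or "/",
        "qs": f"?{parts.query}" if parts.query else "",
    }


def path_prefixes(path):
    """Return segment-level prefixes, e.g. "/", "/cats", "/cats/dogs"."""
    prefixes = ["/"]
    current = ""
    for segment in path.split("/"):
        if segment:
            current += "/" + segment
            prefixes.append(current)
    return prefixes


url = split_url("http://www.example111.com/cats/dogs?type=birman")
print(url)
print(path_prefixes(url["path"]))
```

Working at segment level keeps the number of clauses equal to the path depth plus one, rather than one clause per character as with "/", "/c", "/ca", and so on.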
The query might look like this (where the query URL is http://www.example111.com/cats/dogs?type=birman):
{
  "query": {
    "bool": {
      "must": [
        { "match": { "url.tld": "http://www.example111.com" } },
        { "match": { "url.qs": "?type=birman" } }
      ],
      "should": [
        { "match": { "url.path": { "query": "/", "boost": 1 } } },
        { "match": { "url.path": { "query": "/cats", "boost": 2 } } },
        { "match": { "url.path": { "query": "/cats/dogs", "boost": 3 } } }
      ]
    }
  }
}
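A query of this shape could be assembled programmatically along these lines. This is a rough sketch that only builds the request body as a Python dict and does not talk to a cluster. The function name `build_url_query` and the `require_path_match` flag are hypothetical, and using the prefix's position as its boost simply mirrors the 1, 2, 3 pattern above.

```python
import json


def build_url_query(tld, path_prefixes, qs="", require_path_match=False):
    """Build a bool query body: exact tld/qs in must, boosted paths in should."""
    must = [{"match": {"url.tld": tld}}]
    if qs:
        must.append({"match": {"url.qs": qs}})

    # Longer prefixes get a higher boost, so deeper path matches rank first.
    should = [
        {"match": {"url.path": {"query": prefix, "boost": boost}}}
        for boost, prefix in enumerate(path_prefixes, start=1)
    ]

    bool_query = {"must": must, "should": should}
    if require_path_match:
        # Optional: require at least one path clause to match.
        bool_query["minimum_should_match"] = 1
    return {"query": {"bool": bool_query}}


query = build_url_query(
    "http://www.example111.com",
    ["/", "/cats", "/cats/dogs"],
    qs="?type=birman",
)
print(json.dumps(query, indent=2))
```

One thing to keep in mind: when a bool query has must clauses, the should clauses only influence the score by default. If records whose path matches none of the prefixes need to be excluded, setting `minimum_should_match` to 1 is one option worth testing.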
If you have multiple URLs per record, use nested objects and nested queries.
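For the multiple-URL case, wrapping the bool clauses in a nested query might look roughly like this. The field name `urls` is a hypothetical example and would need to be mapped with type nested. Using `score_mode` "max", so that a record is scored by its best-matching URL, is an assumption about the desired ranking.

```python
import json


def nest_url_query(bool_query, nested_path="urls", score_mode="max"):
    """Wrap bool clauses in a nested query against a nested field of URL objects."""
    return {
        "query": {
            "nested": {
                "path": nested_path,
                # Assumption: rank each record by its best-matching URL.
                "score_mode": score_mode,
                "query": {"bool": bool_query},
            }
        }
    }


# Inner clauses use the full field path, e.g. "urls.tld" rather than "url.tld".
inner = {
    "must": [{"match": {"urls.tld": "http://www.example111.com"}}],
    "should": [
        {"match": {"urls.path": {"query": "/", "boost": 1}}},
        {"match": {"urls.path": {"query": "/cats", "boost": 2}}},
    ],
}
nested_query = nest_url_query(inner)
print(json.dumps(nested_query, indent=2))
```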
Anyway, that's one possible idea... it's not the single convenient quick query you might have been hoping for.