6.2.2. 相似轨迹匹配(按匹配条件个数排序)

1. 匹配算法

Lsql的默认打分机制是根据TF/IDF算法,即词频算法。

  • TF:分词项在文档中出现的次数(term frequency)。
  • IDF:代表分词项在多少个文档中出现(inverse document frequency)。

但在进行轨迹匹配时需要按照匹配的条件个数排序,匹配个数多的排在前面。

2. 创建实例

create table geo_match(
id y_string_id,
geoname y_string_id,
geodata y_geopoint_idmp
);

3. 导入数据

测试数据如下所示:

{"tablename":"geo_match","partition":"test","id":1,
"geoname":"102.72242222037083,28.88523027458959 110.72242222037083,28.88523027458959 99.72242222037083,28.88523027458959 ",
"geodata@geo5000":"102.72242222037083,28.88523027458959 110.72242222037083,28.88523027458959 99.72242222037083,28.88523027458959 "}
{"tablename":"geo_match","partition":"test","id":2,
"geoname":"102.72242222037083,28.88523027458959 110.72242222037083,28.88523027458959 90.72242222037083,28.88523027458959 ",
"geodata@geo5000":"102.72242222037083,28.88523027458959 110.72242222037083,28.88523027458959 90.72242222037083,28.88523027458959 "}
{"tablename":"geo_match","partition":"test","id":3,
"geoname":"102.72242222037083,28.88523027458959 120.72242222037083,28.88523027458959 90.72242222037083,28.88523027458959 ",
"geodata@geo5000":"102.72242222037083,28.88523027458959 120.72242222037083,28.88523027458959 90.72242222037083,28.88523027458959 "}
{"tablename":"geo_match","partition":"test","id":4,
"geoname":"100.72242222037083,28.88523027458959 130.72242222037083,28.88523027458959 90.72242222037083,28.88523027458959 ",
"geodata@geo5000":"100.72242222037083,28.88523027458959 130.72242222037083,28.88523027458959 90.72242222037083,28.88523027458959 "}

./load.sh -t geo_match -p test -tp json -local -f /wyh/testMatch.json

导入数据后全表查询结果如下:

select * from geo_match  where partition like '%' limit 10;

4. 轨迹匹配检索

(1) 描点

​ 轨迹匹配,需要对导入的两个经纬度之间,进行描点,以体现出轨迹匹配,lsql默认在两个经纬度点之间从两头向中间最多描100个点。

注意:描完点后的经纬度的值,会很多,膨胀率较高,一般仅用于轨迹匹配,不要做其他用途。另外描点距离越近,精度越高,但膨胀率越高。

​ 如果要对导入的经纬度数据进行进行描点,需要导入数据的时候在列名字上做特殊处理。

l Kafka导入

原先写法:

{
'table':'geo_match'
,'partition':'default'
,'id':'1'
,'geoname':'test'
,'geodata':'100.72242222037083,28.88523027458959 130.40751616748125,31.369586323194188 90.71391535438913,29.393961382032614'
}

按照5000米描点的写法

{
'table':'geo_match '
,'partition':'default'
,'id':'1'
,'geoname':'test'
,'geodata@geo5000':'100.72242222037083,28.88523027458959 130.40751616748125,31.369586323194188 90.71391535438913,29.393961382032614'
}

l 离线导入

json写法与上述Kafka导入类型,不再细述。

分隔符方式的导入以@分隔符为例

原先写法:

./load.sh -t geo_match -p default -tp txt -fl id,geoname,geodata -f /load /1.log -sp @

按照5000米描点写法:

./load.sh -t geo_match -p default -tp txt -fl id,geoname,geodata@geo5000 -f /load/1.log -sp 0x09
(2) 普通排序语法

radius为匹配半径,meters为描点距离,radius通常来说大于等于meters

percent:表示最小匹配的点数,小于该值的记录不会被检索出来

select * from geo_match where partition like '%' and  
    SYS_JSON_QUERY@{
      query_type:"geo"
      ,geo_type:"track"
      ,field:"geodata"
      ,list:["100.72242222037083,28.88523027458959","130.72242222037083,28.88523027458959","90.72242222037083,28.88523027458959"]
      ,radius:10000
      ,meters:5000
      ,percent:20
    }@SYS_JSON_QUERY 
 limit 20;

查询结果如下图所示:

(3) 按匹配条件个数排序语法
select id, geoname,geodata, cl_score_y_double_d from geo_match where partition like '%' and 
 SYS_SCORE_QUERY@
    SYS_JSON_QUERY@{
      query_type:"geo"
      ,geo_type:"track"
      ,field:"geodata"
      ,list:["100.72242222037083,28.88523027458959","130.72242222037083,28.88523027458959","90.72242222037083,28.88523027458959"]
      ,radius:10000
      ,meters:5000
      ,percent:20
    }@SYS_JSON_QUERY 
@SYS_SCORE_QUERY
order by cl_score_y_double_d desc
 limit 20;

查询结果如下图所示:

记录匹配condition个数越多的得分越高,排名越靠前,会首先返回,而普通排序功能无此效果。

Copyright © lucene.xin 2020 all right reserved修改时间: 2021-07-02 11:42:23

results matching ""

    No results matching ""