6.2.2. 相似轨迹匹配(按匹配条件个数排序)
1. 匹配算法
Lsql的默认打分机制是根据TF/IDF算法,即词频算法。
- TF:分词项在文档中出现的次数(term frequency)。
- IDF:代表分词项在多少个文档中出现(inverse document frequency)。
但在进行轨迹匹配时需要按照匹配的条件个数排序,匹配个数多的排在前面。
2. 创建实例
create table geo_match(
id y_string_id,
geoname y_string_id,
geodata y_geopoint_idmp
);
3. 导入数据
测试数据如下所示:
{"tablename":"geo_match","partition":"test","id":1,
"geoname":"102.72242222037083,28.88523027458959 110.72242222037083,28.88523027458959 99.72242222037083,28.88523027458959 ",
"geodata@geo5000":"102.72242222037083,28.88523027458959 110.72242222037083,28.88523027458959 99.72242222037083,28.88523027458959 "}
{"tablename":"geo_match","partition":"test","id":2,
"geoname":"102.72242222037083,28.88523027458959 110.72242222037083,28.88523027458959 90.72242222037083,28.88523027458959 ",
"geodata@geo5000":"102.72242222037083,28.88523027458959 110.72242222037083,28.88523027458959 90.72242222037083,28.88523027458959 "}
{"tablename":"geo_match","partition":"test","id":3,
"geoname":"102.72242222037083,28.88523027458959 120.72242222037083,28.88523027458959 90.72242222037083,28.88523027458959 ",
"geodata@geo5000":"102.72242222037083,28.88523027458959 120.72242222037083,28.88523027458959 90.72242222037083,28.88523027458959 "}
{"tablename":"geo_match","partition":"test","id":4,
"geoname":"100.72242222037083,28.88523027458959 130.72242222037083,28.88523027458959 90.72242222037083,28.88523027458959 ",
"geodata@geo5000":"100.72242222037083,28.88523027458959 130.72242222037083,28.88523027458959 90.72242222037083,28.88523027458959 "}
./load.sh -t geo_match -p test -tp json -local -f /wyh/testMatch.json
导入数据后全表查询结果如下:
select * from geo_match where partition like '%' limit 10;
4. 轨迹匹配检索
(1) 描点
轨迹匹配,需要对导入的两个经纬度之间,进行描点,以体现出轨迹匹配,lsql默认在两个经纬度点之间从两头向中间最多描100个点。
注意:描完点后的经纬度的值,会很多,膨胀率较高,一般仅用于轨迹匹配,不要做其他用途。另外描点距离越近,精度越高,但膨胀率越高。
如果要对导入的经纬度数据进行进行描点,需要导入数据的时候在列名字上做特殊处理。
l Kafka导入
原先写法:
{
'table':'geo_match'
,'partition':'default'
,'id':'1'
,'geoname':'test'
,'geodata':'100.72242222037083,28.88523027458959 130.40751616748125,31.369586323194188 90.71391535438913,29.393961382032614'
}
按照5000米描点的写法
{
'table':'geo_match '
,'partition':'default'
,'id':'1'
,'geoname':'test'
,'geodata@geo5000':'100.72242222037083,28.88523027458959 130.40751616748125,31.369586323194188 90.71391535438913,29.393961382032614'
}
l 离线导入
json写法与上述Kafka导入类型,不再细述。
分隔符方式的导入以@分隔符为例
原先写法:
./load.sh -t geo_match -p default -tp txt -fl id,geoname,geodata -f /load /1.log -sp @
按照5000米描点写法:
./load.sh -t geo_match -p default -tp txt -fl id,geoname,geodata@geo5000 -f /load/1.log -sp 0x09
(2) 普通排序语法
radius为匹配半径,meters为描点距离,radius通常来说大于等于meters
percent:表示最小匹配的点数,小于该值的记录不会被检索出来
select * from geo_match where partition like '%' and
SYS_JSON_QUERY@{
query_type:"geo"
,geo_type:"track"
,field:"geodata"
,list:["100.72242222037083,28.88523027458959","130.72242222037083,28.88523027458959","90.72242222037083,28.88523027458959"]
,radius:10000
,meters:5000
,percent:20
}@SYS_JSON_QUERY
limit 20;
查询结果如下图所示:
(3) 按匹配条件个数排序语法
select id, geoname,geodata, cl_score_y_double_d from geo_match where partition like '%' and
SYS_SCORE_QUERY@
SYS_JSON_QUERY@{
query_type:"geo"
,geo_type:"track"
,field:"geodata"
,list:["100.72242222037083,28.88523027458959","130.72242222037083,28.88523027458959","90.72242222037083,28.88523027458959"]
,radius:10000
,meters:5000
,percent:20
}@SYS_JSON_QUERY
@SYS_SCORE_QUERY
order by cl_score_y_double_d desc
limit 20;
查询结果如下图所示:
记录匹配condition个数越多的得分越高,排名越靠前,会首先返回,而普通排序功能无此效果。