JP2007072752A

JP2007072752A - Similar time series data calculation method, similar time series data calculation device, and similar time series data calculation program

Info

Publication number: JP2007072752A
Application number: JP2005259010A
Authority: JP
Inventors: Yasuhiro Fujiwara; 靖宏藤原; Yasushi Sakurai; 保志櫻井; Masashi Yamamuro; 雅司山室
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2005-09-07
Filing date: 2005-09-07
Publication date: 2007-03-22

Abstract

【課題】時系列データの類否判定における探索漏れを防ぐとともに過剰探索を防ぐ。また、任意の長さのクエリの時系列データにも対応可能とする。
【解決手段】距離計算部４において、距離の下限値が閾値以下の場合に類似として判断することで探索漏れの発生を防ぐとともに、距離の上限値が閾値以下の場合に類似として判断することで過剰探索の発生を防ぐ。また、次元圧縮部３におけるセグメント管理部３２により、時系列データを指数的に長さの異なる複数のセグメントに分割して、次元圧縮係数計算部３３により各セグメントにおける平均値と標準偏差を求めることで、任意の長さのクエリの時系列データにも対応可能とする。
【選択図】図１
An object of the present invention is to prevent a search omission in determination of similarity of time series data and prevent an excessive search. It is also possible to deal with time-series data of queries having an arbitrary length.
In the distance calculation unit 4, it is determined that the lower limit value of the distance is equal to or less than the threshold value to prevent the occurrence of search omission, and the similarity is determined to be similar when the upper limit value of the distance is equal to or less than the threshold value. Prevent over search. Further, the segment management unit 32 in the dimension compression unit 3 divides the time series data into a plurality of segments having exponentially different lengths, and the dimension compression coefficient calculation unit 33 obtains an average value and a standard deviation in each segment. Thus, it is possible to deal with time-series data of queries of arbitrary length.
[Selection] Figure 1

Description

本発明は、複数の時系列データの中で類似したものを求める方法、装置、プログラムに関する。 The present invention relates to a method, an apparatus, and a program for obtaining similar data among a plurality of time series data.

類似した時系列データを高速に求める処理は様々な分野で利用される。例えば株のオンライントレーディングシステムは大量の株価を監視して、値動きが類似した会社を高速に検索する。移動体位置管理システムは走行している大量の車をセンシングして、似た移動軌跡をしているものを高速に検知する。地震監視システムは大量に配置された地震計からの情報をもとに地震発生時の揺れ方が類似した地点を高速に見つける。 Processing for obtaining similar time-series data at high speed is used in various fields. For example, a stock online trading system monitors a large amount of stock prices and searches for companies with similar price movements at high speed. The moving body position management system senses a large number of traveling vehicles, and detects those having a similar movement track at high speed. The seismic monitoring system finds a point where the way of shaking at the time of an earthquake resembles at high speed based on information from a large number of seismometers.

本願では、複数の時系列データのうち、探索時刻から所定の長さのクエリに対して類似した時系列データの組み合わせを探索する問題を扱う。この問題は、時系列データのデータポイント数をｎ、クエリの長さをｌ、閾値をε、時系列データＸ＝（ｘ₁，ｘ₂，...，ｘ_n）と時系列データＹ＝（ｙ₁，ｙ₂，...，ｙ_n）のユークリッド距離をＤ（Ｘ，Ｙ）とすると以下の条件を満たす時系列データの組み合わせの集合として表現できる。

The present application deals with a problem of searching for a combination of time series data similar to a query having a predetermined length from a search time among a plurality of time series data. The problem is that the number of data points in the time series data is n, the length of the query is l, the threshold is ε, the time series data X = (x ₁ , x ₂ ,..., X _n ) and the time series data Y = If the Euclidean distance of (y ₁ , y ₂ ,..., Y _n ) is D (X, Y), it can be expressed as a set of combinations of time series data satisfying the following conditions.

一般に、Ｄ（Ｘ，Ｙ）を求めるには高い計算コストがかかる。これは、時系列データが多数のデータポイントを有するためである。このため、時系列データの距離を計算する際には次元圧縮手法が用いられてきた。代表的な次元圧縮手法として、離散フーリエ変換（非特許文献１参照）、離散ウェブレット変換（非特許文献２参照）、特異値分解（非特許文献３参照）などがある。これらの次元圧縮手法は、探索漏れ（類似している時系列データを類似していないと判定すること）がない特徴がある。この探索漏れがない性質はlower bounding lemmaより保証される。lower bounding lemmaとは、次元圧縮後の時系列データをＬ（Ｘ’，Ｙ’）としたときに、Ｌ（Ｘ’，Ｙ’）≦Ｄ（Ｘ，Ｙ）が成り立てば探索漏れが発生しないという補助定理である。すなわち、次元圧縮後の距離が次元圧縮前の距離の下限値であれば探索漏れは発生しないことになる。 In general, high calculation cost is required to obtain D (X, Y). This is because time series data has a large number of data points. For this reason, a dimension compression method has been used when calculating the distance of time series data. Typical dimensional compression methods include discrete Fourier transform (see Non-Patent Document 1), discrete weblet transform (see Non-Patent Document 2), singular value decomposition (see Non-Patent Document 3), and the like. These dimension compression methods have a feature that there is no search omission (determining that similar time-series data is not similar). This search-free property is guaranteed by lower bounding lemma. The lower bounding lemma means that when L (X ′, Y ′) ≦ D (X, Y) holds when the time-series data after dimension compression is L (X ′, Y ′), no search leakage occurs. This is the lemma. That is, if the distance after dimensional compression is the lower limit value of the distance before dimensional compression, no search omission occurs.

なお、ここで扱う時系列データは株価のような離散値と、移動軌跡のような連続値がある。時系列データが離散値である場合、時系列データが類似しているか否かは離散値を用いて時系列データ間の距離が閾値内になるか否かを調べればよい。また、時系列データが連続値である場合は、時系列データは処理する際にサンプリングされ離散値になるため、結局は離散値の処理手法を用いることになる。
R.Agrawl, C.Faloutsos, and A.N.Swami. Efficient Similarity Search In Sequence Databases. In Proc. FODO, 1993 K.Chan, A.W.Fu. Efficient Time Series Matching by Wavelets. In Proc. ICDE, 1999 F.Korn, H.V.Jagadish, C.Faloutsos. Efficient Supporting Ad Hoc Queries in Large Datasets of Time Sequences. In Proc. SIGMOD, 1997 The time series data handled here includes discrete values such as stock prices and continuous values such as movement trajectories. If the time-series data is a discrete value, whether the time-series data is similar may be determined by checking whether the distance between the time-series data is within the threshold using the discrete values. Further, when the time series data is a continuous value, the time series data is sampled at the time of processing and becomes a discrete value, so that a discrete value processing method is eventually used.
R. Agrawl, C. Faloutsos, and ANSwami. Efficient Similarity Search In Sequence Databases. In Proc. FODO, 1993 K. Chan, AWFu. Efficient Time Series Matching by Wavelets. In Proc. ICDE, 1999 F. Korn, HVJagadish, C. Faloutsos. Efficient Supporting Ad Hoc Queries in Large Datasets of Time Sequences. In Proc. SIGMOD, 1997

従来の次元圧縮手法は以下の２つの点で問題がある。 The conventional dimension compression method has problems in the following two points.

１つ目は探索漏れがない性質しか有しないことである。時系列データの類似探索の応用を考えれば、過剰探索（類似していないシーケンスを類似していると判定すること）が発生しない性質も必要である。 The first is that it has only the property of no search omission. Considering the application of the similar search of time series data, it is necessary to have the property that an excessive search (determining a dissimilar sequence as similar) does not occur.

２つ目は特定の長さのクエリの時系列データにしか対応できないことである。時系列データの類似探索の応用を考えれば、任意の長さのクエリの時系列データにも対応できる必要がある。 The second is that it can only deal with time-series data of queries of a specific length. Considering the application of time series data similarity search, it is necessary to be able to deal with query time series data of any length.

本発明は、上記に鑑みてなされたものであり、その課題とするところは、第１に、探索漏れを防ぐとともに過剰探索を防ぐことにある。第２に、任意の長さのクエリの時系列データにも対応可能とすることにある。 The present invention has been made in view of the above, and the object of the present invention is firstly to prevent search omission and excessive search. Secondly, it is possible to deal with time-series data of queries having an arbitrary length.

第１の本発明に係る類似時系列データ計算方法は、次元圧縮係数計算手段により、類否判断の対象となる一対の時系列データをメモリから読み出し、それぞれについて次元圧縮係数として平均値と標準偏差を求めてメモリに格納するステップと、上限値・下限値計算手段により、平均値と標準偏差をメモリから読み出し、これらを用いて時系列データの距離の上限値および下限値を求めてメモリに格納するステップと、類否判断手段により、上限値、下限値、閾値をメモリから読み出し、下限値が閾値以下でかつ上限値が閾値以下の場合に一対の時系列データは類似であると判断するステップと、を有することを特徴とする。 In the similar time series data calculation method according to the first aspect of the present invention, a pair of time series data to be subjected to similarity determination is read from a memory by a dimension compression coefficient calculation means, and an average value and a standard deviation are respectively obtained as dimension compression coefficients. The average value and the standard deviation are read from the memory by the step of calculating and storing in the memory and the upper limit / lower limit calculation means, and using these, the upper and lower limits of the distance of the time series data are obtained and stored in the memory. And a step of reading the upper limit value, the lower limit value, and the threshold value from the memory by the similarity determination means, and determining that the pair of time series data is similar when the lower limit value is equal to or less than the threshold value and the upper limit value is equal to or less than the threshold value. It is characterized by having.

本発明にあっては、距離の下限値が閾値以下の場合に類似として判断することで、lower bounding lemmaの補助定理により探索漏れがない性質を保証するとともに、距離の上限値が閾値以下の場合に類似として判断することで、upper bounding lemmaの補助定理により過剰探索がない性質を保証する。 In the present invention, by determining that the lower limit value of the distance is equal to or less than the threshold value, it is ensured that there is no search omission according to the lower bounding lemma lemma, and the upper limit value of the distance is equal to or less than the threshold value. Judging by the upper bounding lemma lemma, we guarantee that there is no excessive search.

上記類似時系列データ計算方法は、前記類否判断手段により、下限値が閾値以下でかつ上限値が閾値よりも大きいと判断された場合に、厳密距離計算手段により、次元圧縮前の一対の時系列データをメモリから読み出し、両者の距離を求めてメモリに格納するステップと、類否判断手段により、次元圧縮前の一対の時系列データの距離、閾値をメモリから読み出し、当該距離が閾値以下の場合に一対の時系列データを類似であると判断するステップと、を有することを特徴とする。 In the similar time series data calculation method, when the similarity determination unit determines that the lower limit value is less than or equal to the threshold value and the upper limit value is greater than the threshold value, the exact distance calculation unit calculates the pair of times before dimension compression. The sequence data is read from the memory, the distance between the two is obtained and stored in the memory, and the similarity determination means reads the distance and threshold of the pair of time series data before dimension compression from the memory, and the distance is less than the threshold. Determining that the pair of time-series data is similar to each other.

本発明にあっては、下限値との関係で類似と判断された場合であっても上限値との関係で非類似と判断された場合には、次元圧縮後の時系列データからでは正確な類否判断ができなかったものとして、次元圧縮前の一対の時系列データを用いて両者の距離を求め、この距離と閾値とを比較することで、より厳密な類否判断を行う。 In the present invention, even if it is determined to be similar in relation to the lower limit value, if it is determined to be dissimilar in relation to the upper limit value, it is accurate from the time-series data after dimension compression. It is assumed that the similarity cannot be determined, the distance between the two is obtained using a pair of time-series data before dimension compression, and the distance is compared with a threshold value, thereby making a more exact similarity determination.

上記類似時系列データ計算方法は、セグメント管理手段により、時系列データを指数的に長さの異なる複数のセグメントに分割するステップと、前記次元圧縮係数計算手段により、時系列データの各セグメントにおける平均値と標準偏差を求めるステップと、を有することを特徴とする。 The similar time series data calculation method includes a step of segmenting time series data into a plurality of segments having exponentially different lengths by a segment management means, and an average for each segment of time series data by the dimension compression coefficient calculation means. Obtaining a value and a standard deviation.

本発明にあっては、時系列データを指数的に長さの異なる複数のセグメントに分割して、各セグメントにおける平均値と標準偏差を求めることで、任意の長さのクエリの時系列データにも対応可能とする。 In the present invention, the time-series data is divided into a plurality of segments having exponentially different lengths, and the average value and the standard deviation in each segment are obtained, whereby the time-series data of a query of an arbitrary length is obtained. Is also available.

上記類似時系列データ計算方法は、セグメント計数手段により、特定の長さのセグメントの個数を計数するステップと、セグメント統合手段により、計数した個数が一定値より大きくなった場合にその長さのセグメントを２つ統合するステップと、前記次元圧縮係数計算手段により、統合後のセグメントにおける平均値と標準偏差を求めるステップと、を有することを特徴とする。 The similar time-series data calculation method includes a step of counting the number of segments having a specific length by the segment counting means, and a segment having the length when the counted number exceeds a predetermined value by the segment integration means. And a step of obtaining an average value and a standard deviation in the segment after the integration by the dimension compression coefficient calculation means.

本発明にあっては、計数した特定の長さのセグメントの個数が一定値より大きい場合にその長さのセグメントを２つ統合し、統合後のセグメントにおける平均値と標準偏差を求めることで、次元圧縮の特徴量の更新を可能とする。 In the present invention, when the number of segments of a specific length counted is larger than a certain value, two segments of that length are integrated, and an average value and a standard deviation in the segment after integration are obtained, The feature quantity of dimension compression can be updated.

上記類似時系列データ計算方法は、セグメント管理手段により、時系列データを均等的な長さの複数のセグメントに分割するステップと、前記次元圧縮係数計算手段により、時系列データの各セグメントにおける平均値と標準偏差を求めるステップと、を有することを特徴とする。 The similar time series data calculation method includes a step of segmenting time series data into a plurality of segments of equal length by segment management means, and an average value in each segment of time series data by the dimension compression coefficient calculation means. And obtaining a standard deviation.

本発明にあっては、時系列データを均等的な長さの複数のセグメントに分割して、各セグメントにおける平均値と標準偏差を求めることで、特定の長さのクエリの時系列データに対応可能とする。 In the present invention, the time series data is divided into a plurality of segments of equal length, and the average value and standard deviation in each segment are obtained, thereby supporting the time series data of a specific length query. Make it possible.

第２の本発明に係る類似時系列データ計算装置は、類否判断の対象となる一対の時系列データをメモリから読み出し、それぞれについて次元圧縮係数として平均値と標準偏差を求めてメモリに格納する次元圧縮係数計算手段と、平均値と標準偏差をメモリから読み出し、これらを用いて時系列データの距離の上限値および下限値を求めてメモリに格納する上限値・下限値計算手段と、上限値、下限値、閾値をメモリから読み出し、下限値が閾値以下でかつ上限値が閾値以下の場合に一対の時系列データは類似であると判断する類否判断手段と、を有することを特徴とする。 The similar time-series data calculation apparatus according to the second aspect of the present invention reads a pair of time-series data to be subjected to similarity determination from a memory, obtains an average value and a standard deviation as a dimension compression coefficient for each, and stores them in the memory. Dimensional compression coefficient calculation means, upper limit / lower limit calculation means for reading the average value and standard deviation from the memory, and using these to obtain the upper and lower limit values of the distance of the time series data and storing them in the memory, and the upper limit value And a similarity determination unit that reads the lower limit value and the threshold value from the memory, and determines that the pair of time series data is similar when the lower limit value is equal to or lower than the threshold value and the upper limit value is equal to or lower than the threshold value. .

第３の本発明に係る類似時系列データ計算プログラムは、類否判断の対象となる一対の時系列データをメモリから読み出し、それぞれについて次元圧縮係数として平均値と標準偏差を求めてメモリに格納する処理と、平均値と標準偏差をメモリから読み出し、これらを用いて時系列データの距離の上限値および下限値を求めてメモリに格納する処理と、上限値、下限値、閾値をメモリから読み出し、下限値が閾値以下でかつ上限値が閾値以下の場合に一対の時系列データは類似であると判断する処理と、をコンピュータに実行させることを特徴とする。 A similar time-series data calculation program according to the third aspect of the present invention reads a pair of time-series data to be subjected to similarity determination from a memory, obtains an average value and a standard deviation as a dimension compression coefficient for each, and stores them in the memory. Processing, reading the average value and standard deviation from the memory, using these to obtain the upper limit value and lower limit value of the distance of the time series data, and storing the upper limit value, lower limit value, threshold value from the memory, When the lower limit value is equal to or less than the threshold value and the upper limit value is equal to or less than the threshold value, the computer is caused to execute a process of determining that the pair of time series data is similar.

本発明によれば、距離の上限値と下限値を用いて類否判定を行うことにより、探索漏れや過剰探索の発生を防止することができる。また、時系列データを指数的な長さのセグメントに分割することで、任意の長さのクエリの時系列データにも対応することができる。 According to the present invention, it is possible to prevent occurrence of search omission and excessive search by performing similarity determination using the upper limit value and the lower limit value of the distance. Further, by dividing the time series data into segments of exponential length, it is possible to deal with time series data of an arbitrary length query.

以下、本発明の実施形態について図面を用いて説明する。 Hereinafter, embodiments of the present invention will be described with reference to the drawings.

図１のブロック図に示すように、本実施の形態における類似時系列データ計算装置１は、データ受信部２、次元圧縮部３、距離計算部４、計算結果送信部５を有する構成である。類似時系列データ計算装置１は、演算処理装置、メモリ、記憶装置等のハードウェアを備えたコンピュータであり、各部の処理は、これらのハードウェアと協働して動作するプログラムによって実行される。各部の処理について概要から説明する。 As shown in the block diagram of FIG. 1, the similar time-series data calculation apparatus 1 according to the present embodiment has a configuration including a data reception unit 2, a dimension compression unit 3, a distance calculation unit 4, and a calculation result transmission unit 5. The similar time-series data calculation device 1 is a computer including hardware such as an arithmetic processing device, a memory, and a storage device, and the processing of each unit is executed by a program that operates in cooperation with these hardware. The processing of each part will be described from the overview.

データ受信部２は、外部から類似か否かの判断対象となる２つの時系列データと各時系列データのクエリの長さとを受信してメモリに記憶させる。次元圧縮部３は、メモリから各時系列データを読み出し、次元圧縮してメモリに記憶させる。距離計算部４は、メモリから次元圧縮後の各時系列データを読み出し、時系列データ間の距離を計算し、その距離に基づいて両者の類否を判断する。この判断が困難な場合には、メモリから次元圧縮前の各時系列データを読み出し、この時系列データ間の距離を計算し、この距離に基づいて類否を判断する。計算結果送信部５は類否判断の結果を外部へ送信する。 The data receiving unit 2 receives from the outside two pieces of time-series data to be judged whether they are similar and the query length of each piece of time-series data, and stores them in the memory. The dimension compressing unit 3 reads each time series data from the memory, compresses the dimension, and stores it in the memory. The distance calculation unit 4 reads each time-series data after dimension compression from the memory, calculates the distance between the time-series data, and determines the similarity between the two based on the distance. If this determination is difficult, each time-series data before dimension compression is read from the memory, a distance between the time-series data is calculated, and similarity is determined based on this distance. The calculation result transmission unit 5 transmits the result of similarity determination to the outside.

図２のブロック図に示すように、次元圧縮部３は、データ受信部３１、セグメント管理部３２、次元圧縮係数計算部３３、計算結果送信部３４を有する。 As shown in the block diagram of FIG. 2, the dimension compression unit 3 includes a data reception unit 31, a segment management unit 32, a dimension compression coefficient calculation unit 33, and a calculation result transmission unit 34.

データ受信部３１は、次元圧縮前の時系列データを受信する。セグメント管理部３２は、各セグメントの長さを管理し、時系列データを指数的に長さの異なる複数のセグメント、あるいは均等的な長さの複数のセグメントに分割する。次元圧縮係数計算部３３は、時系列データの各セグメントにおける平均値と標準偏差を次元圧縮の係数として求めることで、次元圧縮を行う。計算結果送信部は次元圧縮後の時系列データを距離計算部４へ送信する。 The data receiving unit 31 receives time-series data before dimension compression. The segment management unit 32 manages the length of each segment, and divides the time series data into a plurality of segments having exponentially different lengths or a plurality of segments having an equal length. The dimension compression coefficient calculation unit 33 performs dimension compression by obtaining an average value and a standard deviation in each segment of the time series data as a dimension compression coefficient. The calculation result transmission unit transmits the time series data after dimension compression to the distance calculation unit 4.

図３のブロック図に示すように、セグメント管理部３２は、データ受信部３２１、セグメント計数部３２２、セグメント統合部３２３、データ送信部３２４を有する。データ受信部３２１は、次元圧縮前の時系列データを受信する。セグメント計数部３２２は特定の長さのセグメントの個数を計数する。セグメント統合部３２３は、計数した個数が一定値より大きい場合に、その長さのセグメントを２つ統合する。データ送信部３２４は、統合後のセグメントの組み合わせを次元圧縮係数計算部３３に送信する。 As shown in the block diagram of FIG. 3, the segment management unit 32 includes a data reception unit 321, a segment counting unit 322, a segment integration unit 323, and a data transmission unit 324. The data receiving unit 321 receives time-series data before dimension compression. The segment counting unit 322 counts the number of segments having a specific length. When the counted number is larger than a certain value, the segment integration unit 323 integrates two segments having the length. The data transmission unit 324 transmits the combined segment combination to the dimension compression coefficient calculation unit 33.

図４のブロック図に示すように、距離計算部４は、データ受信部４１、利用セグメント決定部４２、距離係数計算部４３、上限値計算部４４、下限値計算部４５、類否判定部４６、厳密距離計算部４７、判定結果送信部４８を有する。 As shown in the block diagram of FIG. 4, the distance calculation unit 4 includes a data reception unit 41, a use segment determination unit 42, a distance coefficient calculation unit 43, an upper limit value calculation unit 44, a lower limit value calculation unit 45, and an similarity determination unit 46. , A strict distance calculation unit 47 and a determination result transmission unit 48.

データ受信部４１は、次元圧縮前の時系列データ、クエリの長さ、次元圧縮後の時系列データ（すなわち次元圧縮計数）を受信する。利用セグメント決定部４２は、次元圧縮後の距離を計算するときに用いるセグメントを決定する。距離係数計算部４３は、次元圧縮係数を用いて時系列データ間の距離係数を計算する。上限値計算部４４は、距離係数を用いて距離の上限値を求めてメモリに格納する。下限値計算部４５は、距離係数を用いて距離の下限値を求めてメモリに格納する。類否判定部４６は、上限値、下限値、閾値をメモリから読み出し、下限値が閾値以下でかつ上限値が閾値以下の場合に一対の時系列データを類似であると判断する。 The data receiving unit 41 receives time-series data before dimension compression, query length, and time-series data after dimension compression (that is, dimension compression count). The use segment determination unit 42 determines a segment used when calculating the distance after dimension compression. The distance coefficient calculation unit 43 calculates a distance coefficient between time series data using a dimensional compression coefficient. The upper limit calculator 44 calculates the upper limit of the distance using the distance coefficient and stores it in the memory. The lower limit calculator 45 obtains the lower limit value of the distance using the distance coefficient and stores it in the memory. The similarity determination unit 46 reads the upper limit value, the lower limit value, and the threshold value from the memory, and determines that the pair of time series data is similar when the lower limit value is equal to or less than the threshold value and the upper limit value is equal to or less than the threshold value.

厳密距離計算部４７は、類否判定部４６により下限値が閾値以下であるが上限値が閾値よりも大きいと判断された場合に、次元圧縮前の一対の時系列データをメモリから読み出し、この時系列データ間の距離を求めてメモリに格納する。この場合、類否判定部４６は、この距離と閾値とをメモリから読み出し、この距離が閾値以下の場合に時系列データを類似と判断する。このように厳密な距離を用いることで正確な判断を行う。判定結果送信部４８は類否判断の結果を計算結果送信部５へ送信する。 When the similarity determination unit 46 determines that the lower limit value is less than or equal to the threshold value but the upper limit value is greater than the threshold value, the exact distance calculation unit 47 reads a pair of time series data before dimension compression from the memory, The distance between the time series data is obtained and stored in the memory. In this case, the similarity determination unit 46 reads the distance and the threshold value from the memory, and determines that the time series data is similar when the distance is equal to or less than the threshold value. Thus, an accurate determination is made by using a strict distance. The determination result transmission unit 48 transmits the result of similarity determination to the calculation result transmission unit 5.

次元圧縮部３で時系列データを次元圧縮しても過剰探索が発生しないことはupper bounding lemmaにより保証される。upper bounding lemmaとは、次元圧縮後のシーケンスの距離をＵ（Ｘ’，Ｙ’）としたときにＤ（Ｘ，Ｙ）≦Ｕ（Ｘ’，Ｙ’）が成立すれば過剰探索が発生しないとする補助定理である。すなわち、次元圧縮後の距離が次元圧縮前の距離の上限値であれば過剰探索は発生しないことになる。 The upper bounding lemma guarantees that no excessive search occurs even if the time series data is dimensionally compressed by the dimension compression unit 3. Upper bounding lemma means that excessive search does not occur if D (X, Y) ≦ U (X ′, Y ′) holds when the distance of the sequence after dimension compression is U (X ′, Y ′). Is an auxiliary theorem. That is, if the distance after dimensional compression is the upper limit value of the distance before dimensional compression, excessive search will not occur.

また、探索漏れが発生しないことはlower bounding lemmaより保証される。lower bounding lemmaとは、次元圧縮後の時系列データをＬ（Ｘ’，Ｙ’）としたときに、Ｌ（Ｘ’，Ｙ’）≦Ｄ（Ｘ，Ｙ）が成り立てば探索漏れが発生しないという補助定理である。すなわち、次元圧縮後の距離が次元圧縮前の距離の下限値であれば探索漏れは発生しないことになる。 Moreover, it is guaranteed by lower bounding lemma that no search omission occurs. The lower bounding lemma means that when L (X ′, Y ′) ≦ D (X, Y) holds when the time-series data after dimension compression is L (X ′, Y ′), no search leakage occurs. This is the lemma. That is, if the distance after dimensional compression is the lower limit value of the distance before dimensional compression, no search omission occurs.

次に、類似時系列データ計算装置１における各部の処理について詳細に説明する。 Next, processing of each unit in the similar time series data calculation apparatus 1 will be described in detail.

まず、次元圧縮部３では、セグメント管理部３２により、シーケンスＸ（ｎデータポイント）をＮ個の特定の長さのセグメントに分割してｉ番目のセグメントをσ_iとしたとき、次元圧縮係数計算部３３では、σ_iの長さをｌ（σ_i）、平均をａｖ（σ_i）、標準偏差をｓｄ（σ_i）として次元圧縮後のシーケンスＸ’を次のように３つの係数のタプルで表現する。

First, in the dimension compression unit 3, when the segment management unit 32 divides the sequence X (n data points) into N specific length segments and sets the i-th segment as σ _i , the dimension compression coefficient calculation is performed. In section 33, the length of σ _i is l (σ _i ), the average is av (σ _i ), the standard deviation is sd (σ _i ), and the sequence X ′ after dimension compression is a tuple of three coefficients as follows: It expresses with.

ここでｌ（σ_i）については以下の式が成り立つ。

Here, the following equation holds for l (σ _i ).

またｘ_i:jをσ_iのｊ番目のデータポイントとすると、次元圧縮係数計算部３３は、ａｖ（σ_i）とｓｄ（σ_i）を次のように計算する。

If x _{i: j} is the j-th data point of σ _i , the dimension compression coefficient calculation unit 33 calculates av (σ _i ) and sd (σ _i ) as follows.

続いて、距離計算部４での処理について図５のフローチャートを参照しながら説明する。この処理では、任意の長さのクエリに対して、距離の下限値を計算するのか、上限値を計算するのかにより、次元圧縮後の距離計算に用いるセグメントが異なる。すなわち、ステップ１（図５において「Ｓ１」と示す。以下同じ）において、利用セグメント決定部４２は、図６に示すように、下限値を求めるときにはクエリより短いシーケンスの中で最も長いシーケンスにおけるセグメントを利用セグメントとして決定し、上限値を求めるときにはクエリより長いシーケンスの中で最も短いシーケンスにおけるセグメントを利用セグメントとして決定する。各シーケンスは、複数のセグメントを集約して構成される。下限値の計算に用いるセグメントの中で最も番号の小さいものをＮ_l、上限値の計算に用いるセグメントの中で最も番号の小さいものをＮ_uとすると、Ｎ_lとＮ_uは以下の式を満たすセグメントのうちそれぞれ最大のものと最小のものである。

Next, processing in the distance calculation unit 4 will be described with reference to the flowchart in FIG. In this process, a segment used for distance calculation after dimension compression differs depending on whether a lower limit value or an upper limit value is calculated for a query having an arbitrary length. That is, in step 1 (shown as “S1” in FIG. 5; the same applies to the following), as shown in FIG. 6, the use segment determination unit 42 determines the segment in the longest sequence among sequences shorter than the query when obtaining the lower limit value. Is determined as a utilization segment, and when the upper limit value is obtained, the segment in the shortest sequence among the sequences longer than the query is determined as the utilization segment. Each sequence is configured by aggregating a plurality of segments. Assuming that the segment with the smallest number among the segments used for the calculation of the lower limit value is N _l and the segment with the smallest number among the segments used for the calculation of the upper limit value are N _u , N _l and N _u can be expressed as The largest and smallest segments to satisfy.

ステップ２において、距離係数計算部４３は、次式により距離係数Ｄ１，Ｄ２，Ｄ３を計算する。

In step 2, the distance coefficient calculation unit 43 calculates distance coefficients D1, D2, and D3 by the following equations.

ステップ３において、下限値計算部４５は、次式により距離の下限値Ｌ（Ｘ’，Ｙ’）を計算する。

In step 3, the lower limit calculator 45 calculates the lower limit L (X ′, Y ′) of the distance by the following formula.

ここでＬ（Ｘ’，Ｙ’）は以下のように書き換えることができ、Ｄ１，Ｄ２，Ｄ３を予め計算しておくことで、それらをＬ（Ｘ’，Ｙ’）の計算に利用できる。

Here, L (X ′, Y ′) can be rewritten as follows, and by calculating D1, D2, and D3 in advance, they can be used for calculating L (X ′, Y ′).

ステップ４において、類否判定部４６は、下限値と閾値を比較し、下限値が閾値より大きい場合には、時系列データは類似ではないと判断して処理を終了する（ステップ５）。一方、下限値が閾値以下の場合には、ステップ６へ進む。 In step 4, the similarity determination unit 46 compares the lower limit value with the threshold value, and if the lower limit value is larger than the threshold value, the time series data is determined not to be similar, and the process is terminated (step 5). On the other hand, if the lower limit value is less than or equal to the threshold value, the process proceeds to step 6.

ステップ６では、上限値計算部４４が、次式により距離の上限値Ｕ（Ｘ’，Ｙ’）を計算する。

In step 6, the upper limit calculation unit 44 calculates the upper limit value U (X ′, Y ′) of the distance by the following equation.

ここで、Ｎ_u≦Ｎ_lなのでＵ（Ｘ’，Ｙ’）は以下のように書き換えることができ、Ｄ１，Ｄ２，Ｄ３を予め計算しておくことで、それらをＵ（Ｘ’，Ｙ’）の計算にも利用できる。

Here, since N _u ≦ N _l, U (X ′, Y ′) can be rewritten as follows, and by calculating D1, D2, D3 in advance, they can be rewritten as U (X ′, Y ′). ).

ステップ７において、類否判定部４６は、上限値と閾値を比較し、上限値が閾値以下の場合には、時系列データは類似と判断して処理を終了する（ステップ８）。一方、上限値が閾値よりも大きい場合には、時系列データは類似ではないと判断してステップ９へ進む。 In step 7, the similarity determination unit 46 compares the upper limit value with the threshold value. If the upper limit value is less than or equal to the threshold value, the time series data is determined to be similar, and the process ends (step 8). On the other hand, when the upper limit value is larger than the threshold value, it is determined that the time series data is not similar, and the process proceeds to step 9.

ステップ９では、厳密距離計算部４７により、次元圧縮前の時系列データを用いて両者のユークリッド距離を求め、その距離を類否判定部４６へ送る。 In step 9, the exact distance calculation unit 47 calculates the Euclidean distance between the two using time-series data before dimension compression, and sends the distance to the similarity determination unit 46.

ステップ１０において、類否判定部４６は、厳密距離計算部４７が計算した距離と閾値とを比較し、その距離が閾値以下の場合には時系列データは類似と判断し（ステップ１１）、それ以外の場合には非類似と判断して処理を終了する（ステップ１２）。 In step 10, the similarity determination unit 46 compares the distance calculated by the strict distance calculation unit 47 with a threshold value, and determines that the time series data is similar if the distance is equal to or less than the threshold value (step 11). Otherwise, it is determined that they are dissimilar and the process is terminated (step 12).

次に、上記のように求めた下限値Ｌ（Ｘ’，Ｙ’）、上限値Ｕ（Ｘ’，Ｙ’）を用いることで、Ｌ（Ｘ’，Ｙ’）≦Ｄ（Ｘ，Ｙ）≦Ｕ（Ｘ’，Ｙ’）となることを証明する。まずＬ（Ｘ’，Ｙ’）≦Ｄ（Ｘ，Ｙ）となることを示す。 Next, by using the lower limit value L (X ′, Y ′) and the upper limit value U (X ′, Y ′) obtained as described above, L (X ′, Y ′) ≦ D (X, Y) It proves that ≦ U (X ′, Y ′). First, it is shown that L (X ′, Y ′) ≦ D (X, Y).

Ｄ（Ｘ，Ｙ）はユークリッド距離なのでｋ≧ｌとすると明らかに以下のことが成り立つ。

Since D (X, Y) is an Euclidean distance, the following is clearly true if k ≧ l.

ｋにおいては以下の式が成り立つとする。

Assume that the following equation holds for k.

ここでΔｘ_i:j＝ａｖ（σ_i）−ｘ_i:jとすると

Where Δx _{i: j} = av (σ _i ) −x _{i: j}

ここで

here

であるため

Because

となる。ここでΔｘ_i:jをベクトルとみなしたとき、‖Δｘ_i:j‖をベクトルΔｘ_i:jの大きさとする。すると内積の定義式より

It becomes. Here, when Δx _{i: j} is regarded as a vector, ‖Δx _{i: j} ‖ is the size of the vector Δx _{i: j} . Then, from the inner product definition formula

となる。なお、ここでθ_iはベクトルΔｘ_i:jとベクトルΔｙ_i:jのなす角とする。cosθ_i≦１であるので、

It becomes. Here, θ _i is an angle formed by the vector Δx _{i: j} and the vector Δy _{i: j} . Since cosθ _i ≦ 1,

となる。また同様に、ｋ≦１、cosθ_i≧１とすれば、Ｄ（Ｘ，Ｙ）≦Ｕ（Ｘ’，Ｙ’）となることも示せる。 It becomes. Similarly, if k ≦ 1 and cos θ _i ≧ 1, it can be shown that D (X, Y) ≦ U (X ′, Y ′).

次に、セグメント管理部３２において指数的に長さの異なるセグメントを管理する処理について詳細に説明する。セグメント管理部３２は、レベルｈのセグメントの長さを２^hとし、図７の模式図に示すように、処理時間によってレベルｈを引き上げることで、セグメントの長さが指数的に大きくなるようにセグメント分割を行う。このようにするのは、任意の長さのクエリに対応するためである。これにより、クエリが短い場合も長い場合も、下限値と上限値を計算したときの相対的な誤差は同等になる。 Next, processing for managing segments having exponentially different lengths in the segment management unit 32 will be described in detail. The segment management unit 32 sets the length of the segment of level h to 2 ^h, and increases the level h according to the processing time so that the length of the segment increases exponentially as shown in the schematic diagram of FIG. Perform segmentation. This is because it corresponds to a query having an arbitrary length. As a result, whether the query is short or long, the relative error when the lower limit value and the upper limit value are calculated is the same.

セグメント統合部３２３は、各レベルにおいてセグメントの個数がキャパシティｃより大きくなった場合に、そのレベルにおける２つのセグメントを統合し、次元圧縮の特徴量の更新をインクリメンタル的に行う。 When the number of segments at each level becomes larger than the capacity c, the segment integration unit 323 integrates the two segments at that level and incrementally updates the dimension compression feature quantity.

例えばｃ＝２としたときの具体的な更新処理について図８の模式図および図９のフローチャートを用いて説明する。 For example, specific update processing when c = 2 will be described with reference to the schematic diagram of FIG. 8 and the flowchart of FIG.

ステップ２１（図９で「Ｓ２１」と示す。以下同じ）で、セグメント計数部３２２は、レベルｈを初期値０に設定する。そして、ステップ２２で、レベルｈ＝０におけるセグメントの個数を計数する。ここでは、セグメントの個数を、レベルｈ＝０で２、ｈ＝１で２、ｈ＝２で１とする。ここでシーケンスデータが流入してくるとｈ＝０のセグメントは３個になる。セグメントの個数がｃ（＝２）よりも大きくなったので（ステップ２４）、ステップ２５で、セグメント統合部３２３は、セグメントを統合してｈ＝１のセグメントを作成する。するとレベルｈ＝１のセグメントも３個になるので、ステップ２６でレベルｈを１インクリメントして同様に処理する。この処理をレベルｈが時系列データの最大のレベルとなるまで繰り返す（ステップ２３）。結果として、更新後のセグメントの個数はレベルｈ＝０で１、レベルｈ＝１で１、レベルｈ＝２で２となる。 In step 21 (shown as “S21” in FIG. 9; the same applies hereinafter), the segment counting unit 322 sets the level h to the initial value 0. In step 22, the number of segments at level h = 0 is counted. Here, the number of segments is 2 at level h = 0, 2 at h = 1, and 1 at h = 2. Here, when sequence data flows in, there are three segments with h = 0. Since the number of segments is larger than c (= 2) (step 24), in step 25, the segment integration unit 323 integrates the segments and creates a segment with h = 1. Then, since there are three segments of level h = 1, the level h is incremented by 1 in step 26 and the same processing is performed. This process is repeated until the level h reaches the maximum level of the time series data (step 23). As a result, the number of segments after the update is 1 at level h = 0, 1 at level h = 1, and 2 at level h = 2.

セグメントσ_iとσ_i+1を統合して新たなセグメントσ′_iとするときの次元圧縮の特徴量については、以下のように更新する。

The dimension compression feature quantity when the segments σ _i and σ _{i + 1} are integrated into a new segment σ ′ _i is updated as follows.

次に、セグメント管理部３２において均等的な長さのセグメントを管理する処理について詳細に説明する。ここでは、任意の長さのクエリではなく、固定長のクエリに対して類似時系列データを探索することを想定する。この場合は一番前のセグメントは逐次更新し、一番後のセグメントはキャパシティｃ毎に更新する。次元圧縮係数計算部３３は、キャパシティｃの値ごとにまとめた均等的な長さの各セグメントにおける時系列データの平均値と標準偏差を求めてメモリに保持する。 Next, processing for managing segments of equal length in the segment management unit 32 will be described in detail. Here, it is assumed that similar time series data is searched for a query of a fixed length, not a query of an arbitrary length. In this case, the first segment is updated sequentially, and the last segment is updated for each capacity c. The dimension compression coefficient calculation unit 33 calculates the average value and standard deviation of the time series data in each segment of equal length collected for each value of the capacity c, and stores the average value and the standard deviation in the memory.

ｃ＝２であるときの具体的な更新処理について図１０の模式図を用いて説明する。次元圧縮の特徴量は４つであり、更新前（ｔ＝１）においてセグメントの長さは１番目が４、２番目が４、３番目が４、４番目が０である。ここでシーケンスデータが流入してくると（ｔ＝２）、セグメントの長さは１番目が４、２番目が４、３番目が４、４番目が１となる。次のシーケンスデータが流入してくると（ｔ＝３）、セグメントの長さは１番目が２、２番目が４、３番目が４、４番目が２となる。 A specific update process when c = 2 will be described with reference to the schematic diagram of FIG. There are four dimension compression feature quantities, and the segment length is 4 for the first, 4 for the second, 4 for the third, 4 for the fourth, and 0 before the update (t = 1). Here, when the sequence data flows in (t = 2), the length of the segment is 4 for the first, 4 for the second, 4 for the third, and 1 for the fourth. When the next sequence data flows in (t = 3), the length of the segment is 2 for the first, 4 for the second, 4 for the third, and 2 for the fourth.

セグメントは逐次またはキャパシティｃごとに更新するが、このときの次元圧縮の係数は長さｌ_i＋ｌ_i+1のセグメントをσ’_iとし、長さｌ_iのセグメントをσ_iとし、長さｌ_i+1のセグメントをσ_i+1としたとき以下の式を用いて更新する。

Segment is sequentially updated or per capacity c, but the coefficient of dimensional compression at this time and length l _i + l i _{+ 1} segment sigma _'i, a segment of length l _i and sigma _i, the length When the segment of l _{i + 1} is σ _{i + 1} , it is updated using the following formula.

以上、説明したように、本実施の形態によれば、距離計算部４において、距離の下限値が閾値以下の場合に類似として判断することで、lower bounding lemmaの補助定理により探索漏れがない性質を保証するとともに、距離の上限値が閾値以下の場合に類似として判断することで、upper bounding lemmaの補助定理により過剰探索がない性質を保証する。これにより、探索漏れの発生および過剰探索の発生を防ぐことができる。 As described above, according to the present embodiment, the distance calculation unit 4 determines that the lower limit value of the distance is similar to the threshold value or less so that there is no search omission due to the lower bounding lemma theorem. And guaranteeing the property that there is no excessive search by the upper bounding lemma theorem by determining that the upper limit of the distance is similar to the threshold value. Thereby, the occurrence of search omission and the occurrence of excessive search can be prevented.

本実施の形態によれば、セグメント管理部３２により、時系列データを指数的に長さの異なる複数のセグメントに分割して、次元圧縮係数計算部３３が各セグメントにおける平均値と標準偏差を求めることで、任意の長さのクエリの時系列データにも対応することができる。 According to the present embodiment, the segment management unit 32 divides the time series data into a plurality of segments having exponentially different lengths, and the dimension compression coefficient calculation unit 33 obtains an average value and a standard deviation in each segment. Thus, it is possible to deal with time-series data of queries having an arbitrary length.

本実施の形態によれば、類否判定部４６により、下限値との関係で類似と判断された場合であっても上限値との関係で非類似と判断された場合には、次元圧縮後の時系列データからでは正確な類否判断ができなかったものとして、厳密距離計算部４７により、次元圧縮前の一対の時系列データを用いて両者のユークリッド距離を求め、再度類否判定部４６によりユークリッド距離と閾値との比較を行うことで、より厳密な類否判断を行うことができる。 According to the present embodiment, even if it is determined that the similarity determination unit 46 is similar in relation to the lower limit value, if it is determined to be dissimilar in relation to the upper limit value, after dimension compression Therefore, the exact distance calculation unit 47 obtains the Euclidean distance between the pair of time series data before the dimension compression, and the similarity determination unit 46 again. Thus, by comparing the Euclidean distance with the threshold value, it is possible to make a stricter similarity determination.

本実施の形態によれば、セグメント計数部３２２により計数した特定の長さのセグメントの個数が一定値より大きくなった場合に、セグメント統合部３２３により、その長さのセグメントを２つ統合し、次元圧縮係数計算部３３により、統合後のセグメントにおける平均値と標準偏差を求めることで、次元圧縮の特徴量の更新を可能にすることができる。 According to the present embodiment, when the number of segments of a specific length counted by the segment counting unit 322 becomes larger than a certain value, the segment integration unit 323 integrates two segments of that length, By obtaining the average value and the standard deviation in the segment after integration by the dimension compression coefficient calculation unit 33, it is possible to update the feature quantity of dimension compression.

本実施の形態によれば、セグメント管理部３２により、時系列データを均等的な長さの複数のセグメントに分割して、次元圧縮係数計算部３３により、各セグメントにおける平均値と標準偏差を求めることで、特定の長さのクエリの時系列データに対応することができる。 According to the present embodiment, the segment management unit 32 divides the time series data into a plurality of segments of equal length, and the dimension compression coefficient calculation unit 33 obtains the average value and standard deviation in each segment. Thus, it is possible to deal with time-series data of a query having a specific length.

本実施の形態によれば、時系列データを指数的または均等的な長さのセグメントに分割することにより、相対的に小さい誤差で時系列データの類否判断を行うことができる。 According to the present embodiment, by dividing the time series data into segments of exponential or equal length, it is possible to determine the similarity of the time series data with a relatively small error.

一実施の形態における類似時系列データ計算装置の構成を示すブロック図である。It is a block diagram which shows the structure of the similar time series data calculation apparatus in one embodiment. 上記類似時系列データ計算装置における次元圧縮部の構成を示すブロック図である。It is a block diagram which shows the structure of the dimension compression part in the said similar time series data calculation apparatus. 上記次元圧縮部におけるセグメント管理部の構成を示すブロック図である。It is a block diagram which shows the structure of the segment management part in the said dimension compression part. 上記類似時系列データ計算装置における距離計算部の構成を示すブロック図である。It is a block diagram which shows the structure of the distance calculation part in the said similar time series data calculation apparatus. 距離計算部における処理を示すフローチャートである。It is a flowchart which shows the process in a distance calculation part. 下限値の計算に用いるセグメントと上限値の計算に用いるセグメントを説明するための模式図である。It is a schematic diagram for demonstrating the segment used for calculation of the lower limit, and the segment used for calculation of an upper limit. 任意の長さのクエリに対して相対的な誤差は変わらないことを説明するための模式図である。It is a schematic diagram for demonstrating that a relative error does not change with respect to a query of an arbitrary length. 特徴量を更新するときの処理を説明するための模式図である。It is a schematic diagram for demonstrating the process at the time of updating a feature-value. 特徴量の更新の処理を示すフローチャートである。It is a flowchart which shows the update process of the feature-value. クエリが固定長の場合における特徴量を更新するときの処理を説明するための模式図である。It is a schematic diagram for demonstrating the process at the time of updating the feature-value in case a query is fixed length.

Explanation of symbols

１…類似時系列データ計算装置
２…データ受信部
３…次元圧縮部
４…距離計算部
５…計算結果送信部
３１…データ受信部
３２…セグメント管理部
３３…次元圧縮係数計算部
３４…計算結果送信部
４１…データ受信部
４２…利用セグメント決定部
４３…距離係数計算部
４４…上限値計算部
４５…下限値計算部
４６…類否判定部
４７…厳密距離計算部
４８…判定結果送信部
３２１…データ受信部
３２２…セグメント計数部
３２３…セグメント統合部
３２４…データ送信部
DESCRIPTION OF SYMBOLS 1 ... Similar time series data calculation apparatus 2 ... Data reception part 3 ... Dimension compression part 4 ... Distance calculation part 5 ... Calculation result transmission part 31 ... Data reception part 32 ... Segment management part 33 ... Dimension compression coefficient calculation part 34 ... Calculation result Transmission unit 41 ... Data reception unit 42 ... Use segment determination unit 43 ... Distance coefficient calculation unit 44 ... Upper limit value calculation unit 45 ... Lower limit value calculation unit 46 ... Similarity determination unit 47 ... Strict distance calculation unit 48 ... Determination result transmission unit 321 ... Data reception unit 322 ... Segment counting unit 323 ... Segment integration unit 324 ... Data transmission unit

Claims

A step of reading out a pair of time series data subject to similarity determination from the memory by the dimension compression coefficient calculating means, obtaining an average value and a standard deviation as a dimension compression coefficient for each, and storing them in the memory;
The average value and the standard deviation are read from the memory by the upper limit / lower limit calculation means, and the upper limit value and the lower limit value of the distance of the time series data are obtained and stored in the memory using these,
A step of reading the upper limit value, the lower limit value, and the threshold value from the memory by the similarity determination means, and determining that the pair of time series data is similar when the lower limit value is equal to or less than the threshold value and the upper limit value is equal to or less than the threshold value;
A similar time series data calculation method characterized by comprising:

When the similarity determination means determines that the lower limit value is less than or equal to the threshold value and the upper limit value is greater than the threshold value, the exact distance calculation means reads a pair of time series data before dimension compression from the memory, Determining the distance and storing it in memory;
A step of reading the distance and threshold of the pair of time series data before dimension compression from the memory by the similarity determination means, and determining that the pair of time series data is similar when the distance is equal to or less than the threshold;
The similar time series data calculation method according to claim 1, wherein:

Dividing the time-series data into a plurality of segments exponentially different in length by the segment management means;
Obtaining a mean value and a standard deviation in each segment of the time series data by the dimension compression coefficient calculating means;
The similar time series data calculation method according to claim 1, wherein:

Counting the number of segments of a specific length by the segment counting means;
A step of integrating two segments of the length when the counted number becomes larger than a certain value by the segment integration means;
Obtaining a mean value and a standard deviation in the segment after integration by the dimension compression coefficient calculating means;
The similar time series data calculation method according to claim 3, wherein:

Dividing the time series data into a plurality of segments of equal length by segment management means;
Obtaining a mean value and a standard deviation in each segment of the time series data by the dimension compression coefficient calculating means;
The similar time series data calculation method according to claim 1, wherein:

A dimensional compression coefficient calculation means for reading a pair of time series data to be subjected to similarity determination from the memory, obtaining an average value and a standard deviation as a dimensional compression coefficient for each, and storing them in the memory;
An upper limit / lower limit calculation means for reading an average value and a standard deviation from the memory, and using these to obtain an upper limit value and a lower limit value of the distance of the time series data and storing them in the memory,
An similarity determination unit that reads the upper limit value, the lower limit value, and the threshold value from the memory, and determines that the pair of time series data is similar when the lower limit value is equal to or less than the threshold value and the upper limit value is equal to or less than the threshold value;
A similar time-series data calculation device comprising:

A process of reading a pair of time series data to be subjected to similarity determination from the memory, obtaining an average value and a standard deviation as a dimensional compression coefficient for each, and storing them in the memory;
A process of reading an average value and a standard deviation from the memory, and using these to obtain an upper limit value and a lower limit value of the distance of the time series data and storing them in the memory,
A process of reading the upper limit value, the lower limit value, and the threshold value from the memory, and determining that the pair of time series data is similar when the lower limit value is equal to or less than the threshold value and the upper limit value is equal to or less than the threshold value;
A similar time series data calculation program characterized by causing a computer to execute.