JP2021093012A

JP2021093012A - Compilation program and information processing device

Info

Publication number: JP2021093012A
Application number: JP2019223768A
Authority: JP
Inventors: 正寿原口; Masatoshi Haraguchi; 俊鎌塚; Shun Kamatsuka; 渡辺　健介; Kensuke Watanabe; 健介渡辺
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2019-12-11
Filing date: 2019-12-11
Publication date: 2021-06-17

Abstract

【課題】適切な手法でループ処理を最適化できるようにする。【解決手段】情報処理装置１０は、ループ命令列１に基づいて、マスクで有効とした１以上の要素に対してＳＩＭＤ命令によって並列で演算を実行させる処理を繰り返すループ処理が記述された第１ＳＩＭＤ化命令列２を生成する。次に情報処理装置１０は、配列内の不連続の複数の要素をＳＩＭＤレジスタ内の連続の領域へ格納して、ＳＩＭＤレジスタ内の複数の要素に対してＳＩＭＤ命令により並列で演算を実行する処理を繰り返すループ処理が記述された第２ＳＩＭＤ化命令列３を生成する。そして情報処理装置１０は、第１ＳＩＭＤ化命令列２と第２ＳＩＭＤ化命令列３と処理時間に基づいて、コンパイル後のプログラムに含める命令列を選択する。【選択図】図１PROBLEM TO BE SOLVED: To optimize loop processing by an appropriate method. SOLUTION: An information processing apparatus 10 describes a first SIMD in which a loop process of repeating a process of executing operations in parallel by a SIMD instruction for one or more elements enabled by a mask based on a loop instruction sequence 1 is described. Generate instruction sequence 2. Next, the information processing device 10 stores a plurality of discontinuous elements in the array in a continuous area in the SIMD register, and executes operations in parallel with the SIMD instructions for the plurality of elements in the SIMD register. The second SIMD instruction sequence 3 in which the loop processing that repeats the above is described is generated. Then, the information processing apparatus 10 selects an instruction sequence to be included in the compiled program based on the first SIMD instruction sequence 2, the second SIMD instruction sequence 3, and the processing time. [Selection diagram] Fig. 1

Description

本発明は、コンパイルプログラム、および情報処理装置に関する。 The present invention relates to a compilation program and an information processing device.

コンピュータで実行するプログラムは、高級言語で作成されたソースプログラムを、コンパイラを用いて実行形式のコード列に変換することで生成される。コンパイラを実行するコンピュータは、生成するプログラムを実行させるプロセッサのマイクロアーキテクチャに合わせて最適化したプログラムを生成する。マイクロアーキテクチャは、プロセッサの回路の論理的構造（レジスタ、バス、ＳＩＭＤ（Single Instruction Multiple Data）、パイプラインなど構造）を定義したものである。 A program to be executed on a computer is generated by converting a source program written in a high-level language into an executable code string using a compiler. The computer that runs the compiler produces a program that is optimized for the microarchitecture of the processor that executes the program to be generated. The microarchitecture defines the logical structure of a processor circuit (structures such as registers, buses, SIMD (Single Instruction Multiple Data), and pipelines).

なおＳＩＭＤとは、１つの命令を同時に複数のデータに適用することで並列処理を実現する命令の実行方式である。ＳＩＭＤによる並列処理の実行を指示する命令をＳＩＭＤ命令と呼ぶ。またループ処理を、ＳＩＭＤ命令を用いて並列化する処理に置き換えることを、ＳＩＭＤ化と呼ぶ。 SIMD is an instruction execution method that realizes parallel processing by applying one instruction to a plurality of data at the same time. An instruction that instructs the execution of parallel processing by SIMD is called a SIMD instruction. Replacing the loop processing with processing that parallelizes using the SIMD instruction is called SIMD conversion.

コンピュータは、ソースプログラムをコンパイルする際、ループ処理を最適化することで、プログラムの処理効率を格段に向上させることができる。例えばループ処理には、メモリ上の不連続の領域に格納された複数の要素（例えば所定長のデータ）それぞれに対して同じ演算を行う処理がある。このようなループ処理の最適化方法としては、マスク付きＳＩＭＤ命令を利用する命令列に書き換える方法と、gather命令とscatter命令を利用する命令列に書き換える方法とがある。以下、gather命令とscatter命令との両方を指して、gather／scatter命令と呼ぶことがある。 When compiling a source program, a computer can significantly improve the processing efficiency of the program by optimizing the loop processing. For example, in the loop process, there is a process of performing the same operation for each of a plurality of elements (for example, data having a predetermined length) stored in a discontinuous area on the memory. As a method of optimizing such loop processing, there are a method of rewriting to an instruction sequence using a masked SIMD instruction and a method of rewriting to an instruction sequence using a gather instruction and a scatter instruction. Hereinafter, both the gather instruction and the scatter instruction may be referred to as a gather / scatter instruction.

マスク付きＳＩＭＤ命令とは、ＳＩＭＤ命令用のレジスタ（ＳＩＭＤレジスタ）に格納された複数の要素のうち、マスクデータによって指定された一部の要素に対してのみ演算を並列で実行することを指示する命令である。マスク付きＳＩＭＤ命令を用いれば、不連続の領域に格納された複数の要素であっても、ＳＩＭＤによる並列処理が可能となる。 The masked SIMD instruction indicates that the operation is executed in parallel only for some of the elements specified by the mask data among the plurality of elements stored in the SIMD instruction register (SIMD register). It is an instruction. By using the masked SIMD instruction, parallel processing by SIMD is possible even for a plurality of elements stored in a discontinuous area.

gather命令とは、メモリ上の不連続の領域に格納された複数の要素を、１つのＳＩＭＤレジスタに格納する命令である。scatter命令とは、１つのレジスタに格納された複数の要素を、メモリ上の不連続の領域に書き出す命令である。gather命令により、メモリ上の不連続の領域から複数の要素をＳＩＭＤレジスタに格納し、それらの要素にＳＩＭＤ命令により並列で同じ演算を実行し、演算結果の要素をscatter命令で、メモリ上の不連続の領域に書き出すことができる。これにより、ループ処理の効率化が可能となる。 The gather instruction is an instruction for storing a plurality of elements stored in a discontinuous area on the memory in one SIMD register. The scatter instruction is an instruction to write a plurality of elements stored in one register to a discontinuous area in the memory. By the gather instruction, multiple elements are stored in the SIMD register from the discontinuous area in the memory, the same operation is executed in parallel by the SIMD instruction in those elements, and the element of the operation result is discontinuous in the memory by the scatter instruction. Can be exported to a continuous area. This makes it possible to improve the efficiency of loop processing.

コンパイラに関する技術としては、例えば、並列ＳＩＭＤネイティブ・ソース・コードの生成を自動化する方法が開示されている。また同一演算の繰り返し処理を効率的に実行させることができるコード生成方法も開示されている。さらに、生成されるオブジェクトプログラムの命令数を削減し、実行性能を向上させるコンパイル方法も開示されている。 As a technique related to a compiler, for example, a method of automating the generation of parallel SIMD native source code is disclosed. Further, a code generation method capable of efficiently executing the iterative processing of the same operation is also disclosed. Further, a compilation method for reducing the number of instructions of the generated object program and improving the execution performance is also disclosed.

特開２０１０−１８６４６８号公報Japanese Unexamined Patent Publication No. 2010-186468 特開２０１８−１２４８７７号公報JP-A-2018-124877 特開２０１８−４９４６１号公報JP-A-2018-49461

マスク付きＳＩＭＤ命令を利用するループ処理では、ＳＩＭＤレジスタに格納された要素のうち、マスクされていない一部の要素に対して同じ演算を並列で実行できる。他方、gather／scatter命令を利用するループ処理では、ＳＩＭＤレジスタに格納された複数の要素すべてに対する同じ演算を並列で実行できる。そのため多くの場合、マスク付きＳＩＭＤ命令を利用する命令列を用いたループ処理よりも、gather／scatter命令を利用する命令列を用いたループ処理の方が大幅な性能向上が期待できる。 In the loop processing using the masked SIMD instruction, the same operation can be executed in parallel for some of the unmasked elements stored in the SIMD register. On the other hand, in the loop processing using the gather / scatter instruction, the same operation can be executed in parallel for all the plurality of elements stored in the SIMD register. Therefore, in many cases, a significant performance improvement can be expected in the loop processing using the instruction sequence using the gather / scatter instruction rather than the loop processing using the instruction sequence using the masked SIMD instruction.

ただし、一般的には、gather／scatter命令の実行には、マスク付きＳＩＭＤ命令を利用する際のＳＩＭＤレジスタへのload／store命令の実行よりも時間がかかる。そのため、gather／scatter命令の実行速度によっては、gather／scatter命令を利用する命令列を用いたループ処理よりもマスク付きＳＩＭＤ命令を利用するループ処理の方が、処理が速い可能性がある。 However, in general, the execution of the gather / scatter instruction takes longer than the execution of the load / store instruction to the SIMD register when using the masked SIMD instruction. Therefore, depending on the execution speed of the gather / scatter instruction, the loop processing using the masked SIMD instruction may be faster than the loop processing using the instruction sequence using the gather / scatter instruction.

従来は、マスク付きＳＩＭＤ命令を利用するループ処理とgather／scatter命令を利用するループ処理のどちらの方が効率的なのかについて適切な判断が行われておらず、ループ処理が適切な手法で最適化されていない可能性がある。 In the past, it has not been properly determined whether loop processing using a masked SIMD instruction or loop processing using a gather / scatter instruction is more efficient, and loop processing is optimal with an appropriate method. It may not have been converted.

１つの側面では、本件は、適切な手法でループ処理を最適化できるようにすることを目的とする。 On the one hand, this case aims to enable the loop processing to be optimized in an appropriate manner.

１つの案では、コンピュータに以下の処理を実行させるコンパイルプログラムが提供される。
コンピュータは、コンパイル対象のプログラムから、配列内の一部の要素に対する演算を指示するループ命令列を抽出する。次にコンピュータは、ループ命令列に基づいて、配列内の複数の要素のうち演算対象の要素のみを有効にするマスクを用い、マスクで有効とした１以上の要素に対してＳＩＭＤ命令によって並列で演算を実行させる処理を繰り返すループ処理が記述された第１ＳＩＭＤ化命令列を生成する。またコンピュータは、ループ命令列に基づいて、配列内の不連続の複数の要素をＳＩＭＤレジスタ内の連続の領域へ格納して、ＳＩＭＤレジスタ内の複数の要素に対してＳＩＭＤ命令により並列で演算を実行する処理を繰り返すループ処理が記述された第２ＳＩＭＤ化命令列を生成する。さらにコンピュータは、複数の命令それぞれの実行に要する時間を表す数値が設定された実行時間情報に基づいて、第１ＳＩＭＤ化命令列の処理時間と第２ＳＩＭＤ化命令列の処理時間とを算出する。そしてコンピュータは、第１ＳＩＭＤ化命令列の処理時間と第２ＳＩＭＤ化命令列の処理時間とに基づいて、第１ＳＩＭＤ化命令列と第２ＳＩＭＤ化命令列とのうち、コンパイル後のプログラムに含める命令列を選択する。 One idea is to provide a compilation program that causes the computer to perform the following operations:
The computer extracts from the program to be compiled a loop instruction sequence that directs operations on some elements in the array. Next, the computer uses a mask that enables only the element to be calculated out of the multiple elements in the array based on the loop instruction sequence, and in parallel with the SIMD instruction for one or more elements enabled by the mask. A first SIMD instruction sequence is generated in which a loop process for repeating a process for executing an operation is described. In addition, the computer stores a plurality of discontinuous elements in the array in a continuous area in the SIMD register based on the loop instruction sequence, and performs operations in parallel with the SIMD instruction for the plurality of elements in the SIMD register. A second SIMD instruction sequence is generated in which a loop process that repeats the process to be executed is described. Further, the computer calculates the processing time of the first SIMD instruction sequence and the processing time of the second SIMD instruction sequence based on the execution time information in which the numerical value indicating the time required for executing each of the plurality of instructions is set. Then, the computer determines the instruction sequence to be included in the compiled program from the first SIMD instruction sequence and the second SIMD instruction sequence based on the processing time of the first SIMD instruction sequence and the processing time of the second SIMD instruction sequence. select.

１態様によれば、適切な手法でループ処理を最適化できる。 According to one aspect, the loop processing can be optimized by an appropriate method.

第１の実施の形態に係る情報処理装置によるコンパイル方法の一例を示す図である。It is a figure which shows an example of the compilation method by the information processing apparatus which concerns on 1st Embodiment. コンパイルの実行に用いるコンピュータのハードウェアの一構成例を示す図である。It is a figure which shows one configuration example of the hardware of the computer used for executing compilation. コンピュータの機能を示すブロック図である。It is a block diagram which shows the function of a computer. 命令スケジューリング用テーブルの一例を示す図である。It is a figure which shows an example of the instruction scheduling table. ループ処理を記述する命令列の例を示す図である。It is a figure which shows the example of the instruction sequence which describes the loop processing. マスク付きＳＩＭＤ命令を利用するループ処理の内容を示す図である。It is a figure which shows the content of the loop processing which uses the masked SIMD instruction. gather／scatter命令を利用するループ処理の内容を示す図である。It is a figure which shows the content of the loop processing which uses the gather / scatter instruction. 採用するループ命令列の判定例を示す図である。It is a figure which shows the determination example of the loop instruction sequence to be adopted. ループ最適化処理の手順の一例を示すフローチャートである。It is a flowchart which shows an example of the procedure of a loop optimization process. ループタイプ判定処理の手順の一例を示すフローチャートである。It is a flowchart which shows an example of the procedure of a loop type determination process. 等価ループ生成処理の手順の一例を示すフローチャートである。It is a flowchart which shows an example of the procedure of the equivalent loop generation processing. 「ＴＹＰＥ１」のオリジナルループから「ＴＹＰＥ２」の等価ループへの変換例を示す図である。It is a figure which shows the conversion example from the original loop of "TYPE1" to the equivalent loop of "TYPE2". 「ＴＹＰＥ２」のオリジナルループから「ＴＹＰＥ１」の等価ループへの変換例を示す図である。It is a figure which shows the conversion example from the original loop of "TYPE2" to the equivalent loop of "TYPE1". 命令スケジューリング処理の手順の一例を示すフローチャートである。It is a flowchart which shows an example of the procedure of instruction scheduling processing. ループコスト算出処理の手順の一例を示すフローチャートである。It is a flowchart which shows an example of the procedure of a loop cost calculation process. マスク付きＳＩＭＤ命令におけるマスクの固定値化の例を示す図である。It is a figure which shows the example of the fixed value of a mask in the SIMD instruction with a mask. 偶数判定の場合のループ命令列の一例を示す図である。It is a figure which shows an example of the loop instruction sequence in the case of an even number determination. ストライド幅が「３」の場合のループ命令列の一例を示す図である。It is a figure which shows an example of the loop instruction sequence when the stride width is "3". ストライド幅が「４」の場合のループ命令列の一例を示す図である。It is a figure which shows an example of the loop instruction sequence when the stride width is "4".

以下、本実施の形態について図面を参照して説明する。なお各実施の形態は、矛盾のない範囲で複数の実施の形態を組み合わせて実施することができる。
〔第１の実施の形態〕
図１は、第１の実施の形態に係る情報処理装置によるコンパイル方法の一例を示す図である。情報処理装置１０は、例えばコンパイルプログラムを実行することにより、図１に示すコンパイル方法を実施することができる。 Hereinafter, the present embodiment will be described with reference to the drawings. It should be noted that each embodiment can be implemented by combining a plurality of embodiments within a consistent range.
[First Embodiment]
FIG. 1 is a diagram showing an example of a compilation method by the information processing apparatus according to the first embodiment. The information processing apparatus 10 can implement the compilation method shown in FIG. 1 by executing, for example, a compilation program.

情報処理装置１０は、記憶部１１と処理部１２とを有する。記憶部１１は、例えば情報処理装置１０が有するメモリ、またはストレージ装置である。処理部１２は、例えば情報処理装置１０が有するプロセッサ、または演算回路である。 The information processing device 10 has a storage unit 11 and a processing unit 12. The storage unit 11 is, for example, a memory or a storage device included in the information processing device 10. The processing unit 12 is, for example, a processor or an arithmetic circuit included in the information processing device 10.

記憶部１１は、例えばコンパイル対象のプログラム１１ａと実行時間情報１１ｂとを記憶する。プログラム１１ａは、例えば高級言語で命令が記述されたソースファイルである。実行時間情報１１ｂは、複数の命令それぞれの実行に要する時間を表す数値が設定されている。実行に要する時間を表す数値は、例えば、複数の命令それぞれの実行に要するプロセッサのクロックのサイクル数である。この場合、ループ処理の処理時間は、プロセッサのクロックの１サイクルの時間を単位時間として表されることとなる。 The storage unit 11 stores, for example, the program 11a to be compiled and the execution time information 11b. The program 11a is, for example, a source file in which instructions are described in a high-level language. The execution time information 11b is set with a numerical value representing the time required to execute each of the plurality of instructions. The numerical value representing the time required for execution is, for example, the number of clock cycles of the processor required to execute each of the plurality of instructions. In this case, the processing time of the loop processing is represented by the time of one cycle of the clock of the processor as a unit time.

処理部１２は、プログラム１１ａをコンパイルする処理の中で、以下のループ最適化処理を実行する。
処理部１２は、プログラム１１ａから、配列内の一部の要素に対する演算を指示するループ命令列１を抽出する（ステップＳ１）。ここでループ命令列１に示されるループ処理のループ回転数（繰り返し回数）をｎ（ｎは１以上の整数）とする。 The processing unit 12 executes the following loop optimization processing in the processing of compiling the program 11a.
The processing unit 12 extracts from the program 11a a loop instruction sequence 1 that instructs an operation on some elements in the array (step S1). Here, the loop rotation speed (number of repetitions) of the loop processing shown in the loop instruction sequence 1 is set to n (n is an integer of 1 or more).

次に処理部１２は、ループ命令列１に基づいて、第１ＳＩＭＤ化命令列２を生成する（ステップＳ２）。第１ＳＩＭＤ化命令列２は、配列内の複数の要素のうち演算対象の要素のみを有効にするマスクを用い、マスクで有効とした１以上の要素に対してＳＩＭＤ命令によって並列で演算を実行させる処理を繰り返すループ処理が記述された命令列である。第１ＳＩＭＤ化命令列２に示されるループ処理内には、例えば４つの命令「命令＃１１」〜「命令＃１４」が含まれる。「命令＃１１」は、マスクを生成する命令である。「命令＃１２」は、マスク付きで要素をロードする命令である。「命令＃１３」は、マスク付きでＳＩＭＤ演算を行う命令である。「命令＃１４」は、マスク付きで演算結果をストアする命令である。 Next, the processing unit 12 generates the first SIMD instruction sequence 2 based on the loop instruction sequence 1 (step S2). The first SIMD instruction sequence 2 uses a mask that enables only the element to be calculated among a plurality of elements in the array, and causes one or more elements enabled by the mask to execute operations in parallel by the SIMD instruction. This is an instruction sequence in which loop processing that repeats processing is described. For example, four instructions "instruction # 11" to "instruction # 14" are included in the loop processing shown in the first SIMD instruction sequence 2. "Instruction # 11" is an instruction to generate a mask. "Instruction # 12" is an instruction to load an element with a mask. "Instruction # 13" is an instruction that performs a SIMD operation with a mask. "Instruction # 14" is an instruction to store the operation result with a mask.

次に処理部１２は、ループ命令列１に基づいて、第２ＳＩＭＤ化命令列３を生成する（ステップＳ３）。第２ＳＩＭＤ化命令列３は、配列内の不連続の複数の要素をＳＩＭＤレジスタ内の連続の領域へ格納して、ＳＩＭＤレジスタ内の複数の要素に対してＳＩＭＤ命令により並列で演算を実行する処理を繰り返すループ処理が記述された命令列である。第２ＳＩＭＤ化命令列３に示されるループ処理内には、例えば４つの命令「命令＃２１」〜「命令＃２４」が含まれる。「命令＃２１」は、処理対象のインデックス値のリストの作成命令である。「命令＃２２」は、配列内の不連続の要素をＳＩＭＤレジスタ内の連続の記憶領域へロードするgather命令である。「命令＃２３」は、ＳＩＭＤ演算を行う命令である。「命令＃２４」は、演算結果を配列の不連続の要素の値としてストアするscatter命令である。 Next, the processing unit 12 generates the second SIMD instruction sequence 3 based on the loop instruction sequence 1 (step S3). The second SIMD instruction sequence 3 stores a plurality of discontinuous elements in the array in a continuous area in the SIMD register, and executes operations in parallel for the plurality of elements in the SIMD register by the SIMD instruction. This is an instruction sequence in which loop processing that repeats is described. For example, four instructions "instruction # 21" to "instruction # 24" are included in the loop processing shown in the second SIMD instruction sequence 3. "Instruction # 21" is an instruction for creating a list of index values to be processed. "Instruction # 22" is a gather instruction that loads discontinuous elements in the array into the continuous storage area in the SIMD register. "Instruction # 23" is an instruction that performs a SIMD operation. "Instruction # 24" is a scatter instruction that stores the operation result as the value of the discontinuous element of the array.

次に処理部１２は、複数の命令それぞれの実行に要するサイクル数が設定された実行時間情報１１ｂに基づいて、第１ＳＩＭＤ化命令列２の処理時間と第２ＳＩＭＤ化命令列３の処理時間とを算出する（ステップＳ４）。例えば処理部１２は、実行時間情報１１ｂを参照し、第１ＳＩＭＤ化命令列２と第２ＳＩＭＤ化命令列３とのそれぞれに含まれる命令の実行に要するサイクル数を認識する。そして処理部１２は、第１ＳＩＭＤ化命令列２のループ処理１回転当りの処理の実行に要するサイクル数に、第１ＳＩＭＤ化命令列２のループ回転数を乗算した値を、第１ＳＩＭＤ化命令列２の処理時間とする。同様に処理部１２は、第２ＳＩＭＤ化命令列３のループ処理１回転当りの処理の実行に要するサイクル数に、第２ＳＩＭＤ化命令列３のループ回転数を乗算した値を、第２ＳＩＭＤ化命令列３の処理時間とする。 Next, the processing unit 12 sets the processing time of the first SIMD instruction sequence 2 and the processing time of the second SIMD instruction sequence 3 based on the execution time information 11b in which the number of cycles required for executing each of the plurality of instructions is set. Calculate (step S4). For example, the processing unit 12 refers to the execution time information 11b and recognizes the number of cycles required to execute the instructions included in the first SIMD instruction sequence 2 and the second SIMD instruction sequence 3. Then, the processing unit 12 multiplies the number of cycles required to execute the processing per rotation of the loop processing of the first SIMD instruction sequence 2 by the loop rotation number of the first SIMD instruction sequence 2 to obtain the value obtained by the first SIMD instruction sequence 2 The processing time of. Similarly, the processing unit 12 multiplies the number of cycles required to execute the processing per rotation of the loop processing of the second SIMD instruction sequence 3 by the loop rotation number of the second SIMD instruction sequence 3 to obtain the second SIMD instruction sequence. The processing time is 3.

なお処理部１２は、例えばループ命令列１におけるループ処理の回転数を、ＳＩＭＤレジスタに格納可能な要素数で除算した値を、第１ＳＩＭＤ化命令列２のループ回転数とする。また処理部１２は、ループ命令列１におけるループ処理の回転数を、ＳＩＭＤレジスタに格納可能な要素数と演算対象の要素を示すインデックス値のループ処理１回転ごとの増分値とで除算した値を、第２ＳＩＭＤ化命令列３のループ回転数とする。 The processing unit 12 uses, for example, the loop rotation speed of the first SIMD instruction sequence 2 as a value obtained by dividing the rotation speed of the loop processing in the loop instruction sequence 1 by the number of elements that can be stored in the SIMD register. Further, the processing unit 12 divides the number of rotations of the loop processing in the loop instruction sequence 1 by the number of elements that can be stored in the SIMD register and the increment value for each rotation of the loop processing of the index value indicating the element to be calculated. , The loop rotation speed of the second SIMD instruction sequence 3.

そして処理部１２は、第１ＳＩＭＤ化命令列２の処理時間と第２ＳＩＭＤ化命令列３の処理時間に基づいて、第１ＳＩＭＤ化命令列２と第２ＳＩＭＤ化命令列３とのうち、コンパイル後のプログラムに含める命令列を選択する（ステップＳ５）。例えば処理部１２は、処理時間が短い方の命令列を、コンパイル後のプログラムに含める命令列として選択する。 Then, the processing unit 12 is a program after compilation of the first SIMD instruction sequence 2 and the second SIMD instruction sequence 3 based on the processing time of the first SIMD instruction sequence 2 and the processing time of the second SIMD instruction sequence 3. The instruction sequence to be included in is selected (step S5). For example, the processing unit 12 selects the instruction sequence having the shorter processing time as the instruction sequence to be included in the compiled program.

図１の例では、第１ＳＩＭＤ化命令列２では、ループ処理の１回転の処理に「命令＃１１」〜「命令＃１４」の４つの命令が含まれる。「命令＃１１」の実行に要するサイクル数は「４」であり、「命令＃１２」の実行に要するサイクル数は「１０」であり、「命令＃１３」の実行に要するサイクル数は「９」であり、「命令＃１４」の実行に要するサイクル数は「１０」である。ターゲットとするマイクロアーキテクチャにおいて、ＳＩＭＤレジスタに格納可能な要素数が「８」の場合、マスク付きＳＩＭＤ命令を利用する場合のループ回転数は「ｎ／８」となる。すると第１ＳＩＭＤ化命令列２の実行に要する時間は、「４．１ｎ（ｃｙｃｌｅｓ）」となる。 In the example of FIG. 1, in the first SIMD instruction sequence 2, four instructions "instruction # 11" to "instruction # 14" are included in the processing of one rotation of the loop processing. The number of cycles required to execute "instruction # 11" is "4", the number of cycles required to execute "instruction # 12" is "10", and the number of cycles required to execute "instruction # 13" is "9". , And the number of cycles required to execute "instruction # 14" is "10". In the target microarchitecture, when the number of elements that can be stored in the SIMD register is "8", the loop rotation speed when using the masked SIMD instruction is "n / 8". Then, the time required to execute the first SIMD instruction sequence 2 becomes "4.1n (cycles)".

また第２ＳＩＭＤ化命令列３では、ループ処理の１回転の処理に「命令＃２１」〜「命令＃２４」の４つの命令が含まれる。「命令＃２１」の実行に要するサイクル数は「４」であり、「命令＃２２」の実行に要するサイクル数は「２０」であり、「命令＃２３」の実行に要するサイクル数は「９」であり、「命令＃２４」の実行に要するサイクル数は「４０」である。ここでＳＩＭＤレジスタに格納可能な要素数が「８」であり、ループ命令列１におけるループ制御変数のループ回転ごとの増分値が「２」であるものとする。この場合、gather／scatter命令を利用する場合のループ回転数は「ｎ／１６」となる。すると第２ＳＩＭＤ化命令列３の実行に要する時間は、「４．６ｎ（ｃｙｃｌｅｓ）」となる。 Further, in the second SIMD instruction sequence 3, four instructions "instruction # 21" to "instruction # 24" are included in the processing of one rotation of the loop processing. The number of cycles required to execute "instruction # 21" is "4", the number of cycles required to execute "instruction # 22" is "20", and the number of cycles required to execute "instruction # 23" is "9". , And the number of cycles required to execute "instruction # 24" is "40". Here, it is assumed that the number of elements that can be stored in the SIMD register is "8", and the increment value of the loop control variable in the loop instruction sequence 1 for each loop rotation is "2". In this case, the loop rotation speed when using the gather / scatter instruction is "n / 16". Then, the time required to execute the second SIMD instruction sequence 3 is "4.6n (cycles)".

この場合、第１ＳＩＭＤ化命令列２の方が第２ＳＩＭＤ化命令列３よりも実行に要する時間が短くなる。従って処理部１２は、コンパイル後のプログラムに含める命令列として、第１ＳＩＭＤ化命令列２を選択する。すなわちマスク付きＳＩＭＤ命令を利用するループ命令列が選択される。 In this case, the time required for execution of the first SIMD instruction sequence 2 is shorter than that of the second SIMD instruction sequence 3. Therefore, the processing unit 12 selects the first SIMD instruction sequence 2 as the instruction sequence to be included in the compiled program. That is, a loop instruction sequence that uses the masked SIMD instruction is selected.

このような情報処理装置１０によれば、プログラム１１ａをコンパイルする際に、ＳＩＭＤ命令を用いた演算の並列化が可能なループ命令列１について、より処理効率が高い手法で最適化することができる。 According to such an information processing apparatus 10, when compiling the program 11a, it is possible to optimize the loop instruction sequence 1 capable of parallelizing operations using SIMD instructions by a method with higher processing efficiency. ..

なお記憶部１１には、複数のマイクロアーキテクチャそれぞれに対応する実行時間情報１１ｂを記憶させておくことができる。この場合、処理部１２は、命令列を選択する際に、コンパイル時に指定されたマイクロアーキテクチャに対応する実行時間情報を参照して、第１ＳＩＭＤ化命令列２と第２ＳＩＭＤ化命令列３との処理時間を算出する。これにより、マイクロアーキテクチャに応じた適切な処理時間を算出することができる。 The storage unit 11 can store the execution time information 11b corresponding to each of the plurality of microarchitectures. In this case, when selecting the instruction sequence, the processing unit 12 refers to the execution time information corresponding to the microarchitecture specified at compile time, and processes the first SIMD instruction sequence 2 and the second SIMD instruction sequence 3. Calculate the time. This makes it possible to calculate an appropriate processing time according to the microarchitecture.

なお処理部１２は、第１ＳＩＭＤ化命令列２を生成する際、マスクの生成処理をループ処理の開始前に実行させるような第１ＳＩＭＤ化命令列２を生成することができる。これにより第１ＳＩＭＤ化命令列２の実行に要する処理時間がさらに短くなる。 When generating the first SIMD instruction sequence 2, the processing unit 12 can generate the first SIMD instruction sequence 2 that executes the mask generation process before the start of the loop process. As a result, the processing time required to execute the first SIMD instruction sequence 2 is further shortened.

また処理部１２は、処理時間の比較をするまでもなく第２ＳＩＭＤ化命令列３の方が、処理時間が短くなることが分かっている場合、一部の処理の実行を抑止することができる。例えば処理部１２は、ループ命令列１の１回のループ処理ごとのループ制御変数の増分値が所定の閾値を超えるか否かを判断する。処理部１２は、増分値が閾値を超える場合、第１ＳＩＭＤ化命令列２の生成処理、および処理時間の算出処理の実行を抑止する。そして処理部１２は、第１ＳＩＭＤ化命令列２の処理時間と第２ＳＩＭＤ化命令列３の処理時間との比較をせずに、コンパイル後のプログラムに含める命令列として第２ＳＩＭＤ化命令列３を選択する。 Further, the processing unit 12 can suppress the execution of a part of the processing when it is known that the processing time of the second SIMD instruction sequence 3 is shorter without comparing the processing times. For example, the processing unit 12 determines whether or not the increment value of the loop control variable for each loop processing of the loop instruction sequence 1 exceeds a predetermined threshold value. When the incremental value exceeds the threshold value, the processing unit 12 suppresses the execution of the generation processing of the first SIMD instruction sequence 2 and the processing time calculation processing. Then, the processing unit 12 selects the second SIMD instruction sequence 3 as the instruction sequence to be included in the compiled program without comparing the processing time of the first SIMD instruction sequence 2 with the processing time of the second SIMD instruction sequence 3. To do.

このようにループ制御変数の増分値が十分に大きいと、第２ＳＩＭＤ化命令列３におけるループ回転数が第１ＳＩＭＤ化命令列２に比べて非常に少なくなる。そのため、第２ＳＩＭＤ化命令列３の方が、処理時間が短くなる。このような場合、第１ＳＩＭＤ化命令列２の生成などの処理をせずに、コンパイル後のプログラムに含める命令列として第２ＳＩＭＤ化命令列３を選択することで、コンパイル処理の効率化を図ることができる。 When the increment value of the loop control variable is sufficiently large in this way, the loop rotation speed in the second SIMD instruction sequence 3 becomes much smaller than that in the first SIMD instruction sequence 2. Therefore, the processing time of the second SIMD instruction sequence 3 is shorter. In such a case, the efficiency of the compilation process can be improved by selecting the second SIMD instruction sequence 3 as the instruction sequence to be included in the compiled program without performing the process such as generating the first SIMD instruction sequence 2. Can be done.

〔第２の実施の形態〕
次に第２の実施の形態について説明する。第２の実施の形態は、Ｃ，Ｃ＋＋，Ｆｏｒｔｒａｎを代表とする高級言語を、ターゲットのマシン向けのより高速なオブジェクト（実行モジュール、オブジェクトファイル、アセンブリ言語のプログラムファイル）に最適化するコンパイル技術である。 [Second Embodiment]
Next, the second embodiment will be described. The second embodiment is a compilation technique that optimizes high-level languages such as C, C ++, and Fortran to faster objects (execution modules, object files, assembly language program files) for the target machine. is there.

図２は、コンパイルの実行に用いるコンピュータのハードウェアの一構成例を示す図である。コンピュータ１００は、プロセッサ１０１によって装置全体が制御されている。プロセッサ１０１には、バス１０９を介してメモリ１０２と複数の周辺機器が接続されている。プロセッサ１０１は、マルチプロセッサであってもよい。プロセッサ１０１は、例えばＣＰＵ（Central Processing Unit）、ＭＰＵ（Micro Processing Unit）、またはＤＳＰ（Digital Signal Processor）である。プロセッサ１０１がプログラムを実行することで実現する機能の少なくとも一部を、ＡＳＩＣ（Application Specific Integrated Circuit）、ＰＬＤ（Programmable Logic Device）などの電子回路で実現してもよい。 FIG. 2 is a diagram showing a configuration example of computer hardware used for executing compilation. In the computer 100, the entire device is controlled by the processor 101. A memory 102 and a plurality of peripheral devices are connected to the processor 101 via a bus 109. The processor 101 may be a multiprocessor. The processor 101 is, for example, a CPU (Central Processing Unit), an MPU (Micro Processing Unit), or a DSP (Digital Signal Processor). At least a part of the functions realized by the processor 101 executing a program may be realized by an electronic circuit such as an ASIC (Application Specific Integrated Circuit) or a PLD (Programmable Logic Device).

またプロセッサ１０１は、ＳＩＭＤレジスタ群１０１ａを有している。ＳＩＭＤレジスタ群１０１ａは、ＳＩＭＤ拡張命令を格納できるだけのデータ幅を有するレジスタの集合である。プロセッサ１０１がＳＩＭＤレジスタ群１０１ａを有していることで、コンピュータ１００が、プロセッサ１０１をターゲットマシンとしたコンパイルを実施する場合には、ループ処理を、ＳＩＭＤレジスタを用いた並列処理に最適化することができる。 Further, the processor 101 has a SIMD register group 101a. The SIMD register group 101a is a set of registers having a data width sufficient to store SIMD extension instructions. Since the processor 101 has the SIMD register group 101a, when the computer 100 performs compilation with the processor 101 as the target machine, the loop processing is optimized for parallel processing using the SIMD registers. Can be done.

メモリ１０２は、コンピュータ１００の主記憶装置として使用される。メモリ１０２には、プロセッサ１０１に実行させるＯＳ（Operating System）のプログラムやアプリケーションプログラムの少なくとも一部が一時的に格納される。また、メモリ１０２には、プロセッサ１０１による処理に利用する各種データが格納される。メモリ１０２としては、例えばＲＡＭ（Random Access Memory）などの揮発性の半導体記憶装置が使用される。 The memory 102 is used as the main storage device of the computer 100. At least a part of an OS (Operating System) program or an application program to be executed by the processor 101 is temporarily stored in the memory 102. Further, various data used for processing by the processor 101 are stored in the memory 102. As the memory 102, for example, a volatile semiconductor storage device such as a RAM (Random Access Memory) is used.

バス１０９に接続されている周辺機器としては、ストレージ装置１０３、グラフィック処理装置１０４、入力インタフェース１０５、光学ドライブ装置１０６、機器接続インタフェース１０７およびネットワークインタフェース１０８がある。 Peripheral devices connected to the bus 109 include a storage device 103, a graphic processing device 104, an input interface 105, an optical drive device 106, a device connection interface 107, and a network interface 108.

ストレージ装置１０３は、内蔵した記録媒体に対して、電気的または磁気的にデータの書き込みおよび読み出しを行う。ストレージ装置１０３は、コンピュータの補助記憶装置として使用される。ストレージ装置１０３には、ＯＳのプログラム、アプリケーションプログラム、および各種データが格納される。なお、ストレージ装置１０３としては、例えばＨＤＤ（Hard Disk Drive）やＳＳＤ（Solid State Drive）を使用することができる。 The storage device 103 electrically or magnetically writes and reads data to and from the built-in recording medium. The storage device 103 is used as an auxiliary storage device for a computer. The storage device 103 stores an OS program, an application program, and various data. As the storage device 103, for example, an HDD (Hard Disk Drive) or an SSD (Solid State Drive) can be used.

グラフィック処理装置１０４には、モニタ２１が接続されている。グラフィック処理装置１０４は、プロセッサ１０１からの命令に従って、画像をモニタ２１の画面に表示させる。モニタ２１としては、有機ＥＬ（Electro Luminescence）を用いた表示装置や液晶表示装置などがある。 A monitor 21 is connected to the graphic processing device 104. The graphic processing device 104 causes the image to be displayed on the screen of the monitor 21 in accordance with the instruction from the processor 101. The monitor 21 includes a display device using an organic EL (Electro Luminescence), a liquid crystal display device, and the like.

入力インタフェース１０５には、キーボード２２とマウス２３とが接続されている。入力インタフェース１０５は、キーボード２２やマウス２３から送られてくる信号をプロセッサ１０１に送信する。なお、マウス２３は、ポインティングデバイスの一例であり、他のポインティングデバイスを使用することもできる。他のポインティングデバイスとしては、タッチパネル、タブレット、タッチパッド、トラックボールなどがある。 A keyboard 22 and a mouse 23 are connected to the input interface 105. The input interface 105 transmits signals sent from the keyboard 22 and the mouse 23 to the processor 101. The mouse 23 is an example of a pointing device, and other pointing devices can also be used. Other pointing devices include touch panels, tablets, touchpads, trackballs and the like.

光学ドライブ装置１０６は、レーザ光などを利用して、光ディスク２４に記録されたデータの読み取りを行う。光ディスク２４は、光の反射によって読み取り可能なようにデータが記録された可搬型の記録媒体である。光ディスク２４には、ＤＶＤ（Digital Versatile Disc）、ＤＶＤ−ＲＡＭ、ＣＤ−ＲＯＭ（Compact Disc Read Only Memory）、ＣＤ−Ｒ（Recordable）／ＲＷ（ReWritable）などがある。 The optical drive device 106 reads the data recorded on the optical disk 24 by using a laser beam or the like. The optical disk 24 is a portable recording medium on which data is recorded so that it can be read by reflection of light. The optical disk 24 includes a DVD (Digital Versatile Disc), a DVD-RAM, a CD-ROM (Compact Disc Read Only Memory), a CD-R (Recordable) / RW (ReWritable), and the like.

機器接続インタフェース１０７は、コンピュータ１００に周辺機器を接続するための通信インタフェースである。例えば機器接続インタフェース１０７には、メモリ装置２５やメモリリーダライタ２６を接続することができる。メモリ装置２５は、機器接続インタフェース１０７との通信機能を搭載した記録媒体である。メモリリーダライタ２６は、メモリカード２７へのデータの書き込み、またはメモリカード２７からのデータの読み出しを行う装置である。メモリカード２７は、カード型の記録媒体である。 The device connection interface 107 is a communication interface for connecting peripheral devices to the computer 100. For example, a memory device 25 or a memory reader / writer 26 can be connected to the device connection interface 107. The memory device 25 is a recording medium equipped with a communication function with the device connection interface 107. The memory reader / writer 26 is a device that writes data to or reads data from the memory card 27. The memory card 27 is a card-type recording medium.

ネットワークインタフェース１０８は、ネットワーク２０に接続されている。ネットワークインタフェース１０８は、ネットワーク２０を介して、他のコンピュータまたは通信機器との間でデータの送受信を行う。 The network interface 108 is connected to the network 20. The network interface 108 transmits / receives data to / from another computer or communication device via the network 20.

コンピュータ１００は、以上のようなハードウェア構成によって、第２の実施の形態の処理機能を実現することができる。なお、第１の実施の形態に示した情報処理装置１０も、図３に示したコンピュータ１００と同様のハードウェアにより実現することができる。 The computer 100 can realize the processing function of the second embodiment by the hardware configuration as described above. The information processing device 10 shown in the first embodiment can also be realized by the same hardware as the computer 100 shown in FIG.

コンピュータ１００は、例えばコンピュータ読み取り可能な記録媒体に記録されたプログラムを実行することにより、第２の実施の形態の処理機能を実現する。コンピュータ１００に実行させる処理内容を記述したプログラムは、様々な記録媒体に記録しておくことができる。例えば、コンピュータ１００に実行させるプログラムをストレージ装置１０３に格納しておくことができる。プロセッサ１０１は、ストレージ装置１０３内のプログラムの少なくとも一部をメモリ１０２にロードし、プログラムを実行する。またコンピュータ１００に実行させるプログラムを、光ディスク２４、メモリ装置２５、メモリカード２７などの可搬型記録媒体に記録しておくこともできる。可搬型記録媒体に格納されたプログラムは、例えばプロセッサ１０１からの制御により、ストレージ装置１０３にインストールされた後、実行可能となる。またプロセッサ１０１が、可搬型記録媒体から直接プログラムを読み出して実行することもできる。 The computer 100 realizes the processing function of the second embodiment, for example, by executing a program recorded on a computer-readable recording medium. A program that describes the processing content to be executed by the computer 100 can be recorded on various recording media. For example, a program to be executed by the computer 100 can be stored in the storage device 103. The processor 101 loads at least a part of the program in the storage device 103 into the memory 102 and executes the program. Further, the program to be executed by the computer 100 can be recorded on a portable recording medium such as an optical disk 24, a memory device 25, and a memory card 27. The program stored in the portable recording medium can be executed after being installed in the storage device 103 under the control of the processor 101, for example. The processor 101 can also read and execute the program directly from the portable recording medium.

次に、コンピュータ１００がプログラムをコンパイルするために利用する機能について説明する。
図３は、コンピュータの機能を示すブロック図である。コンピュータ１００は、記憶部１１０とコンパイラ１２０とを有する。 Next, the functions used by the computer 100 to compile the program will be described.
FIG. 3 is a block diagram showing the functions of the computer. The computer 100 has a storage unit 110 and a compiler 120.

記憶部１１０は、コンパイルに利用するデータと、コンパイルによって生成されたデータとを記憶する。例えば記憶部１１０には、複数の命令スケジューリング用テーブル１１１ａ，１１１ｂ，・・・、ソースファイル１１２、オブジェクトファイル１１３、実行プログラムモジュール１１４、アセンブリプログラムファイル１１５などが格納される。記憶部１１０は、例えばメモリ１０２またはストレージ装置１０３の記憶領域の少なくとも一部を用いて実現される。 The storage unit 110 stores the data used for compilation and the data generated by compilation. For example, the storage unit 110 stores a plurality of instruction scheduling tables 111a, 111b, ..., Source files 112, object files 113, execution program modules 114, assembly program files 115, and the like. The storage unit 110 is realized by using, for example, at least a part of the storage area of the memory 102 or the storage device 103.

複数の命令スケジューリング用テーブル１１１ａ，１１１ｂ，・・・は、それぞれ異なるマイクロアーキテクチャに対応する。命令スケジューリング用テーブル１１１ａ，１１１ｂ，・・・は、対応するマイクロアーキテクチャで使用可能な命令のレイテンシなどの情報が登録されたデータテーブルである。命令スケジューリング用テーブル１１１ａ，１１１ｂ，・・・は、第１の実施の形態に示した実行時間情報１１ｂの一例である。 The plurality of instruction scheduling tables 111a, 111b, ... Correspond to different microarchitectures. The instruction scheduling tables 111a, 111b, ... Are data tables in which information such as instruction latency that can be used in the corresponding microarchitecture is registered. The instruction scheduling tables 111a, 111b, ... Are examples of the execution time information 11b shown in the first embodiment.

ソースファイル１１２は、Ｃ，Ｃ＋＋，Ｆｏｒｔｒａｎなどの高級言語で記述されたプログラムを含むファイルである。オブジェクトファイル１１３は、コンパイラ１２０によって生成され、機械語で命令が記述されたプログラムファイルである。実行プログラムモジュール１１４は、コンパイラ１２０によって生成され、直接実行可能なように機械語の命令が記述されたプログラムファイルである。アセンブリプログラムファイル１１５は、コンパイラ１２０によって生成され、アセンブリ言語で命令が記述されたプログラムファイルである。 The source file 112 is a file containing a program written in a high-level language such as C, C ++, or Fortran. The object file 113 is a program file generated by the compiler 120 and in which instructions are described in machine language. The execution program module 114 is a program file generated by the compiler 120 and in which machine language instructions are described so that it can be directly executed. The assembly program file 115 is a program file generated by the compiler 120 and in which instructions are described in assembly language.

コンパイラ１２０は、ソースファイル１１２に記述されたソースコードを機械語に変換し、オブジェクトファイル１１３を生成する。コンパイラ１２０は、例えば解析部１２１、最適化部１２２、およびコード出力部１２３を有する。 The compiler 120 converts the source code described in the source file 112 into machine language and generates the object file 113. The compiler 120 has, for example, an analysis unit 121, an optimization unit 122, and a code output unit 123.

解析部１２１は、ソースファイル１１２に記述されているソースコードを解析する。解析部１２１は、例えば高級言語で記述されたソースコードを中間言語に変換する。
最適化部１２２は、中間言語で示される命令列の最適化を行う。例えば最適化部１２２は、ループ処理について、ＳＩＭＤ命令を用いた並列処理に変更する。最適化部１２２は、ループ処理を最適化するために、ループ最適化部１２２ａとＳＩＭＤ化部１２２ｂとを有する。 The analysis unit 121 analyzes the source code described in the source file 112. The analysis unit 121 converts, for example, a source code written in a high-level language into an intermediate language.
The optimization unit 122 optimizes the instruction sequence indicated in the intermediate language. For example, the optimization unit 122 changes the loop processing to parallel processing using the SIMD instruction. The optimizing unit 122 has a loop optimizing unit 122a and a SIMD optimizing unit 122b in order to optimize the loop processing.

ループ最適化部１２２ａは、ループ処理の最適化を行う。例えばループ最適化部１２２ａは、中間言語に変換された命令列の中から、マスク付きＳＩＭＤ命令を用いた最適化、またはgather／scatter命令を用いた最適化が可能なループ処理を記述する命令列（ループ命令列）を抽出する。次にループ最適化部１２２ａは、抽出したループ命令列について、マスク付きＳＩＭＤ命令を利用して最適化した場合のループ処理の実行時間と、gather／scatter命令を用いて最適化した場合のループ処理の実行時間とを比較する。そしてループ最適化部１２２ａは、実行時間が短くなる手法によってループ処理が実行されるように、ループ命令列を最適化する。 The loop optimization unit 122a optimizes the loop processing. For example, the loop optimizing unit 122a describes an instruction sequence that describes loop processing that can be optimized using a masked SIMD instruction or an optimization using a gather / scatter instruction from an instruction sequence converted into an intermediate language. Extract (loop instruction sequence). Next, the loop optimization unit 122a determines the execution time of the loop processing when the extracted loop instruction sequence is optimized by using the masked SIMD instruction, and the loop processing when the extracted loop instruction sequence is optimized by using the gather / scatter instruction. Compare with the execution time of. Then, the loop optimizing unit 122a optimizes the loop instruction sequence so that the loop processing is executed by the method of shortening the execution time.

ＳＩＭＤ化部１２２ｂは、ＳＩＭＤ命令で並列処理が可能なループ命令列を、ＳＩＭＤ命令を用いた処理に変換する。なおＳＩＭＤ化部１２２ｂは、ループ処理のＳＩＭＤ化を、ループ最適化部１２２ａによるループ最適化処理より後に実行する。 The SIMD conversion unit 122b converts a loop instruction sequence capable of parallel processing by the SIMD instruction into processing using the SIMD instruction. The SIMD conversion unit 122b executes SIMD conversion of the loop processing after the loop optimization processing by the loop optimization unit 122a.

コード出力部１２３は、最適化された中間言語の命令を機械語のコードに変換し、機械語で記述されたオブジェクトファイル１１３を出力する。例えばコード出力部１２３は、オブジェクトファイル１１３を、記憶部１１０に格納する。 The code output unit 123 converts the optimized intermediate language instruction into the machine language code, and outputs the object file 113 described in the machine language. For example, the code output unit 123 stores the object file 113 in the storage unit 110.

なお、図３に示した各要素（解析部１２１、最適化部１２２、ループ最適化部１２２ａ、ＳＩＭＤ化部１２２ｂ、コード出力部１２３）間を接続する線は通信経路の一部を示すものであり、図示した通信経路以外の通信経路も設定可能である。また、図３に示した各要素の機能は、例えば、その要素に対応するプログラムモジュールをコンピュータ１００に実行させることで実現することができる。 The line connecting each element (analysis unit 121, optimization unit 122, loop optimization unit 122a, SIMD conversion unit 122b, code output unit 123) shown in FIG. 3 indicates a part of the communication path. Yes, it is possible to set a communication path other than the communication path shown in the figure. Further, the function of each element shown in FIG. 3 can be realized, for example, by causing the computer 100 to execute a program module corresponding to the element.

次に、記憶部１１０に格納されている命令スケジューリング用テーブル１１１ａ，１１１ｂ，・・・について詳細に説明する。
図４は、命令スケジューリング用テーブルの一例を示す図である。命令スケジューリング用テーブル１１１ａには、中間言語命令、レイテンシ情報、パイプライン情報、および非パイプライン命令フラグの欄が設けられている。 Next, the instruction scheduling tables 111a, 111b, ... Stored in the storage unit 110 will be described in detail.
FIG. 4 is a diagram showing an example of an instruction scheduling table. The instruction scheduling table 111a is provided with columns for intermediate language instructions, latency information, pipeline information, and non-pipeline instruction flags.

中間言語命令の欄には、中間言語として用いられる命令（中間言語命令）が設定される。例えば中間言語命令の欄には、コンパイル時に使用可能なすべての中間言語命令が設定されている。 Instructions used as an intermediate language (intermediate language instructions) are set in the intermediate language instruction column. For example, in the intermediate language instruction column, all intermediate language instructions that can be used at compile time are set.

レイテンシ情報の欄には、対応する中間言語命令を実行する際のレイテンシ（命令の実行が開始されてから実行が終了するまでの待ち時間）が設定される。例えばレイテンシは、プロセッサの動作クロックのサイクル数で表される。なお中間言語命令のレイテンシは、その中間言語命令を実行する際の引数や他の条件に応じて変わることがある。例えばgather命令であれば、取得する要素間のメモリ上での間隔が広いほど、メモリの大きく異なる位置から各要素のデータを読み出すこととなり、レイテンシが長期化する。そのためレイテンシ情報の欄には、例えば中間言語命令の標準的なレイテンシ（平均、最頻値などの代表値）が設定される。 In the latency information field, the latency (waiting time from the start of instruction execution to the end of execution) when executing the corresponding intermediate language instruction is set. Latency, for example, is represented by the number of cycles of the processor's operating clock. The latency of an intermediate language instruction may change depending on the arguments used when executing the intermediate language instruction and other conditions. For example, in the case of the gather instruction, the wider the distance between the elements to be acquired in the memory, the more the data of each element is read from the positions in the memory, and the latency becomes longer. Therefore, for example, the standard latency (representative value such as average and mode) of the intermediate language instruction is set in the latency information column.

パイプライン情報の欄には、対応する中間言語命令の実行に利用する１以上の演算器の名称が設定される。なおパイプライン情報の欄において、複数の演算器の名称が「，」で接続されている場合、「，」の両側の演算器の両方を使用することを示している。例えば中間言語命令のパイプライン情報が「ＡＧＰ，ＰＲＰ」の場合、その中間言語命令の実行には、演算器「ＡＧＰ」（アドレス計算用演算器）と演算器「ＰＲＰ」（predicate用演算器）との両方が使用される。なおpredicate用演算器は、各命令について実行時に出力するか否かの制御回路である。パイプライン情報の欄において、複数の演算器の名称が「｜」で接続されている場合、「｜」の両側の演算器のいずれか一方を使用することを示している。例えば中間言語命令のパイプライン情報が「ＦＰＡ｜ＦＰＢ」の場合、その中間言語命令の実行には、演算器「ＦＰＡ」と「ＦＰＢ」（共に浮動小数点演算用演算器）のいずれか一方が使用される。 In the pipeline information column, the names of one or more arithmetic units used to execute the corresponding intermediate language instructions are set. In the pipeline information column, when the names of multiple arithmetic units are connected by ",", it is indicated that both arithmetic units on both sides of "," are used. For example, when the pipeline information of the intermediate language instruction is "AGP, PRP", the arithmetic unit "AGP" (address calculation arithmetic unit) and the arithmetic unit "PRP" (predicate arithmetic unit) are used to execute the intermediate language instruction. And both are used. The predicate arithmetic unit is a control circuit for whether or not to output each instruction at the time of execution. In the pipeline information column, when the names of a plurality of arithmetic units are connected by "|", it is indicated that one of the arithmetic units on both sides of "|" is used. For example, when the pipeline information of an intermediate language instruction is "FPA | FPB", one of the arithmetic units "FPA" and "FPB" (both are floating-point arithmetic arithmetic units) is used to execute the intermediate language instruction. Will be done.

非パイプライン命令フラグの欄には、対応する中間言語命令が非パイプライン命令か否かを示すフラグが設定される。非パイプライン命令とは、レイテンシの分だけ特定の演算器を占有する命令である。中間言語命令が非パイプライン命令である場合、非パイプライン命令フラグの欄に、非パイプライン命令であることを示すフラグ（図４では丸印で表記）が設定される。 In the non-pipeline instruction flag field, a flag indicating whether or not the corresponding intermediate language instruction is a non-pipeline instruction is set. A non-pipeline instruction is an instruction that occupies a specific arithmetic unit by the amount of latency. When the intermediate language instruction is a non-pipeline instruction, a flag indicating that it is a non-pipeline instruction (indicated by a circle in FIG. 4) is set in the non-pipeline instruction flag field.

なお、命令スケジューリング用テーブル１１１ａ以外の命令スケジューリング用テーブル１１１ｂ，・・・にも、命令スケジューリング用テーブル１１１ａと同種の情報が設定されている。なお、コンパイル時にどの命令スケジューリング用テーブルを使用するのかは、コンパイル時に指定されたマイクロアーキテクチャによって決定される。すなわちコンパイルの実行指示では、ターゲットとするマイクロアーキテクチャが指定される。ループ最適化部１２２ａは、指定されたマイクロアーキテクチャに対応する命令スケジューリング用テーブルを参照して、ループ最適化処理を実行する。 Information similar to that of the instruction scheduling table 111a is also set in the instruction scheduling tables 111b, ... Other than the instruction scheduling table 111a. Which instruction scheduling table is used at compile time is determined by the microarchitecture specified at compile time. That is, the target microarchitecture is specified in the compile execution instruction. The loop optimizing unit 122a refers to the instruction scheduling table corresponding to the specified microarchitecture and executes the loop optimizing process.

次に、コンパイラ１２０によるループ処理の最適化方式について具体的に説明する。ループ処理の最適化方式としては、マスク付きＳＩＭＤ命令を利用する最適化と、gather／scatter命令を利用する最適化とがある。 Next, the optimization method of the loop processing by the compiler 120 will be specifically described. As an optimization method for loop processing, there are an optimization using a masked SIMD instruction and an optimization using a gather / scatter instruction.

図５は、ループ処理を記述する命令列の例を示す図である。図５には、配列ｂのインデクス値が奇数の要素に、配列ｃの同じインデックス値の要素を加算し、加算結果を配列ａの同じインデックス値の要素として格納する処理の例が示されている。このような処理は、マスク付きＳＩＭＤ命令を利用するループ処理の最適化と、gather／scatter命令を利用するループ処理の最適化との両方が可能である。 FIG. 5 is a diagram showing an example of an instruction sequence describing loop processing. FIG. 5 shows an example of a process of adding an element having the same index value in the array c to an element having an odd index value in the array b and storing the addition result as an element having the same index value in the array a. .. Such processing can be both optimization of loop processing using the masked SIMD instruction and optimization of loop processing using the gather / scatter instruction.

マスク付きＳＩＭＤ命令を利用してループ処理を実行させる場合、図５の下段の左側に示すループ命令列４１が生成される。gather／scatter命令を利用してループ処理を実行させる場合、図５の下段の右側に示すループ命令列４２が生成される。 When the loop processing is executed by using the masked SIMD instruction, the loop instruction sequence 41 shown on the left side of the lower part of FIG. 5 is generated. When the loop processing is executed by using the gather / scatter instruction, the loop instruction sequence 42 shown on the lower right side of FIG. 5 is generated.

配列ｂ，配列ｃそれぞれには、ｎ個の要素があるものとする。この場合、マスク付きＳＩＭＤ命令を利用する場合のループ命令列４１は、ループ制御変数ｉを１ずつカウントアップさせながら、ｉ＝ｎとなるまでループ（ｄｏｉ＝１，ｎ）するような記述となる。なおループ命令列４１では、命令「ｉｆ（ｍｏｄ（ｉ，２）＝＝１）ｔｈｅｎ」により、インデックス値が奇数（ループ制御変数ｉの値を２で除算したときの剰余が１）の場合に、演算「ａ（ｉ）＝ｂ（ｉ）＋ｃ（ｉ）」を行うことが記述されている。この剰余を用いた条件分岐が、ＳＩＭＤ命令に付与するマスクによって実現される。 It is assumed that each of the array b and the array c has n elements. In this case, the loop instruction sequence 41 when using the masked SIMD instruction is described so as to loop (do i = 1, n) until i = n while counting up the loop control variable i by 1. Become. In the loop instruction sequence 41, when the index value is odd (the remainder when the value of the loop control variable i is divided by 2 is 1) by the instruction "if (mod (i, 2) == 1) the". , It is described that the operation "a (i) = b (i) + c (i)" is performed. Conditional branching using this remainder is realized by the mask given to the SIMD instruction.

またgather／scatter命令を利用する場合のループ命令列４２は、ループ制御変数ｉを２ずつカウントアップさせながらｉ＝ｎとなるまでループ（ｄｏｉ＝１，ｎ，２）するような記述となる。このような配列上の一定間隔おきのインデックス値の要素に対する処理の場合、配列上のとびとびの領域に存在する複数の要素の同時取得がgather命令によって実現される。また、演算結果の、配列上のとびとびの領域への格納が、scatter命令によって実現される。なお、ループ命令列４２のように、取得する要素を示すループ制御変数ｉの値を１ずつではなく、２以上ずつ増加させるようなループを、strideループと呼ぶことがある。 Further, the loop instruction sequence 42 when the gather / scatter instruction is used is described so as to loop (do i = 1, n, 2) until i = n while counting up the loop control variable i by 2. .. In the case of processing for the elements of the index value at regular intervals on the array, simultaneous acquisition of a plurality of elements existing in the discrete areas on the array is realized by the gather instruction. In addition, the calculation result is stored in the discrete area on the array by the scatter instruction. A loop such as the loop instruction sequence 42 in which the value of the loop control variable i indicating the element to be acquired is increased by 2 or more instead of 1 is sometimes called a stride loop.

図６は、マスク付きＳＩＭＤ命令を利用するループ処理の内容を示す図である。図６の例では、コンパイルにおけるターゲットのマシンのプロセッサが、幅５１２ビットのＳＩＭＤレジスタ５１，５２，５３を有しているものとする。これらのＳＩＭＤレジスタ５１，５２，５３それぞれには、倍精度浮動小数点（８バイト）の要素を８つ同時に格納できる。 FIG. 6 is a diagram showing the contents of loop processing using the masked SIMD instruction. In the example of FIG. 6, it is assumed that the processor of the target machine in compilation has SIMD registers 51, 52, 53 with a width of 512 bits. Eight double-precision floating-point (8-byte) elements can be stored in each of these SIMD registers 51, 52, and 53 at the same time.

ループ命令列４１では、インデックス値が奇数の要素のみを演算対象とする。そのため配列ｂの要素を読み込む際には、該当要素を「Ｔ」（真）、他の要素を「Ｆ」（偽）とするマスク５４を用いたロード（load with mask）が行われる。このロードでは、ＳＩＭＤレジスタ５１と同じ幅の配列の要素に対してロードが実行されるが、マスク５４において「Ｔ」と設定されたインデックス値の要素のみが、ＳＩＭＤレジスタ５１にロードされる。図６の例では、インデックス値「１」〜「８」の８つの要素に対するロード命令が行われ、マスク５４で「Ｔ」の４つの要素のみがＳＩＭＤレジスタ５１に格納される。同様に配列ｃの要素を読み込む際には、マスク５５を用いたロード（load with mask）が行われ、マスク５５で「Ｔ」の４つの要素のみがＳＩＭＤレジスタ５２に格納される。 In the loop instruction sequence 41, only the elements having an odd index value are calculated. Therefore, when reading the elements of the array b, a load (load with mask) is performed using a mask 54 in which the corresponding element is “T” (true) and the other elements are “F” (false). In this load, the elements in the array having the same width as the SIMD register 51 are loaded, but only the elements having the index value set as "T" in the mask 54 are loaded into the SIMD register 51. In the example of FIG. 6, a load instruction is given to the eight elements of the index values “1” to “8”, and only the four elements of “T” are stored in the SIMD register 51 with the mask 54. Similarly, when reading the elements of the array c, a load with mask is performed, and only the four elements of "T" in the mask 55 are stored in the SIMD register 52.

そしてＳＩＭＤレジスタ５１，５２の同じインデックス値（奇数のインデックス値に限る）の要素同士の加算が行われ、加算結果がＳＩＭＤレジスタ５３に格納される。ＳＩＭＤレジスタ５３に格納された値は、インデックス値が奇数の要素を「Ｔ」、他の要素を「Ｆ」とするマスク５６を用いたストア（store with mask）によって、配列ａに格納される。これにより、配列ａには、インデックス値が奇数の要素にのみ演算結果が格納される。なお、マスク付きＳＩＭＤ命令利用時のロード命令とストア命令とを合わせて、以下load／store命令と呼ぶ。 Then, elements having the same index value (limited to an odd index value) of the SIMD registers 51 and 52 are added, and the addition result is stored in the SIMD register 53. The value stored in the SIMD register 53 is stored in the array a by a store with mask using a mask 56 in which an element having an odd index value is "T" and another element is "F". As a result, the operation result is stored in the array a only in the elements having odd index values. The load instruction and the store instruction when using the masked SIMD instruction are collectively referred to as a load / store instruction below.

このように、ループ命令列４１に対してＳＩＭＤ化による最適化が行われると、配列において連続する８つの要素をループ１回転での処理対象とし、そのうちの演算対象の要素を並列で演算を実行する処理手順となる。そのためループの回転数は、ｎの１／８となる。すなわちＳＩＭ化前にループ回転数＝ｎであったものが、ＳＩＭＤ化後にループ回転数＝ｎ／８となる。 In this way, when the loop instruction sequence 41 is optimized by SIMD, eight consecutive elements in the array are targeted for processing in one rotation of the loop, and the elements to be calculated are executed in parallel. It becomes a processing procedure to be performed. Therefore, the rotation speed of the loop is 1/8 of n. That is, what was loop rotation speed = n before SIM conversion becomes loop rotation speed = n / 8 after SIMD conversion.

図７は、gather／scatter命令を利用するループ処理の内容を示す図である。ループ命令列４２では、インデックス値が奇数の要素のみを演算対象とする。そのため配列ｂの要素を読み込む際には、gather命令によって、配列ｂ内の非連続の領域の８つの要素が同時に読み出され、ＳＩＭＤレジスタ５１に格納される。同様に配列ｃの要素を読み込む際には、gather命令によって、配列ｃ内の非連続の領域の８つの要素が同時に読み出され、ＳＩＭＤレジスタ５２に格納される。 FIG. 7 is a diagram showing the contents of loop processing using the gather / scatter instruction. In the loop instruction sequence 42, only the elements having an odd index value are calculated. Therefore, when reading the elements of the array b, eight elements of the discontinuous region in the array b are simultaneously read by the gather instruction and stored in the SIMD register 51. Similarly, when reading the elements of the array c, eight elements of the discontinuous region in the array c are simultaneously read by the gather instruction and stored in the SIMD register 52.

そしてＳＩＭＤレジスタ５１，５２の同じインデックス値の要素同士の加算が行われ、加算結果がＳＩＭＤレジスタ５３に格納される。ＳＩＭＤレジスタ５３に格納された値は、scatter命令によって、配列ａ内の非連続の領域に格納される。 Then, elements having the same index value in the SIMD registers 51 and 52 are added, and the addition result is stored in the SIMD register 53. The value stored in the SIMD register 53 is stored in a discontinuous area in the array a by the scatter instruction.

このようにgather／scatter命令を利用できれば、幅５１２ビットのＳＩＭＤレジスタ５１〜５３内の倍精度浮動小数点で８要素分の領域を、すべて有効利用できる。その結果、ＳＩＭＤ化前のループ命令列４２ではループ回転数＝ｎ／２であったものが、ＳＩＭＤ化後にはループ回転数＝ｎ／２／８となる。 If the gather / scatter instruction can be used in this way, all the areas of eight elements in double-precision floating point in the SIMD registers 51 to 53 having a width of 512 bits can be effectively used. As a result, in the loop instruction sequence 42 before SIMD conversion, the loop rotation speed = n / 2, but after SIMD conversion, the loop rotation speed = n / 2/8.

図６と図７とを比較すると、gather／scatter命令を利用するループ処理の方が、ループ回転数が半分で済むことが分かる。ここで、gather／scatter命令を利用する場合と、マスク付きＳＩＭＤにおけるload／store命令とで、要素のロードとストアに要する時間が同じであれば、gather／scatter命令を利用するループ処理を採用することで大幅な性能向上が期待できる。 Comparing FIG. 6 and FIG. 7, it can be seen that the loop processing using the gather / scatter instruction requires only half the loop rotation speed. Here, if the time required to load and store the element is the same between the case of using the gather / scatter instruction and the load / store instruction in the masked SIMD, the loop processing using the gather / scatter instruction is adopted. This can be expected to significantly improve performance.

ただし、gather／scatter命令では不連続の領域へのアクセスが行われるため、一般的には、その処理に要する時間は、マスク付きＳＩＭＤ命令におけるload／store命令に要する時間よりも長い。そのため、gather／scatter命令の処理に要する時間によっては、例えループ処理の回転数が倍になっても連続領域にアクセスするマスク付きＳＩＭＤ命令を利用した方が、全体として処理が速い可能性がある。 However, since the gather / scatter instruction accesses a discontinuous area, the time required for the processing is generally longer than the time required for the load / store instruction in the masked SIMD instruction. Therefore, depending on the time required to process the gather / scatter instruction, it may be faster to use the masked SIMD instruction that accesses the continuous area even if the rotation speed of the loop processing is doubled. ..

なお、gather／scatter命令を使うか、マスク付きＳＩＭＤ命令を使うかは、マイクロアーキテクチャの各命令のレイテンシ情報だけで判断できるものではない。例えばループ１回転分（１イタレーション）の演算に要する時間が長ければ、gather／scatter命令が遅いマイクロアークチャでも、gather／scatter命令を利用する価値が高まる。それに対して、ループ１回転分の演算に要する時間が短い場合、gather／scatter命令のスピードがそのままそのループの性能に大きく影響するため、gather／scatter命令を利用する価値は下がる。このように実行性能の向上を図るには、それらの選択余地があるループの効率化手法に対して、それぞれの手法のループ命令列を使ったときにどちらのループ命令列を使えば速くなるかを適切に判断することが重要となる。 Whether to use the gather / scatter instruction or the masked SIMD instruction cannot be determined only by the latency information of each instruction of the microarchitecture. For example, if the time required for the calculation for one rotation of the loop (1 iteration) is long, the value of using the gather / scatter instruction increases even in the microarccha where the gather / scatter instruction is slow. On the other hand, when the time required for the calculation for one rotation of the loop is short, the speed of the gather / scatter instruction directly affects the performance of the loop, so that the value of using the gather / scatter instruction decreases. In order to improve the execution performance in this way, which loop instruction sequence should be used faster when using the loop instruction sequence of each method for the loop efficiency method that has a choice. It is important to make an appropriate judgment.

そこでコンピュータ１００が有するコンパイラ１２０では、最適化部１２２に含まれるループ最適化部１２２ａが、以下の計算により、どちらのループ命令列を利用するかを決定する。 Therefore, in the compiler 120 of the computer 100, the loop optimizing unit 122a included in the optimizing unit 122 determines which loop instruction sequence to use by the following calculation.

［手順１］ループ最適化部１２２ａは、ＩＦ文を含むループ命令列（マスク付きＳＩＭＤ命令利用）の１イタレーション分のサイクル数にループ回転数（例えばｎ／８）を掛け合わせ、ループのコストを算出する。 [Procedure 1] The loop optimizing unit 122a multiplies the number of cycles for one iteration of the loop instruction sequence including the IF statement (using a masked SIMD instruction) by the number of loop rotations (for example, n / 8), and costs the loop. Is calculated.

［手順２］ループ最適化部１２２ａは、strideループ（gather／scatter利用）の１イタレーション分のサイクル数にループ回転数（例えばｎ／１６）を掛け合わせ、ループコストを算出する。 [Procedure 2] The loop optimization unit 122a calculates the loop cost by multiplying the number of cycles for one iteration of the stride loop (using gather / scatter) by the loop rotation speed (for example, n / 16).

［手順３］ループ最適化部１２２ａは、手順１で算出したコストと手順２で算出したコストとの比較により、gather／scatter命令を利用するループ命令列を採用するか、マスク付きＳＩＭＤ命令を利用するループ命令列を採用するかを判定する。 [Procedure 3] The loop optimizing unit 122a adopts a loop instruction sequence that uses the gather / scatter instruction or uses a masked SIMD instruction by comparing the cost calculated in procedure 1 with the cost calculated in step 2. Determines whether to adopt the loop instruction sequence.

以下、採用するループ命令列の判定例について説明する。
図８は、採用するループ命令列の判定例を示す図である。図８には、マスク付きＳＩＭＤ命令とgather／scatter命令の実行時間に差があるケースの例を示している。 Hereinafter, an example of determining the loop instruction sequence to be adopted will be described.
FIG. 8 is a diagram showing a determination example of the loop instruction sequence to be adopted. FIG. 8 shows an example of a case where there is a difference in execution time between the masked SIMD instruction and the gather / scatter instruction.

例えばマスク付きＳＩＭＤ命令を利用するループ命令列４１の１イタレーション内の命令を中間言語命令で書き換えると、中間言語命令列６１となる。中間言語命令列６１は、第１の実施の形態に示す第１ＳＩＭＤ化命令列２の一例である。 For example, if the instruction in one iteration of the loop instruction sequence 41 using the masked SIMD instruction is rewritten with the intermediate language instruction, the intermediate language instruction sequence 61 is obtained. The intermediate language instruction sequence 61 is an example of the first SIMD instruction sequence 2 shown in the first embodiment.

中間言語命令列６１の１行目は、マスクを設定する命令である。「（ｍｏｄ（ｉ：ｉ＋７，２）＝＝（１，１，１，１，１，１，１，１）」によって、ループ制御変数ｉの現在値からｉ＋７までの整数それぞれを２で除算したときの剰余が１であれば「Ｔ」、１でなければ「Ｆ」のマスクが生成される。生成されたマスクは、レジスタ「ｐ１」に設定される。 The first line of the intermediate language instruction column 61 is an instruction for setting a mask. By "(mod (i: i + 7,2) == (1,1,1,1,1,1,1,1)", each integer from the current value of the loop control variable i to i + 7 was divided by 2. If the remainder is 1, a mask of "T" is generated, and if it is not 1, a mask of "F" is generated. The generated mask is set in the register "p1".

中間言語命令列６１の２行目は、マスクを用いたロード命令である。「ｖｌｏａｄ＿ｗｉｔｈ＿ｍａｓｋ（ｂ（ｉ：ｉ＋７），ｐ１）」によって、ループ制御変数ｉの現在値からｉ＋７までの整数それぞれをインデックス値とする配列ｂの要素のうち、マスク内の対応する値が「Ｔ」である要素が、メモリからロードされる。ロードされた要素は、レジスタ「ｖ１」に設定される。また「ｖｌｏａｄ＿ｗｉｔｈ＿ｍａｓｋ（ｃ（ｉ：ｉ＋７），ｐ１）」によって、ループ制御変数ｉの現在値からｉ＋７までの整数それぞれをインデックス値とする配列ｃの要素のうち、マスク内の対応する値が「Ｔ」である要素が、メモリからロードされる。ロードされた要素は、レジスタ「ｖ２」に設定される。なお、２行目に記載されている２つのロード命令は、並列に実行可能である。 The second line of the intermediate language instruction column 61 is a load instruction using a mask. By "vroad_with_mask (b (i: i + 7), p1)", the corresponding value in the mask is "T" among the elements of the array b whose index values are the integers from the current value of the loop control variable i to i + 7. The element that is is loaded from memory. The loaded element is set in the register "v1". Further, by "vroad_with_mask (c (i: i + 7), p1)", among the elements of the array c having each integer from the current value of the loop control variable i to i + 7 as an index value, the corresponding value in the mask is "T". The element that is "is loaded from memory. The loaded element is set in the register "v2". The two load instructions described on the second line can be executed in parallel.

中間言語命令列６１の３行目の「ｖａｄｄ＿ｗｉｔｈ＿ｍａｓｋ（ｖ１，ｖ２，ｐ１）」は、レジスタ「ｖ１」とレジスタ「ｖ２」との、マスク内の対応する値が「Ｔ」である要素同士の加算命令である。加算結果は、レジスタ「ｖ３」に設定される。 The third line "badd_with_mask (v1, v2, p1)" of the intermediate language instruction string 61 is the addition of the elements of the register "v1" and the register "v2" whose corresponding values in the mask are "T". It is an instruction. The addition result is set in the register "v3".

中間言語命令列６１の４行目は、マスクを用いたストア命令である。「ｖｓｔｏｒｅ＿ｗｉｔｈ＿ｍａｓｋ（ａ（ｉ：ｉ＋７），ｖ３，ｐ１）」によって、マスク内の対応する値が「Ｔ」であるレジスタ「ｖ３」内の値が、ループ制御変数ｉの現在値からｉ＋７までの整数それぞれをインデックス値とする配列ａの要素に設定される。 The fourth line of the intermediate language instruction column 61 is a store instruction using a mask. By "vstore_with_mask (a (i: i + 7), v3, p1)", the value in the register "v3" whose corresponding value in the mask is "T" is an integer from the current value of the loop control variable i to i + 7. It is set in the element of the array a having each as the index value.

ループ最適化部１２２ａは、命令スケジューリング用テーブル１１１ａを参照し、中間言語命令列６１の各行のレイテンシを特定する。なお中間言語命令列６１の１行目のマスクを設定する命令は、図４に示す命令スケジューリング用テーブル１１１ａの中間言語命令「ｖｃｍｐ」に対応する。中間言語命令列６１の２行目〜４行目の各命令は、命令スケジューリング用テーブル１１１ａの同名の中間言語命令に対応する。 The loop optimizing unit 122a refers to the instruction scheduling table 111a and specifies the latency of each row of the intermediate language instruction column 61. The instruction for setting the mask in the first row of the intermediate language instruction column 61 corresponds to the intermediate language instruction "vcmp" in the instruction scheduling table 111a shown in FIG. Each instruction in the second to fourth rows of the intermediate language instruction column 61 corresponds to an intermediate language instruction having the same name in the instruction scheduling table 111a.

すると中間言語命令列６１の１行目の命令のレイテンシは「４」、２行目のレイテンシは「１０」、３行目のレイテンシは「９」、４行目のレイテンシは「１０」となる。その結果、中間言語命令列６１に示される１イタレーションの実行時間は、「４＋１０＋９＋１０＝３３（ｃｙｃｌｅｓ）」となる。マスク付きＳＩＭＤ命令を利用するループ命令列４１のループ回転数は「ｎ／８」であるため、実行時間にループ回転数を乗算した値「４．１ｎ（ｃｙｃｌｅｓ）」が、ＳＩＭＤ命令を利用するループ命令列４１の実行時間となる。 Then, the latency of the instruction in the first row of the intermediate language instruction column 61 is "4", the latency of the second row is "10", the latency of the third row is "9", and the latency of the fourth row is "10". .. As a result, the execution time of one iteration shown in the intermediate language instruction sequence 61 is "4 + 10 + 9 + 10 = 33 (cycles)". Since the loop rotation speed of the loop instruction sequence 41 using the masked SIMD instruction is "n / 8", the value "4.1n (cycles)" obtained by multiplying the execution time by the loop rotation speed uses the SIMD instruction. This is the execution time of the loop instruction sequence 41.

gather／scatter命令を利用するループ命令列４２の１イタレーション内の命令を中間言語命令で書き換えると、中間言語命令列６２となる。中間言語命令列６２は、第１の実施の形態に示す第２ＳＩＭＤ化命令列３の一例である。 When the instruction in one iteration of the loop instruction sequence 42 using the gather / scatter instruction is rewritten with the intermediate language instruction, the intermediate language instruction sequence 62 is obtained. The intermediate language instruction sequence 62 is an example of the second SIMD instruction sequence 3 shown in the first embodiment.

中間言語命令列６２の１行目は、演算処理対象の要素のインデックス値を設定する命令である。「ｖｉｎｄｅｘ（ｉ，２）」によって、ループ制御変数ｉから２つとびのインデックス値のリストが生成される。なおリストには、レジスタ「ｖ０」のサイズに応じた数の要素（図８の例では８要素）のインデックス値が登録される。生成されたリストは、レジスタ「ｖ０」に設定される。 The first line of the intermediate language instruction column 62 is an instruction for setting the index value of the element to be processed. By "vinex (i, 2)", a list of two index values is generated from the loop control variable i. In the list, index values of a number of elements (8 elements in the example of FIG. 8) corresponding to the size of the register "v0" are registered. The generated list is set in the register "v0".

中間言語命令列６２の２行目は、gather命令である。「ｇａｔｈｅｒ（ｂ（ｖ０））」によって、レジスタ「ｖ０」に設定されたインデックス値の配列ｂの要素が、メモリからロードされる。ロードされた要素は、レジスタ「ｖ１」に設定される。また「ｇａｔｈｅｒ（ｃ（ｖ０））」によって、レジスタ「ｖ０」に設定されたインデックス値の配列ｃの要素が、メモリからロードされる。ロードされた要素は、レジスタ「ｖ２」に設定される。なお、２行目に記載されている２つのgather命令は、並列に実行可能である。 The second line of the intermediate language instruction column 62 is a gather instruction. By "gather (b (v0))", the elements of the array b of the index values set in the register "v0" are loaded from the memory. The loaded element is set in the register "v1". Further, the element of the array c of the index values set in the register "v0" is loaded from the memory by "gather (c (v0))". The loaded element is set in the register "v2". The two gather instructions described on the second line can be executed in parallel.

中間言語命令列６２の３行目の「ｖａｄｄ（ｖ１，ｖ２）」は、レジスタ「ｖ１」とレジスタ「ｖ２」との、対応する要素同士の加算命令である。加算結果は、レジスタ「ｖ３」に設定される。 The third line "vadd (v1, v2)" of the intermediate language instruction column 62 is an addition instruction between the corresponding elements of the register "v1" and the register "v2". The addition result is set in the register "v3".

中間言語命令列６２の４行目は、scatter命令である。「ｓｃａｔｔｅｒ（ａ（ｖ０），ｖ３）」によって、レジスタ「ｖ３」内の値が、レジスタ「ｖ０」に設定されたインデックス値の配列ａの要素に設定される。 The fourth line of the intermediate language instruction column 62 is a scatter instruction. By "scatter (a (v0), v3)", the value in the register "v3" is set in the element of the array a of the index values set in the register "v0".

ループ最適化部１２２ａは、命令スケジューリング用テーブル１１１ａを参照し、中間言語命令列６２の各行のレイテンシを特定する。例えば中間言語命令列６２の４行目のマスク付きストアを命令する「ｓｃａｔｔｅｒ」のレイテンシは、図４に示す命令スケジューリング用テーブル１１１ａの中間言語命令「ｓｃａｔｔｅｒ」のレイテンシ情報に示されている。中間言語命令列６２の１行目〜３行目の各命令は、命令スケジューリング用テーブル１１１ａの同名の中間言語命令に対応する。 The loop optimizing unit 122a refers to the instruction scheduling table 111a and specifies the latency of each row of the intermediate language instruction column 62. For example, the latency of the "scatter" that instructs the masked store in the fourth row of the intermediate language instruction column 62 is shown in the latency information of the intermediate language instruction "scatter" in the instruction scheduling table 111a shown in FIG. Each instruction in the first to third rows of the intermediate language instruction column 62 corresponds to an intermediate language instruction having the same name in the instruction scheduling table 111a.

すると中間言語命令列６２の１行目の命令のレイテンシは「４」、２行目のレイテンシは「２０」、３行目のレイテンシは「９」、４行目のレイテンシは「４０」となる。その結果、中間言語命令列６２に示される１イタレーションの実行時間は、「４＋２０＋９＋４０＝７３（ｃｙｃｌｅｓ）」となる。gather／scatter命令を利用するループ命令列４２のループ回転数は「ｎ／１６」であるため、実行時間にループ回転数を乗算した値「４．６ｎ（ｃｙｃｌｅｓ）」が、gather／scatter命令を利用するループ命令列４２の実行時間となる。 Then, the latency of the instruction in the first row of the intermediate language instruction column 62 is "4", the latency of the second row is "20", the latency of the third row is "9", and the latency of the fourth row is "40". .. As a result, the execution time of one iteration shown in the intermediate language instruction sequence 62 is "4 + 20 + 9 + 40 = 73 (cycles)". Since the loop rotation speed of the loop instruction sequence 42 using the gather / scatter instruction is "n / 16", the value "4.6n (cycles)" obtained by multiplying the execution time by the loop rotation speed is the gather / scatter instruction. This is the execution time of the loop instruction sequence 42 to be used.

ループ最適化部１２２ａは、ＳＩＭＤ命令を利用するループ命令列４１の実行時間とgather／scatter命令を利用するループ命令列４２の実行時間とを比較し、時間が短い方のループ命令列を採用する。図８の例では、ＳＩＭＤ命令を利用するループ命令列４１が採用される。すなわち、図８の例では、回転数が倍になっても、マスク付きＳＩＭＤ命令を利用した方が効率がよい。 The loop optimizing unit 122a compares the execution time of the loop instruction sequence 41 using the SIMD instruction with the execution time of the loop instruction sequence 42 using the gather / scatter instruction, and adopts the loop instruction sequence having the shorter time. .. In the example of FIG. 8, the loop instruction sequence 41 using the SIMD instruction is adopted. That is, in the example of FIG. 8, it is more efficient to use the masked SIMD instruction even if the rotation speed is doubled.

ループ最適化部１２２ａは、このようなループ処理の最適化方式を、ループごとに判定する。これにより、ソースファイル１１２に含まれる多数のループ処理が、個別に適切な方式で最適化される。 The loop optimization unit 122a determines such an optimization method for loop processing for each loop. As a result, a large number of loop processes included in the source file 112 are individually optimized in an appropriate manner.

以下、ループ最適化処理の手順を、フローチャートを参照して詳細に説明する。
図９は、ループ最適化処理の手順の一例を示すフローチャートである。以下、図９に示す処理をステップ番号に沿って説明する。 Hereinafter, the procedure of the loop optimization process will be described in detail with reference to the flowchart.
FIG. 9 is a flowchart showing an example of the procedure of the loop optimization process. Hereinafter, the process shown in FIG. 9 will be described along with the step numbers.

［ステップＳ１０１］ループ最適化部１２２ａは、解析されたソースファイル１１２内のソースコード中のループ群から、ループ命令列を１つずつ抽出する。以下、ソースコードから抽出したループ命令列をオリジナルループと呼ぶこととする。 [Step S101] The loop optimizing unit 122a extracts loop instruction sequences one by one from the loop group in the source code in the analyzed source file 112. Hereinafter, the loop instruction sequence extracted from the source code will be referred to as an original loop.

［ステップＳ１０２］ループ最適化部１２２ａは、ステップＳ１０１においてループ命令列が抽出できたか否かを判断する。例えばソースファイル１１２に記述されたすべてのループ命令列を抽出し終えたら、処理を終了する。ループ最適化部１２２ａは、抽出したループ命令列に対し、処理をステップＳ１０３に進める。またループ最適化部１２２ａは、ループ命令列が抽出できなかった場合、ループ最適化処理を終了する。 [Step S102] The loop optimizing unit 122a determines whether or not the loop instruction sequence can be extracted in step S101. For example, when all the loop instruction sequences described in the source file 112 have been extracted, the process ends. The loop optimizing unit 122a advances the process to step S103 for the extracted loop instruction sequence. Further, the loop optimizing unit 122a ends the loop optimizing process when the loop instruction sequence cannot be extracted.

［ステップＳ１０３］ループ最適化部１２２ａは、抽出したループ命令列のループタイプを判定する。ループタイプとは、ループ命令列で示されるループ処理の種別である。
ループタイプとしては、最適化するとマスク付きＳＩＭＤ命令を利用することとなるループ処理（ＴＹＰＥ１）、最適化するとgather／scatter命令を利用することとなるループ処理（ＴＹＰＥ２）、およびその他のループ処理（ＴＹＰＥ０）がある。「ＴＹＰＥ０」のループ処理は、ループ最適化部１２２ａによる最適化の対象外のループ処理である。「ＴＹＰＥ１」のループ処理は、例えば配列内の連続する要素のうち、条件分岐（ＩＦ文など）によって演算対象の要素を選択する処理である。「ＴＹＰＥ２」のループ処理は、配列内のインデックス値が不連続の要素（例えば一定のストライド幅間隔のインデックス値を有する要素）に対して演算を実行する処理である。ループタイプ判定処理の詳細は後述する（図１０参照）。 [Step S103] The loop optimizing unit 122a determines the loop type of the extracted loop instruction sequence. The loop type is a type of loop processing indicated by a loop instruction sequence.
Loop types include loop processing (TYPE1) that uses masked SIMD instructions when optimized, loop processing (TYPE2) that uses gather / scatter instructions when optimized, and other loop processing (TYPE0). ). The loop processing of "TYPE0" is a loop processing that is not the target of optimization by the loop optimizing unit 122a. The loop processing of "TYPE 1" is, for example, a processing of selecting an element to be calculated by a conditional branch (IF statement or the like) among continuous elements in an array. The loop process of "TYPE 2" is a process of executing an operation on an element having a discontinuous index value in the array (for example, an element having an index value of a constant stride width interval). Details of the loop type determination process will be described later (see FIG. 10).

［ステップＳ１０４］ループ最適化部１２２ａは、ループ命令列に示されるループ処理のループタイプが「ＴＹＰＥ１」または「ＴＹＰＥ２」か否かを判断する。ループ最適化部１２２ａは、ループタイプが「ＴＹＰＥ１」または「ＴＹＰＥ２」であれば処理をステップＳ１０５に進める。またループ最適化部１２２ａは、ループタイプが「ＴＹＰＥ０」であれば処理をステップＳ１０１に進める。 [Step S104] The loop optimizing unit 122a determines whether or not the loop type of the loop processing shown in the loop instruction sequence is "TYPE1" or "TYPE2". If the loop type is "TYPE1" or "TYPE2", the loop optimizing unit 122a advances the process to step S105. If the loop type is "TYPE0", the loop optimizing unit 122a advances the process to step S101.

［ステップＳ１０５］ループ最適化部１２２ａは、ステップＳ１０１で抽出したオリジナルループと等価の等価ループの生成処理を行う。等価ループとは、抽出したループ命令列と同じ処理結果を得る異なるループタイプのループ命令列である。例えばループ最適化部１２２ａは、抽出したループ命令列のループタイプが「ＴＹＰＥ１」であれば同じ処理結果を得る「ＴＹＰＥ２」のループ命令列を生成する。またループ最適化部１２２ａは、抽出したループ命令列のループタイプが「ＴＹＰＥ２」であれば同じ処理結果を得る「ＴＹＰＥ１」のループ命令列を生成する。等価ループ生成処理の詳細は後述する（図１１参照）。 [Step S105] The loop optimizing unit 122a performs a process of generating an equivalent loop equivalent to the original loop extracted in step S101. The equivalent loop is a loop instruction sequence of a different loop type that obtains the same processing result as the extracted loop instruction sequence. For example, the loop optimizing unit 122a generates a loop instruction sequence of "TYPE2" that obtains the same processing result if the loop type of the extracted loop instruction sequence is "TYPE1". Further, the loop optimizing unit 122a generates a loop instruction sequence of "TYPE1" that obtains the same processing result if the loop type of the extracted loop instruction sequence is "TYPE2". The details of the equivalent loop generation process will be described later (see FIG. 11).

［ステップＳ１０６］ループ最適化部１２２ａは、オリジナルループと等価ループとのそれぞれについて、命令スケジューリング処理を行う。命令スケジューリング処理では、例えば処理効率が向上するように命令の順序の入れ替えが行われる。例えばループ最適化部１２２ａは、ＳＩＭＤ命令で並列処理が可能なループ処理について、ＳＩＭＤ命令を用いた命令列に書き換える。命令スケジューリング処理の詳細は後述する（図１４参照）。 [Step S106] The loop optimizing unit 122a performs instruction scheduling processing for each of the original loop and the equivalent loop. In the instruction scheduling process, for example, the order of instructions is changed so as to improve the processing efficiency. For example, the loop optimizing unit 122a rewrites the loop processing that can be processed in parallel by the SIMD instruction into an instruction sequence using the SIMD instruction. Details of the instruction scheduling process will be described later (see FIG. 14).

［ステップＳ１０７］ループ最適化部１２２ａは、オリジナルループと等価ループとのそれぞれについて、ループコスト算出処理を行う。ループコスト算出処理は、ループ命令列に応じたループ処理全体の実行に要する時間（１イタレーションの実行時間×回転数）を、コストとして換算する処理である。ループコスト算出処理の詳細は後述する（図１５参照）。 [Step S107] The loop optimizing unit 122a performs loop cost calculation processing for each of the original loop and the equivalent loop. The loop cost calculation process is a process of converting the time (execution time of one iteration x number of revolutions) required for executing the entire loop process according to the loop instruction sequence as a cost. The details of the loop cost calculation process will be described later (see FIG. 15).

［ステップＳ１０８］ループ最適化部１２２ａは、オリジナルループと等価ループとのループコストを比較し、オリジナルループのループコストの方が等価ループのループコストより大きいか否かを判断する。ループ最適化部１２２ａは、オリジナルループのループコストの方が大きい場合、処理をステップＳ１０９に進める。またループ最適化部１２２ａは、ループコストが等しいか、あるいは等価ループの方が大きい場合、処理をステップＳ１０１に進める。 [Step S108] The loop optimizing unit 122a compares the loop costs of the original loop and the equivalent loop, and determines whether or not the loop cost of the original loop is larger than the loop cost of the equivalent loop. If the loop cost of the original loop is larger, the loop optimizing unit 122a advances the process to step S109. If the loop optimization unit 122a has the same loop cost or the equivalent loop is larger, the loop optimizing unit 122a advances the process to step S101.

［ステップＳ１０９］ループ最適化部１２２ａは、オリジナルループのループ命令列を、等価ループのループ命令列に入れ替える。
［ステップＳ１１０］ループ最適化部１２２ａは、不採用となったループ命令列を破棄し、処理をステップＳ１０１に進める。 [Step S109] The loop optimizing unit 122a replaces the loop instruction sequence of the original loop with the loop instruction sequence of the equivalent loop.
[Step S110] The loop optimizing unit 122a discards the rejected loop instruction sequence and proceeds to the process in step S101.

このようにして、ループ最適化部１２２ａは、マスク付きＳＩＭＤ命令を利用するループ命令列とgather／scatter命令を利用するループ命令列とのどちらでも実施可能なループ処理について、処理時間が短くなる方のループ命令列を採用することができる。 In this way, the loop optimizing unit 122a shortens the processing time for loop processing that can be executed by both the loop instruction sequence that uses the masked SIMD instruction and the loop instruction sequence that uses the gather / scatter instruction. The loop instruction sequence of can be adopted.

次にループタイプ判定処理について詳細に説明する。
図１０は、ループタイプ判定処理の手順の一例を示すフローチャートである。以下、図１０に示す処理をステップ番号に沿って説明する。 Next, the loop type determination process will be described in detail.
FIG. 10 is a flowchart showing an example of the procedure of the loop type determination process. Hereinafter, the process shown in FIG. 10 will be described along with the step numbers.

［ステップＳ１２１］ループ最適化部１２２ａは、オリジナルループの中から、ループ制御変数を認識したか否かを判断する。ループ制御変数とは、「ｄｏｉ＝ｓ，ｎ，ｋ」における変数「ｉ」のようにループの回転ごとに更新される変数である。ループ最適化部１２２ａは、例えばループの回転で１回転（１イタレーション）当り１回更新される変数をループ制御変数として認識する。ループ最適化部１２２ａは、１イタレーション当りの更新回数が２回以上ある変数は、その変数を最適化対象としないために、ループ制御変数ではないと認識する。 [Step S121] The loop optimizing unit 122a determines whether or not the loop control variable is recognized from the original loop. The loop control variable is a variable that is updated every rotation of the loop, such as the variable "i" in "do i = s, n, k". The loop optimizing unit 122a recognizes, for example, a variable that is updated once per rotation (1 iteration) by the rotation of the loop as a loop control variable. The loop optimizing unit 122a recognizes that a variable having two or more updates per iteration is not a loop control variable because the variable is not optimized.

ループ最適化部１２２ａは、ループ制御変数を認識した場合、処理をステップＳ１２２に進める。またループ最適化部１２２ａは、ループ制御変数を認識できなかった場合、処理をステップＳ１２５に進める。 When the loop optimizing unit 122a recognizes the loop control variable, the loop optimizing unit 122a advances the process to step S122. If the loop optimizing unit 122a cannot recognize the loop control variable, the loop optimizing unit 122a advances the process to step S125.

［ステップＳ１２２］ループ最適化部１２２ａは、ループ制御変数の増分値を認識する。増分値とは、ループ制御変数の１イタレーション当りの増加量である。例えば「ｄｏｉ＝ｓ，ｎ，ｋ」における定数「ｋ」が増分値である。増分値は、演算対象となる要素のインデックス値のストライド幅を表している。 [Step S122] The loop optimizing unit 122a recognizes the incremental value of the loop control variable. The incremental value is the amount of increase in the loop control variable per iteration. For example, the constant "k" in "do i = s, n, k" is an incremental value. The increment value represents the stride width of the index value of the element to be calculated.

なおループ処理は「ｄｏｉ＝ｓ，ｎ」と記述することもできる。この記述は、デフォルト値として定義されている増分値「１」の記載が省略されたものである。「ｄｏｉ＝ｓ，ｎ」と記述されていた場合、ループ最適化部１２２ａは、この記述が「ｄｏｉ＝ｓ，ｎ，１」と等価であると判断し、増分値を「１」と認識する。 The loop processing can also be described as "do i = s, n". In this description, the description of the increment value "1" defined as the default value is omitted. When "do i = s, n" is described, the loop optimization unit 122a determines that this description is equivalent to "do i = s, n, 1", and sets the increment value to "1". recognize.

［ステップＳ１２３］ループ最適化部１２２ａは、ループ制御変数の初期値ｓ（「ｄｏｉ＝ｓ，ｎ，ｋ」における「ｓ」）が、０以上の定数か否かを判断する。ループ最適化部１２２ａは、初期値ｓが０以上の定数の場合、処理をステップＳ１２４に進める。またループ最適化部１２２ａは、初期値ｓが０以上の定数ではない場合、処理をステップＳ１２５に進める。 [Step S123] The loop optimizing unit 122a determines whether or not the initial value s of the loop control variable (“s” in “do i = s, n, k”) is a constant of 0 or more. When the initial value s is a constant of 0 or more, the loop optimization unit 122a advances the process to step S124. If the initial value s is not a constant of 0 or more, the loop optimizing unit 122a advances the process to step S125.

［ステップＳ１２４］ループ最適化部１２２ａは、増分値ｋが定数か否かを判断する。ループ最適化部１２２ａは、増分値ｋが定数の場合、処理をステップＳ１２６に進める。またループ最適化部１２２ａは、増分値ｋが定数ではない場合、処理をステップＳ１２５に進める。 [Step S124] The loop optimization unit 122a determines whether or not the increment value k is a constant. When the increment value k is a constant, the loop optimizing unit 122a advances the process to step S126. If the increment value k is not a constant, the loop optimizing unit 122a advances the process to step S125.

［ステップＳ１２５］ループ最適化部１２２ａは、ループタイプ判定処理の戻り値を「ＴＹＰＥ０」として、ループタイプ判定処理を終了する。すなわちループ最適化部１２２ａは、オリジナルループのループタイプを「ＴＹＰＥ０」に決定する。 [Step S125] The loop optimizing unit 122a ends the loop type determination process with the return value of the loop type determination process set to "TYPE0". That is, the loop optimizing unit 122a determines the loop type of the original loop to be "TYPE0".

［ステップＳ１２６］ループ最適化部１２２ａは、増分値ｋ＝１か否かを判断する。ループ最適化部１２２ａは、増分値ｋ＝１であれば、処理をステップＳ１２７に進める。またループ最適化部１２２ａは、増分値ｋ＝１でなければ、処理をステップＳ１２９に進める。 [Step S126] The loop optimization unit 122a determines whether or not the increment value k = 1. If the increment value k = 1, the loop optimization unit 122a advances the process to step S127. Further, the loop optimization unit 122a advances the process to step S129 unless the increment value k = 1.

［ステップＳ１２７］ループ最適化部１２２ａは、オリジナルループがＩＦマスクタイプの条件を満たすか否かを判断する。ＩＦマスクタイプの条件は、例えば以下の５つの条件である。
・ループ回転分岐以外の分岐（ＩＦ文）は１つである。
・ＩＦ文の条件節が「ｍｏｄ（ループ制御変数，閾値）＝＝定数ＶＡＬ」である。
・定数ＶＡＬが「０≦定数ＶＡＬ＜増分値ｋ」を満たす。
・ＩＦ文配下の演算が一定幅のデータに対して実行される。
・ＩＦ条件節とＩＦ分岐、ループ繰り返し判定以外の命令列はＩＦ文配下である。 [Step S127] The loop optimizing unit 122a determines whether or not the original loop satisfies the IF mask type condition. The IF mask type conditions are, for example, the following five conditions.
-There is one branch (IF statement) other than the loop rotation branch.
-The conditional clause of the IF statement is "mod (loop control variable, threshold) == constant VAL".
-The constant VAL satisfies "0 ≤ constant VAL <incremental value k".
-The operation under the IF statement is executed for data of a certain width.
-The instruction sequence other than the IF conditional clause, IF branch, and loop repetition judgment is under the IF statement.

ループ最適化部１２２ａは、これらの５つの条件のすべてが満たされる場合、ＩＦマスクタイプの条件を満たすと判断し、処理をステップＳ１２８に進める。またループ最適化部１２２ａは、これらの５つの条件の少なくとも１つが満たされない場合、ＩＦマスクタイプの条件を満たさないと判断し、処理をステップＳ１３０に進める。 If all of these five conditions are satisfied, the loop optimization unit 122a determines that the IF mask type condition is satisfied, and proceeds to step S128. If at least one of these five conditions is not satisfied, the loop optimization unit 122a determines that the IF mask type condition is not satisfied, and proceeds to step S130.

［ステップＳ１２８］ループ最適化部１２２ａは、ループタイプ判定処理の戻り値を「ＴＹＰＥ１」として、ループタイプ判定処理を終了する。すなわちループ最適化部１２２ａは、オリジナルループのループタイプを「ＴＹＰＥ１」に決定する。 [Step S128] The loop optimizing unit 122a ends the loop type determination process with the return value of the loop type determination process set to "TYPE1". That is, the loop optimizing unit 122a determines the loop type of the original loop to be "TYPE1".

［ステップＳ１２９］ループ最適化部１２２ａは、増分値ｋが、予め設定された閾値より大きいか否かを判断する。閾値は、ユーザが任意に設定することができる。例えば、インデックス値が奇数または偶数の要素に対して演算を行うループ処理のみを最適化するのであれば、ユーザは、閾値として「２」を設定すればよい。ループ最適化部１２２ａは、増分値ｋが閾値より大きければ、処理をステップＳ１３０に進める。またループ最適化部１２２ａは、増分値ｋが閾値以下であれば、処理をステップＳ１３１に進める。 [Step S129] The loop optimization unit 122a determines whether or not the increment value k is larger than the preset threshold value. The threshold value can be arbitrarily set by the user. For example, if only the loop processing that performs an operation on an element having an odd or even index value is optimized, the user may set "2" as the threshold value. If the increment value k is larger than the threshold value, the loop optimization unit 122a advances the process to step S130. If the increment value k is equal to or less than the threshold value, the loop optimization unit 122a advances the process to step S131.

［ステップＳ１３０］ループ最適化部１２２ａは、ループタイプ判定処理の戻り値を「ＴＹＰＥ０」として、ループタイプ判定処理を終了する。すなわちループ最適化部１２２ａは、オリジナルループのループタイプを「ＴＹＰＥ０」に決定する。 [Step S130] The loop optimizing unit 122a ends the loop type determination process with the return value of the loop type determination process set to "TYPE0". That is, the loop optimizing unit 122a determines the loop type of the original loop to be "TYPE0".

［ステップＳ１３１］ループ最適化部１２２ａは、ループタイプ判定処理の戻り値を「ＴＹＰＥ２」として、ループタイプ判定処理を終了する。すなわちループ最適化部１２２ａは、オリジナルループのループタイプを「ＴＹＰＥ２」に決定する。 [Step S131] The loop optimizing unit 122a ends the loop type determination process with the return value of the loop type determination process set to "TYPE2". That is, the loop optimizing unit 122a determines the loop type of the original loop to be "TYPE2".

このようにして、ループ命令列に示されるループ処理の内容に応じて、オリジナルループのループタイプが判定される。そして「ＴＹＰＥ１」または「ＴＹＰＥ２」のループ命令列が、ループ最適化部１２２ａによる最適化対象となる。ループタイプの判定が終了すると、等価ループの生成処理が実行される。 In this way, the loop type of the original loop is determined according to the content of the loop processing shown in the loop instruction sequence. Then, the loop instruction sequence of "TYPE1" or "TYPE2" is the object of optimization by the loop optimizing unit 122a. When the loop type determination is completed, the equivalent loop generation process is executed.

図１１は、等価ループ生成処理の手順の一例を示すフローチャートである。以下、図１１に示す処理をステップ番号に沿って説明する。
［ステップＳ１４１］ループ最適化部１２２ａは、オリジナルループのループタイプが「ＴＹＰＥ１」か否かを判断する。ループ最適化部１２２ａは、ループタイプが「ＴＹＰＥ１」であれば、処理をステップＳ１４２に進める。またループ最適化部１２２ａは、ループタイプが「ＴＹＰＥ２」であれば、処理をステップＳ１４３に進める。 FIG. 11 is a flowchart showing an example of the procedure of the equivalent loop generation process. Hereinafter, the process shown in FIG. 11 will be described along with the step numbers.
[Step S141] The loop optimizing unit 122a determines whether or not the loop type of the original loop is "TYPE1". If the loop type is "TYPE1", the loop optimizing unit 122a advances the process to step S142. If the loop type is "TYPE2", the loop optimizing unit 122a advances the process to step S143.

［ステップＳ１４２］ループ最適化部１２２ａは、オリジナルループと同じ処理結果が得られる「ＴＹＰＥ２」のループ命令列（gather／scatter命令を利用するループ命令列）を、等価ループとして生成する。その後、ループ最適化部１２２ａは処理をステップＳ１４４に進める。 [Step S142] The loop optimizing unit 122a generates a loop instruction sequence (loop instruction sequence using the gather / scatter instruction) of "TYPE2" that can obtain the same processing result as the original loop as an equivalent loop. After that, the loop optimization unit 122a advances the process to step S144.

［ステップＳ１４３］ループ最適化部１２２ａは、オリジナルループと同じ処理結果が得られる「ＴＹＰＥ１」のループ命令列（マスク付きＳＩＭＤ命令を利用するループ命令列）を生成する。 [Step S143] The loop optimizing unit 122a generates a loop instruction sequence (loop instruction sequence using a masked SIMD instruction) of "TYPE1" that can obtain the same processing result as the original loop.

［ステップＳ１４４］ループ最適化部１２２ａは、等価ループであるループ命令列において定数にできる部分について、最適化を行う。そしてループ最適化部１２２ａは、該当部分の記述を定数に変換する。例えば「ｍｏｄ（ｓ，ｓｔｒｉｄｅ）」のｓが定数であり、増分値を示すｓｔｒｉｄｅも定数の場合、ループ最適化部１２２ａは、「ｍｏｄ（ｓ，ｓｔｒｉｄｅ）」の定数計算結果が「１」となる場合には、「ｍｏｄ（ｓ，ｓｔｒｉｄｅ）」を定数「１」に変換する。 [Step S144] The loop optimizing unit 122a optimizes a portion of the loop instruction sequence that is an equivalent loop that can be made a constant. Then, the loop optimizing unit 122a converts the description of the corresponding portion into a constant. For example, when s of "mod (s, stride)" is a constant and the stride indicating the increment value is also a constant, the loop optimizing unit 122a states that the constant calculation result of "mod (s, stride)" is "1". If so, the "mod (s, stride)" is converted to the constant "1".

［ステップＳ１４５］ループ最適化部１２２ａは、定数化できない不変式の最適化を行う。例えばループ最適化部１２２ａは、ｍｏｄ（ｓ，ｓｔｒｉｄｅ）のｓが定数でなくとも、ｓｔｒｉｄｅが定数の場合にはｍｏｄ（ｓ，ｓｔｒｉｄｅ）をループ外に追い出す。 [Step S145] The loop optimizing unit 122a optimizes invariants that cannot be made constant. For example, the loop optimizing unit 122a expels the mod (s, stride) out of the loop when the stride is a constant, even if the s of the mod (s, stride) is not a constant.

［ステップＳ１４６］ループ最適化部１２２ａは、生成した等価ループを戻り値として出力し、等価ループ生成処理を終了する。
このようにして、オリジナルループと同じ処理結果を得ることができる等価ループが生成される。ループ最適化部１２２ａは、生成されたオリジナルループと等価ループとをメモリ１０２に格納する。 [Step S146] The loop optimizing unit 122a outputs the generated equivalent loop as a return value, and ends the equivalent loop generation process.
In this way, an equivalent loop that can obtain the same processing result as the original loop is generated. The loop optimizing unit 122a stores the generated original loop and the equivalent loop in the memory 102.

図１２は、「ＴＹＰＥ１」のオリジナルループから「ＴＹＰＥ２」の等価ループへの変換例を示す図である。図１２に示すオリジナルループ７１では、ＩＦ文の中で「ｍｏｄ（ｉ，ｓｔｒｉｄｅ）．ｅｑ．ｖａｌ」という条件が示されている。これはループ制御変数ｉを定数「ｓｔｒｉｄｅ」で除算したときの剰余が定数「ｖａｌ」と等しいという条件である。このようなループ処理内で条件が定義されているループ処理は、ＳＩＭＤ化部１２２ｂにより、マスクＳＩＭＤ命令を利用する処理に最適化されることとなる。 FIG. 12 is a diagram showing an example of conversion from the original loop of “TYPE 1” to the equivalent loop of “TYPE 2”. In the original loop 71 shown in FIG. 12, the condition “mod (i, stride) .eq. Val” is shown in the IF statement. This is a condition that the remainder when the loop control variable i is divided by the constant "stride" is equal to the constant "val". The loop processing in which the conditions are defined in such a loop processing is optimized by the SIMD conversion unit 122b for the processing using the mask SIMD instruction.

オリジナルループ７１に基づいて生成された等価ループ７２では、ループ処理内で「ｉ＋＝ｓｔｒｉｄｅ」によって、一定間隔でループ制御変数を増加させている。このような一定のストライド幅でのループ制御変数の増加を伴うループ処理は、ＳＩＭＤ化部１２２ｂにより、gather／scatter命令を利用する処理に最適化されることとなる。また等価ループ７２では、オリジナルループ７１のＩＦ文内の条件を最初に満たすループ制御変数ｉの値を、ループ処理の前の２行の命令で算出している。これによりループ処理における１イタレーション当りの処理時間を短縮することができる。 In the equivalent loop 72 generated based on the original loop 71, the loop control variable is increased at regular intervals by “i + = stride” in the loop processing. The loop processing accompanied by the increase of the loop control variable with such a constant stride width is optimized by the SIMD conversion unit 122b for the processing using the gather / scatter instruction. Further, in the equivalent loop 72, the value of the loop control variable i that first satisfies the condition in the IF statement of the original loop 71 is calculated by the two lines of instructions before the loop processing. As a result, the processing time per iteration in the loop processing can be shortened.

図１３は、「ＴＹＰＥ２」のオリジナルループから「ＴＹＰＥ１」の等価ループへの変換例を示す図である。図１３に示すオリジナルループ７３では、ループ処理内で「ｉ＋＝ｓｔｒｉｄｅ」によって、一定間隔でループ制御変数を増加させている。このような一定のストライド幅でのループ制御変数の増加を伴うループ処理は、ＳＩＭＤ化部１２２ｂにより、gather／scatter命令を利用する処理に最適化されることとなる。 FIG. 13 is a diagram showing an example of conversion from the original loop of “TYPE 2” to the equivalent loop of “TYPE 1”. In the original loop 73 shown in FIG. 13, the loop control variable is increased at regular intervals by “i + = stride” in the loop processing. The loop processing accompanied by the increase of the loop control variable with such a constant stride width is optimized by the SIMD conversion unit 122b for the processing using the gather / scatter instruction.

オリジナルループ７３に基づいて生成された等価ループ７４では、ＩＦ文の中で「ｍｏｄ（ｉ，ｓｔｒｉｄｅ）．ｅｑ．ｍｏｄ（ｓ，ｓｔｒｉｄｅ）」という条件が示されている。このようなループ処理内で条件が定義されているループ処理は、ＳＩＭＤ化部１２２ｂにより、マスクＳＩＭＤ命令を利用する処理に最適化されることとなる。 In the equivalent loop 74 generated based on the original loop 73, the condition “mod (i, stride) .eq. Mod (s, stride)” is shown in the IF statement. The loop processing in which the conditions are defined in such a loop processing is optimized by the SIMD conversion unit 122b for the processing using the mask SIMD instruction.

オリジナルループに基づいて等価ループが生成されることにより、「ＴＹＰＥ１」のループ命令列と「ＴＹＰＥ２」のループ命令列との組が生成される。そしてループ最適化部１２２ａは、ループ命令列の組に対して、命令スケジューリング処理を行う。 By generating the equivalent loop based on the original loop, a pair of the loop instruction sequence of "TYPE1" and the loop instruction sequence of "TYPE2" is generated. Then, the loop optimizing unit 122a performs instruction scheduling processing on the set of loop instruction sequences.

図１４は、命令スケジューリング処理の手順の一例を示すフローチャートである。以下、図１４に示す処理をステップ番号に沿って説明する。
［ステップＳ１５１］ループ最適化部１２２ａは、「ＴＹＰＥ１」のループ命令列をメモリ１０２から取得する。 FIG. 14 is a flowchart showing an example of the instruction scheduling process procedure. Hereinafter, the process shown in FIG. 14 will be described along with the step numbers.
[Step S151] The loop optimizing unit 122a acquires the loop instruction sequence of "TYPE1" from the memory 102.

［ステップＳ１５２］ループ最適化部１２２ａは、「ＴＹＰＥ１」のループ命令列の命令スケジューリングを行う。例えばループ最適化部１２２ａは、命令スケジューリング用テーブル１１１ａから、各命令に対するレイテンシ情報、利用する演算器情報を参照して、命令スケジューリングを実施する。例えばループ最適化部１２２ａは、効率的にループ処理が実行されるように、演算結果に影響を与えない範囲で命令の順序を入れ替える。ループ最適化部１２２ａは、命令スケジューリングの手法としては、例えばリストスケジューリング、ソフトウェアパイプラインなどの手法を用いることができる。 [Step S152] The loop optimizing unit 122a performs instruction scheduling of the loop instruction sequence of "TYPE1". For example, the loop optimizing unit 122a performs instruction scheduling by referring to the latency information for each instruction and the arithmetic unit information to be used from the instruction scheduling table 111a. For example, the loop optimizing unit 122a rearranges the order of instructions within a range that does not affect the calculation result so that the loop processing can be executed efficiently. As the instruction scheduling method, the loop optimizing unit 122a can use, for example, a method such as list scheduling or software pipeline.

［ステップＳ１５３］ループ最適化部１２２ａは、「ＴＹＰＥ２」のループ命令列をメモリ１０２から取得する。
［ステップＳ１５４］ループ最適化部１２２ａは、命令スケジューリング用テーブル１１１ａを参照し、ステップＳ１５２と同様に、「ＴＹＰＥ２」のループ命令列の命令スケジューリングを行う。 [Step S153] The loop optimizing unit 122a acquires the loop instruction sequence of "TYPE 2" from the memory 102.
[Step S154] The loop optimizing unit 122a refers to the instruction scheduling table 111a and performs instruction scheduling of the loop instruction sequence of "TYPE 2" in the same manner as in step S152.

［ステップＳ１５５］ループ最適化部１２２ａは、命令スケジューリング後の「ＴＹＰＥ１」と「ＴＹＰＥ２」とのループ命令列の組を戻り値として出力し、命令スケジューリング処理を終了する。命令スケジューリング処理後のループ命令列は、例えば中間言語命令によって記述されている。ループ最適化部１２２ａは、命令スケジューリング後の「ＴＹＰＥ１」と「ＴＹＰＥ２」とのループ命令列の組をメモリ１０２に格納する。 [Step S155] The loop optimizing unit 122a outputs a set of loop instruction sequences of "TYPE1" and "TYPE2" after instruction scheduling as a return value, and ends the instruction scheduling process. The loop instruction sequence after the instruction scheduling process is described by, for example, an intermediate language instruction. The loop optimizing unit 122a stores in the memory 102 a set of loop instruction sequences of "TYPE1" and "TYPE2" after instruction scheduling.

その後、ループ最適化部１２２ａは、「ＴＹＰＥ１」と「ＴＹＰＥ２」それぞれのループ命令列のループコストを算出する。
図１５は、ループコスト算出処理の手順の一例を示すフローチャートである。以下、図１５に示す処理をステップ番号に沿って説明する。 After that, the loop optimizing unit 122a calculates the loop cost of each of the loop instruction sequences of "TYPE1" and "TYPE2".
FIG. 15 is a flowchart showing an example of the procedure of the loop cost calculation process. Hereinafter, the process shown in FIG. 15 will be described along with the step numbers.

［ステップＳ１６１］ループ最適化部１２２ａは、「ＴＹＰＥ１」の命令スケジューリング後のループ命令列をメモリ１０２から取得する。
［ステップＳ１６２］ループ最適化部１２２ａは、「ＴＹＰＥ１」のループ命令列の１回転（１イタレーション）におけるクリティカルパスの命令列を実行するのに要するプロセッサの処理のサイクル数Ｃ１を算出する。クリティカルパスとは、ループ内の命令列の連鎖の中で、最も処理時間が長くなる処理ルートである。算出したサイクル数に、プログラムを実行させるプロセッサの動作クロックの周期を乗算すれば、そのプロセッサにおける処理時間となる。 [Step S161] The loop optimizing unit 122a acquires the loop instruction sequence after instruction scheduling of "TYPE1" from the memory 102.
[Step S162] The loop optimizing unit 122a calculates the number of processing cycles C1 of the processor required to execute the instruction sequence of the critical path in one rotation (1 iteration) of the loop instruction sequence of "TYPE1". The critical path is the processing route with the longest processing time in the chain of instructions in the loop. Multiplying the calculated number of cycles by the period of the operating clock of the processor that executes the program gives the processing time in that processor.

［ステップＳ１６３］ループ最適化部１２２ａは、「ＴＹＰＥ１」の場合のＳＩＭＤ化後のループ回転数×サイクル数Ｃ１を計算し、「ＴＹＰＥ１」のループコストＴＣ１とする。例えばＳＩＭＤレジスタに８つの要素を格納可能な場合、「ＴＹＰＥ１」の場合のＳＩＭＤ化後のループ回転数は、ＳＩＭＤ化前のループ回転数の８分の１である。 [Step S163] The loop optimizing unit 122a calculates the loop rotation speed × the number of cycles C1 after SIMD in the case of “TYPE1”, and sets it as the loop cost TC1 of “TYPE1”. For example, when eight elements can be stored in the SIMD register, the loop rotation speed after SIMD conversion in the case of "TYPE 1" is one-eighth of the loop rotation speed before SIMD conversion.

［ステップＳ１６４］ループ最適化部１２２ａは、「ＴＹＰＥ２」の命令スケジューリング後のループ命令列をメモリ１０２から取得する。
［ステップＳ１６５］ループ最適化部１２２ａは、「ＴＹＰＥ２」のループ命令列の１回転（１イタレーション）におけるクリティカルパスの命令列を実行するのに要するプロセッサの処理のサイクル数Ｃ２を算出する。 [Step S164] The loop optimizing unit 122a acquires the loop instruction sequence after instruction scheduling of "TYPE2" from the memory 102.
[Step S165] The loop optimizing unit 122a calculates the number of processing cycles C2 of the processor required to execute the instruction sequence of the critical path in one rotation (1 iteration) of the loop instruction sequence of "TYPE2".

［ステップＳ１６６］ループ最適化部１２２ａは、「ＴＹＰＥ２」の場合のＳＩＭＤ化後のループ回転数×サイクル数Ｃ２を計算し、「ＴＹＰＥ２」のループコストＴＣ２とする。例えばＳＩＭＤレジスタに８つの要素を格納可能であり、ループ制御変数ｉの増分値が２の場合、「ＴＹＰＥ２」の場合のＳＩＭＤ化後のループ回転数は、ＳＩＭＤ化前のループ回転数の１６分の１である。 [Step S166] The loop optimizing unit 122a calculates the loop rotation speed x the number of cycles C2 after SIMD in the case of "TYPE2", and sets the loop cost TC2 of "TYPE2". For example, when eight elements can be stored in the SIMD register and the increment value of the loop control variable i is 2, the loop rotation speed after SIMD conversion in the case of "TYPE2" is 16 minutes of the loop rotation speed before SIMD conversion. It is one of.

［ステップＳ１６７］ループ最適化部１２２ａは、「ＴＹＰＥ１」と「ＴＹＰＥ２」とのループ命令列それぞれのループコストＴＣ１，ＴＣ２を戻り値として出力し、ループコスト算出処理を終了する。 [Step S167] The loop optimizing unit 122a outputs the loop costs TC1 and TC2 of each of the loop instruction sequences of "TYPE1" and "TYPE2" as return values, and ends the loop cost calculation process.

このように「ＴＹＰＥ１」と「ＴＹＰＥ２」とのループ命令列それぞれのループコストＴＣ１，ＴＣ２を算出後、ループ最適化部１２２ａは、ループコストの低い方のループ命令列をコンパイル後のプログラムに適用する命令列として採用する。これにより、より効率的に処理を実行可能なオブジェクトファイル１１３が生成される。 After calculating the loop costs TC1 and TC2 of each of the loop instruction sequences of "TYPE1" and "TYPE2" in this way, the loop optimization unit 122a applies the loop instruction sequence having the lower loop cost to the compiled program. Adopt as a command sequence. As a result, the object file 113 that can execute the processing more efficiently is generated.

またループ最適化部１２２ａは、ループ命令列のうち定数化が可能な部分について定数に置き換えている。これにより、ループ処理の効率化が図れる。またループ最適化部１２２ａは、ループ外に移動可能な命令をループ外に移動させることで、ループ内での処理を簡略化している。これにより、ループ処理のさらなる効率化が図れる。 Further, the loop optimizing unit 122a replaces a part of the loop instruction sequence that can be made into a constant with a constant. As a result, the efficiency of loop processing can be improved. Further, the loop optimizing unit 122a simplifies the processing in the loop by moving the instruction that can move out of the loop out of the loop. As a result, the efficiency of the loop processing can be further improved.

例えばマスク付きＳＩＭＤ命令を利用するループ命令列の場合、マスクを固定値にすることができる。
図１６は、マスク付きＳＩＭＤ命令におけるマスクの固定値化の例を示す図である。図１６に示すループ命令列８１は、図５に示したループ命令列４１のマスクの定義を固定値化したものである。ループ命令列４１は、配列のインデックス値が奇数の要素に対して演算を実行する処理である。この場合、ループ命令列８１に示すように、「ｍａｓｋ＝（／Ｔ，Ｆ，Ｔ，Ｆ，Ｔ，Ｆ，Ｔ，Ｆ／）」として、固定値のマスクをループの外で定義することができる。このようにマスクの定義をループ外に出すことで、ループ処理内の１イタレーションの処理の中間言語命令列８２では、マスクの定義に関する命令を除外することができる。その結果、ループコストは「３．６ｎ」となり、図８に示したループ命令列４１のループコストよりも小さくなる。 For example, in the case of a loop instruction sequence using a masked SIMD instruction, the mask can be set to a fixed value.
FIG. 16 is a diagram showing an example of fixing the mask value in the masked SIMD instruction. The loop instruction sequence 81 shown in FIG. 16 is a fixed value definition of the mask of the loop instruction sequence 41 shown in FIG. The loop instruction sequence 41 is a process of executing an operation on an element having an odd index value in the array. In this case, as shown in the loop instruction sequence 81, a fixed value mask can be defined outside the loop with "mask = (/ T, F, T, F, T, F, T, F /)". it can. By taking the mask definition out of the loop in this way, the instruction related to the mask definition can be excluded from the intermediate language instruction sequence 82 of the processing of one iteration in the loop processing. As a result, the loop cost becomes "3.6n", which is smaller than the loop cost of the loop instruction sequence 41 shown in FIG.

このようにループ最適化部１２２ａがマスクの固定値化を行い、ループ外でマスクを定義することで、マスク付きＳＩＭＤ命令利用の方が性能向上するケースが増加する。
なお、上記の説明では、インデックス値が奇数であることを判定する場合（奇数判定）の例を用いて説明しているが、インデックス値が偶数であることを判定する場合（偶数判定）にも、同様に適用可能である。 In this way, the loop optimizing unit 122a fixes the mask and defines the mask outside the loop, so that the use of the SIMD instruction with a mask improves the performance in an increasing number of cases.
In the above description, an example of determining that the index value is odd (odd number determination) is used, but it is also possible to determine that the index value is even (even number determination). , Applicable as well.

図１７は、偶数判定の場合のループ命令列の一例を示す図である。マスク付きＳＩＭＤ命令を利用するループ命令列９１では、ＩＦ文内の条件が「ｍｏｄ（ｉ，２）＝＝０」となっている。これにより、インデックス値が２で割り切れる場合にＩＦ文の判定結果が真となり、インデックス値が偶数の要素に対して演算が実施される。gather／scatter命令を利用するループ命令列９２では、「ｄｏｉ＝２，ｎ，２」により、ループ制御変数を、偶数の初期値「２」から、増分値「２」でループ回転ごとに増加させる。 FIG. 17 is a diagram showing an example of a loop instruction sequence in the case of even number determination. In the loop instruction sequence 91 that uses the masked SIMD instruction, the condition in the IF statement is "mod (i, 2) == 0". As a result, when the index value is divisible by 2, the determination result of the IF statement becomes true, and the operation is performed on the element having an even index value. In the loop instruction sequence 92 using the gather / scatter instruction, the loop control variable is increased from the even initial value "2" to the increment value "2" for each loop rotation by "do i = 2, n, 2". Let me.

偶数判定の場合であっても、ループ最適化部１２２ａは、図１７に示すループ命令列９１，９２それぞれのループコストを比較することで、処理効率がよい方のループ命令列を採用することができる。 Even in the case of even-numbered determination, the loop optimizing unit 122a can adopt the loop instruction sequence having the better processing efficiency by comparing the loop costs of the loop instruction sequences 91 and 92 shown in FIG. it can.

なお、奇数判定と偶数判定とは、演算対象の要素のストライド幅（インデックス値の増分値）は「２」であるが、ストライド幅が「３」以上の場合もある。
図１８は、ストライド幅が「３」の場合のループ命令列の一例を示す図である。マスク付きＳＩＭＤ命令を利用するループ命令列９３では、ＩＦ文内の条件が「ｍｏｄ（ｉ，３）＝＝１」となっている。これにより、インデックス値を３で除算したときの剰余が１の場合にＩＦ文の判定結果が真となり、該当要素に対して演算が実施される。gather／scatter命令を利用するループ命令列９４では、「ｄｏｉ＝１，ｎ，３」により、ループ制御変数を、初期値「１」から、増分値「３」でループ回転ごとに増加させる。 In the odd-numbered determination and the even-numbered determination, the stride width (incremental value of the index value) of the element to be calculated is "2", but the stride width may be "3" or more.
FIG. 18 is a diagram showing an example of a loop command sequence when the stride width is “3”. In the loop instruction sequence 93 using the masked SIMD instruction, the condition in the IF statement is "mod (i, 3) == 1". As a result, when the remainder when the index value is divided by 3 is 1, the determination result of the IF statement becomes true, and the operation is performed on the corresponding element. In the loop instruction sequence 94 using the gather / scatter instruction, the loop control variable is increased from the initial value "1" to the increment value "3" for each loop rotation by "do i = 1, n, 3".

ストライド幅「３」の場合であっても、ループ最適化部１２２ａは、図１８に示すループ命令列９３，９４それぞれのループコストを比較することで、処理効率がよい方のループ命令列を採用することができる。 Even when the stride width is "3", the loop optimizing unit 122a adopts the loop instruction sequence having the better processing efficiency by comparing the loop costs of the loop instruction sequences 93 and 94 shown in FIG. can do.

図１９は、ストライド幅が「４」の場合のループ命令列の一例を示す図である。マスク付きＳＩＭＤ命令を利用するループ命令列９５では、ＩＦ文内の条件が「ｍｏｄ（ｉ，４）＝＝１」となっている。これにより、インデックス値を４で除算したときの剰余が１の場合にＩＦ文の判定結果が真となり、該当要素に対して演算が実施される。gather／scatter命令を利用するループ命令列９６では、「ｄｏｉ＝１，ｎ，４」により、ループ制御変数を、偶数の初期値「１」から、増分値「４」でループ回転ごとに増加させる。 FIG. 19 is a diagram showing an example of a loop command sequence when the stride width is “4”. In the loop instruction sequence 95 that uses the masked SIMD instruction, the condition in the IF statement is "mod (i, 4) == 1". As a result, when the remainder when the index value is divided by 4 is 1, the determination result of the IF statement becomes true, and the operation is performed on the corresponding element. In the loop instruction sequence 96 using the gather / scatter instruction, the loop control variable is increased from the even initial value "1" to the increment value "4" for each loop rotation by "do i = 1, n, 4". Let me.

ストライド幅「４」の場合であっても、ループ最適化部１２２ａは、図１９に示すループ命令列９５，９６それぞれのループコストを比較することで、処理効率がよい方のループ命令列を採用することができる。 Even when the stride width is “4”, the loop optimizing unit 122a adopts the loop instruction sequence having the better processing efficiency by comparing the loop costs of the loop instruction sequences 95 and 96 shown in FIG. can do.

なお、ストライド幅が大きくなるほど、マスク付きＳＭＩＤ命令を利用した場合にＳＩＭＤ命令で並列処理できる要素数が少なくなる。そのため、ストライド幅が大きくなるほど、マスク付きＳＭＩＤ命令を利用するループ命令列よりも、gather／scatter命令を利用するループ命令列の方が効率的となる可能性が高くなる。そこでループ最適化部１２２ａは、ループタイプ判定処理（図１０参照）において、増分値ｋが「１」以外であり、所定の閾値より大きい場合には、ループタイプを「ＴＹＰＥ０」とすることで以後の処理を省略し、コンパイル処理の効率化を図っている。 The larger the stride width, the smaller the number of elements that can be processed in parallel by the SIMD instruction when the masked SCID instruction is used. Therefore, as the stride width becomes larger, the loop instruction sequence using the gather / scatter instruction is more likely to be more efficient than the loop instruction sequence using the masked SCID instruction. Therefore, in the loop type determination process (see FIG. 10), the loop optimizing unit 122a sets the loop type to "TYPE 0" when the increment value k is other than "1" and is larger than a predetermined threshold value. The process of is omitted to improve the efficiency of the compilation process.

すなわち、増分値ｋが「１」以外の場合、オリジナルループはgather／scatter命令を利用するループ命令列であるが、増分値ｋが閾値を超えているとループコストの計算をするまでもなく、gather／scatter命令を利用する方が効率的となる。このような場合、ループ最適化部１２２ａは、等価ループの生成やループコストの算出処理を省略することで、コンパイル時の処理の効率化を図っている。 That is, when the increment value k is other than "1", the original loop is a loop instruction sequence using the gather / scatter instruction, but when the increment value k exceeds the threshold value, it is not necessary to calculate the loop cost. It is more efficient to use the gather / scatter instruction. In such a case, the loop optimizing unit 122a aims to improve the efficiency of the processing at the time of compilation by omitting the generation of the equivalent loop and the calculation processing of the loop cost.

〔その他の実施の形態〕
第２の実施の形態は、ＳＩＭＤレジスタのサイズが５１２ビット（６４バイト）の場合の例を示したが、ＳＩＭＤレジスタのサイズは５１２ビットに限定されない。さらに第２の実施の形態では、ＳＩＭＤ命令による演算対象の要素のデータ長を８バイトとしているが、要素のデータ長は８バイトに限定されない。各ループ命令列におけるループ回転数は、ＳＩＭＤレジスタのサイズが大きいほど少なくなり、演算対象のデータのサイズが大きいほど大きくなる。 [Other embodiments]
The second embodiment shows an example in which the size of the SIMD register is 512 bits (64 bytes), but the size of the SIMD register is not limited to 512 bits. Further, in the second embodiment, the data length of the element to be calculated by the SIMD instruction is 8 bytes, but the data length of the element is not limited to 8 bytes. The loop rotation speed in each loop instruction sequence decreases as the size of the SIMD register increases, and increases as the size of the data to be calculated increases.

以上、実施の形態を例示したが、実施の形態で示した各部の構成は同様の機能を有する他のものに置換することができる。また、他の任意の構成物や工程が付加されてもよい。さらに、前述した実施の形態のうちの任意の２以上の構成（特徴）を組み合わせたものであってもよい。 Although the embodiment has been illustrated above, the configuration of each part shown in the embodiment can be replaced with another having the same function. Further, any other components or processes may be added. Further, any two or more configurations (features) of the above-described embodiments may be combined.

１ループ命令列
２第１ＳＩＭＤ化命令列
３第２ＳＩＭＤ化命令列
１０情報処理装置
１１記憶部
１１ａプログラム
１１ｂ実行時間情報
１２処理部 1 Loop instruction sequence 2 First SIMD instruction sequence 3 Second SIMD instruction sequence 10 Information processing device 11 Storage unit 11a Program 11b Execution time information 12 Processing unit

Claims

On the computer
Extract a loop instruction sequence that directs operations on some elements in the array from the program to be compiled.
Based on the loop instruction sequence, a mask that enables only the element to be calculated among the plurality of elements in the array is used, and SIMD (Single Instruction Multiple Data) is applied to one or more elements enabled by the mask. Generate a first SIMD instruction sequence that describes a loop process that repeats the process of executing operations in parallel by instructions.
Based on the loop instruction sequence, a plurality of discontinuous elements in the array are stored in a continuous area in the SIMD register, and operations are executed in parallel by the SIMD instruction for the plurality of elements in the SIMD register. Generates a second SIMD instruction sequence that describes the loop processing that repeats the processing to be performed.
Based on the execution time information in which the numerical value indicating the time required for executing each of the plurality of instructions is set, the processing time of the first SIMD instruction sequence and the processing time of the second SIMD instruction sequence are calculated.
Based on the processing time of the first SIMD instruction sequence and the processing time of the second SIMD instruction sequence, the instruction sequence to be included in the compiled program among the first SIMD instruction sequence and the second SIMD instruction sequence is selected. select,
A compilation program that executes processing.

In the calculation of the processing time, the value obtained by multiplying the time required for executing the processing per rotation of the loop processing of the first SIMD conversion instruction sequence by the loop rotation speed of the first SIMD conversion instruction sequence is multiplied by the loop rotation speed of the first SIMD conversion instruction sequence. The processing time is defined as the processing time of the second SIMD instruction sequence, which is obtained by multiplying the time required for executing the processing per rotation of the loop processing of the second SIMD instruction sequence by the loop rotation speed of the second SIMD instruction sequence. To
The compilation program according to claim 1.

In the calculation of the processing time, the value obtained by dividing the number of rotations of the loop processing in the loop instruction sequence by the number of elements that can be stored in the SIMD register is used as the loop rotation number of the first SIMD instruction sequence, and is used in the loop instruction sequence. The loop rotation of the second SIMD instruction sequence is obtained by dividing the number of rotations of the loop processing by the number of elements that can be stored in the SIMD register and the increment value for each rotation of the index value indicating the element to be calculated. Number,
The compilation program according to claim 2.

It has the execution time information corresponding to each of a plurality of microarchitectures,
In the calculation of the processing time, the processing time between the first SIMD instruction sequence and the second SIMD instruction sequence is calculated with reference to the execution time information corresponding to the microarchitecture specified at compile time.
The compilation program according to claim 2 or 3.

In the generation of the first SIMD instruction sequence, the first SIMD instruction sequence for executing the mask generation process before the start of the loop process is generated.
The compilation program according to any one of claims 1 to 4.

On the computer,
It is determined whether or not the increment value of the loop control variable for each loop processing of the loop instruction sequence exceeds a predetermined threshold value.
When the increment value exceeds a predetermined threshold value, a process of suppressing the generation of the first SIMD instruction sequence and the execution of the calculation of the processing time is executed.
In the selection of the command sequence, when the incremental value exceeds a predetermined threshold value, the second SIMD command sequence is selected.
The compilation program according to any one of claims 1 to 5.

A storage unit that stores the program to be compiled,
A mask that extracts a loop instruction sequence instructing an operation on some elements in the array from the program and enables only the element to be operated out of a plurality of elements in the array based on the loop instruction sequence. Is used to generate a first SIMD instruction sequence in which a loop process of repeating a process of executing operations in parallel by a SIMD (Single Instruction Multiple Data) instruction for one or more elements enabled by the mask is generated. Based on the loop instruction sequence, a plurality of discontinuous elements in the sequence are stored in a continuous area in the SIMD register, and operations are executed in parallel by the SIMD instruction for the plurality of elements in the SIMD register. The processing time of the first SIMD instruction sequence is based on the execution time information in which the second SIMD instruction sequence in which the loop processing that repeats the processing is described is generated and a numerical value indicating the time required for executing each of the plurality of instructions is set. And the processing time of the second SIMD instruction sequence are calculated, and based on the processing time of the first SIMD instruction sequence and the processing time of the second SIMD instruction sequence, the first SIMD instruction sequence and the second SIMD instruction sequence are calculated. Of the instruction sequences, the processing unit that selects the instruction sequences to be included in the compiled program, and
Information processing device with.