JP2003114618A

JP2003114618A - Data processing device

Info

Publication number: JP2003114618A
Application number: JP2001308154A
Authority: JP
Inventors: Masahiro Kaminaga; 正博神永; Takashi Endo; 隆遠藤; Takashi Watanabe; 高志渡邊; Kunihiko Nakada; 邦彦中田; Taku Tsukamoto; 卓塚本; Seiji Kobayashi; 誠治小林
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2001-10-04
Filing date: 2001-10-04
Publication date: 2003-04-18
Anticipated expiration: 2021-10-04
Also published as: JP3904421B2

Abstract

PROBLEM TO BE SOLVED: To make fast processing conventionally actualized by operating two computing elements in parallel in a Montgomery residue arithmetic circuit. SOLUTION: The mentioned processing can be mathematically composed and actualized by a single computing element.

Description

Detailed Description of the Invention

【０００１】[0001]

【発明の属する技術分野】本発明は，剰余乗算演算(mod
ular multiplication)や、べき乗剰余演算(modular exp
onentiation)を用いた符号化(encryption)及び復号化(d
ecryption)装置に適用して有効なものであり、特に、IC
カード(smart card)のような高いセキュリティを持つデ
ータ処理装置に適用して有効な技術に関するものであ
る。TECHNICAL FIELD The present invention relates to a modular multiplication operation (mod).
ular multiplication) and modular exponentiation (modular exp)
onentiation) and decoding (d
ecryption) is effective when applied to devices, especially IC
The present invention relates to a technique effectively applied to a data processing device having high security such as a smart card.

【０００２】[0002]

【従来の技術】ヨーロッパで広く利用されているＧＳＭ
(Global Systems for Mobile communications)規格のモ
バイルホンをはじめとするモバイル端末や、ICカード(s
mart card)などでは、利用者認証(user authenticatio
n)や、電子商取引(electric commerce)を行うことがで
きる。一般に、電子マネー(electric money)として利用
する場合、それはＩＣカードの形態を取っており、ＧＳ
Ｍモバイルホン(GSM mobile radiotelephone system)の
場合は、ＳＩＭ(Subscriber Identification Module)と
よばれる形態を取っている。SIMもICカードも端子付き
の半導体チップをプラスティックの板に張り付けたもの
であり、いずれもデバイスの実体は同じく半導体チップ
であるので、以下ＩＣカードについて説明する。ＩＣカ
ードは、勝手に書き換えることが許されないような個人
情報の保持や、秘密情報である暗号鍵(cryptographic k
ey)を用いたデータの暗号化(encryption)や暗号文(ciph
er text)の復号化(decryption)を行う装置である。ICカ
ード自体は電源を持っておらず，ＩＣカード用のリーダ
ライタ(Card reader/writer)に差し込まれると，電源の
供給を受け，動作可能となる。動作可能になると，リー
ダライタから送信されるコマンドを受信し，そのコマン
ドに従って，データの転送等の処理を行う。ICカードの
一般的な解説は，オーム社出版電子情報通信学会編水沢
順一著「ICカード」などにある。ICカードの構成は，図
１に示すように，カード101の上に，ICカード用チップ1
02を搭載したものである。図に示すように、一般にICカ
ードは，ISO7816の規格に定められた位置に供給電圧端
子Vcc, グランド端子GND，リセット端子RST, 入出力端
子I/O, クロック端子CLKを持ち，これらの端子を通し
て，リーダーライタから電源の供給やリーダライタとの
データの通信を行う(W.Rankl and Effing :SMARTCARD H
ANDBOOK, John Wiley & Sons, 1997, pp.41参照)。ICカ
ード用チップの構成は，基本的には通常のマイクロコン
ピュータと同じ構成である。その構成は，図2に示すよ
うに，中央処理装置(CPU：Central Processing Unit)20
1，記憶装置(Memory)204，入出力（I/O）ポート207，コ
・プロセッサ(coprocessor)202からなる（コ・プロセッ
サはない場合もある）。CPU201は，論理演算(logical o
peration)や算術演算(arithmetic operation)などを行
う装置であり，記憶装置204は，プログラムやデータを
格納する装置である。入出力ポートは，リーダライタと
通信を行う装置である。コ・プロセッサは，暗号処理そ
のもの、または、暗号処理に必要な演算を高速に行う装
置であり、例えば、RSA暗号の剰余演算を行う為の特別
な演算装置や、DES(Data Encryption Standard)の処理
を行う暗号装置などがある。ICカード用プロセッサの中
には，コ・プロセッサを持たないものも多くある。デー
タバス203は，各装置を接続するバスである。記憶装置
２０４は，ROM(Read Only Memory)やRAM(Random Access
Memory)，EEPROM(Electrical Erasable Programmable
Read Only Memory)などからなる。ROMは，変更できない
メモリであり，主にプログラムを格納するメモリであ
る。RAMは自由に書き換えができるメモリであるが，電
源の供給が中断されると，記憶している内容は消滅す
る。ICカードがリーダライタから抜かれると電源の供給
が中断されるため，RAMの内容は，保持されなくなる。E
EPROMは，電源の供給が中断されてもその内容を保持す
ることができるメモリである。このメモリは、書き換え
る必要があり，ICカードがリーダライタから抜かれて
も，保持するデータを格納するために使われる。例え
ば，プリペイドカードの利用度数などは，使用するたび
に書き換えられ，かつリーダライタから抜かれてもデー
タを保持する必要があるため，EEPROMで保持される。利
用者認証や、電子マネー決済を実現するには、公開鍵暗
号(public key cryptography)の技術が必要である。公
開鍵暗号とは、公開情報の中に秘密情報を埋め込んで用
いるものであり、送信側と受信側で異なる鍵を用いる
為、非対称鍵暗号(asymmetric key cryptography)とも
呼ばれる。現在広く利用されている公開鍵暗号として、
RSA暗号がある。RSA暗号は、大きな素数の積を生成する
ことは容易だが、与えられた合成数を因数分解すること
は困難であるという事実に基づいている。RSA暗号で
は、公開モジュラス(public modulus)Nと、与えられた
暗号文(cipher text)Yに対し、秘密鍵指数(secret expo
nent)Xを用いて、Y^X mod N（Y^Xは、YのX乗を意味す
る）という計算を行うことにより、平文(plain text)M
を得る計算、又は、この逆の操作が行われる。この操作
は、べき乗剰余演算(modular exponentiation)と呼ばれ
る。ここで、前記X,Y,Nは、2001年現在、1024ビットか
ら2048ビット程度の非常に大きな数が利用される為、
「Y^X mod N」をいかにして高速に実行するかが、従来
から応用数学、工学の分野で課題とされていた。特に、
ICカードのように記憶装置の容量が制限されており、か
つCPUの能力が低いデバイスにおいては、前記べき乗剰
余演算は非常に大きなタスクであり、その高速演算は非
常に重要な課題である。べき乗剰余演算のアルゴリズム
は種々知られているが、例えば、次に示すものは、広く
利用されている。これは、加法連鎖方式(addition chai
n method)と呼ばれている。このアルゴリズムにおいては、nは、Xのビット長に対応
され、(X[n-1]X[n-2]...X[1]X[0])は、Xの二進数表現で
ある。本アルゴリズムは、概略的には、二乗の剰余乗算
「A^2 mod N」（ステップ４）及び、剰余乗算「A = A*B
mod N」（ステップ５）を組み合せて実行され、X[n-
1], X[n-2], ..., X[1], X[0]における1の個数をH(X)と
すると、二乗の剰余乗算「A^2 mod N」にn回、剰余乗算
「A = A*B mod N」にH(X)回の演算が繰り返し行われる
ことになる。上記のアルゴリズム1で、正しい結果が得
られることを数値例で確認しておく。アルゴリズムの確
認の意味では、指数Xを具体化すれば十分であるので、Y
及びNは、記号のまま用いることにする。 S = Y^45 mod N をアルゴリズム１に従って計算する。ここで、指数45
は、二進数では、 45 = 101101（二進数）と書くことができる。従って、アルゴリズム１の記号を
用いて指数を表現すると、 X[5] = 1, X[4] = 0, X[3] = 1, X[2] = 1, X[1] = 0,
X[0] = 1 となる。ビット数nは6である。Sの初期値は1である。 (1) j=5のとき Sの初期値は1であるから、これを二乗しNで割った剰余
を取っても、1である（ステップ４）。X[5] = 1である
から、ステップ５が実行され、 S = 1*Y mod N = Y mod N が得られる。 (2) j=4のときこのときのSの値は、最初S = Y mod Nであるから、これ
を二乗してNで割った剰余を取ると、S = Y^2 mod Nとな
る。X[4] = 0であるから、ステップ５は実行されず、S
= Y^2 mod Nのままである。 (3) j=3のときこのときのSの値は、最初S = Y^2 mod Nであるから、こ
れを二乗してNで割った剰余を取ると、S = Y^4 mod Nと
なる。X[3] = 1であるから、ステップ５が実行され、 S = Y^4*Y mod N = Y^5 mod N となる。 (4) j=2のときこのときのSの値は、最初S = Y^5 mod Nであるから、こ
れを二乗してNで割った剰余を取ると、S = Y^10 mod N
となる。X[2] = 1であるから、ステップ５が実行され、 S = Y^10*Y mod N = Y^11 mod N となる。 (5) j=1のときこのときのSの値は、最初S = Y^11 mod Nであるから、
これを二乗してNで割った剰余を取ると、S = Y^22 mod
Nとなる。X[1] = 0であるから、ステップ５は実行され
ず、S = Y^22 mod Nのままである。 (6) j=0のときこのときのSの値は、最初S=Y^22 mod Nであるから、こ
れを二乗してNで割った剰余を取ると、S = Y^44 mod N
となる。X[0] = 1であるから、ステップ５が実行され、 S = Y^44*Y mod N = Y^45 mod N となり、所望の結果が得られる。このようにアルゴリズ
ム1ではべき乗剰余演算を剰余乗算（二乗も含む）に分
解して実行する為、「A*B mod N」の演算機能を持つ演
算装置を用いればよい。しかしながら、A, B, Nはいず
れも非常に大きな値であり、例えば、データ長を現在主
流の1024ビットであるとすると、中間結果A*Bは、2048
ビットの大きな数となる問題がある。さらに、A*BをNで
割った値が最終結果となるため、2048ビット÷1024ビッ
トという大きな値を扱う除算を実行しなければならな
い。ここで、乗算は、乗数と被乗数を分割することによ
り、マイクロプロセッサ等により並列処理が可能である
が、除算は、並列化が困難であり、これが高速化を阻む
要因となっていた。このような剰余乗算「A*B mod N」
における除算の問題を解決するため、Nによる除算を行
わずに「A*B*R^(-1) mod N」を実行するアルゴリズム２
が知られている。ここで、Rは、2^n(nは例えばNのビッ
ト長(bit-length))であり、R>Nを満たす正の整数であ
る。又、以下では、Nを奇数と仮定する。この仮定は、R
SAや素体(prime field)上の楕円曲線暗号(elliptic cur
ve cryptography)においては妥当な仮定である。実際、
RSAにおいては、Nは大きな素数の積であり、素体では、
大きな素数を法とする剰余演算(modular arithmetic op
eration)が行われるので、同様に、Nは奇数である。下
記アルゴリズム２は、数学者ピーター・モンゴメリ(Pete
r Montgomery)によって提案されたものである。アルゴ
リズム２を導く論証の詳細については、ここでは説明を
省略するが、例えば、モンゴメリ自身によって書かれた
論文P. L. Montgomery, ”Modular Multiplication wit
hout Trial Division”, Math. Comp., vol. 44, 1985,
pp.519-521、又は、暗号理論(cryptology)における標
準的ハンドブック A. Menezes, P.C.van Oorschot, S.
A.Vanstone, ”Handbook of Applied Cryptography”,
CRC-press, 1997, pp. 602-603に記載がある。アルゴリズム２において求まるのは、A*B*R^(-1) mod N
の値であるので、この演算を行う装置を用いてRSA暗号
におけるべき乗剰余演算を実行するには、次のアルゴリ
ズム３のように修正したアルゴリズムを用いる必要があ
る。 [アルゴリズム３] input Y, X=(X[n-1]X[n-2]...X[1]X[0]), N output Y^X mod N A = R mod N ステップ１ B = Y*R mod N ステップ２ For j =n-1 to 0 step 1{ ステップ３ A = (A^2)*R^(-1) mod N ステップ４ If X[j] = 1 then A = A*B*R^(-1) mod N ステップ５ } A = A*R^(-1) mod N ステップ６ output A ステップ７アルゴリズム３においてステップ１−５の間、乗数(mul
tiplier)及び被乗数(multiplicand)には、R mod Nが乗
ぜられた形をとる。このデータ形式を、以下、モンゴメ
リ形式と呼ぶことにする。上記モンゴメリ乗算「ABR^(-
1) mod N」において、モンゴメリ形式は不変に保たれ
る。実際、Aのモンゴメリ形式表示をMont(A)= A*R mod
Nのように書くことにすると、 Mont(A)*Mont(B)*R^(-1) mod N = A*R*B*R*R^(-1) mod N = A*B*R mod N = Mont(A*B) となる。従って、例えば、「A*B*R^(-1) mod N」, 「(A
^2)*R^(-1) mod N」, 「A*R^(-1) mod N」を計算する演
算器があれば、上記アルゴリズム３の手続きをCPUとこ
れらの演算器を用いて高速に実行することができる。
「(A^2)*R^(-1) modN」, 「A*R^(-1) mod N」は、それ
ぞれ、「A*B*R^(-1) mod N」において、B=A，B=1とおい
たものであるので、「A*B*R^(-1) mod N」という演算器
を構成すれば、べき乗剰余演算を実現することができる
ことがわかる。次に、ステップ２で示した「A*B*R^(-1)
mod N」を計算するアルゴリズムにおいて、これを高速
化する方法について、これまでに行われてきたことを簡
単にまとめておく。説明の為、アルゴリズム２を再掲す
る。公知の代表的な高速化方法は、以下のように分類するこ
とができる。第一の高速化方法：ステップ３，４，５を含むループの
繰り返しの回数を減らすもの第二の高速化方法：ステップ３，４，５の処理を並列化
するもの第三の高速化方法：ステップ６の処理をより簡単なもの
に置換えるもの第一の高速化方法の代表的なものとして、加算処理を複
数ビットまとめて実行するものがある。例えば、kビッ
トの処理をまとめて実行する方法を以下のアルゴリズム
４に示す。 [アルゴリズム４] For j = 0 to 2^k 1 step +1{ ステップ１ TBL(j, B) = j*B ステップ２ TBL(j, N) = j*N ステップ３ } W = 0 ステップ４ For j = 0 to n/k-1 step +1{ ステップ５ T = W + TBL(A[j,k], B) ステップ６ U = T mod 2^k ステップ７ M[j,k] = -U*N[0,k]^(-1) mod 2^k ステップ８ W = T + TBL(M[j,k], N) ステップ９ W = W/2^k ステップ１０ } If W >= N then W = W N ステップ１１ Output W ステップ１２ここでは簡単の為、kはnの約数であるものとする。例え
ば、n=1024のときは、k=2, 4, 8等を選べばよい。A[j,
k]は、Aの下からj番目のkビットブロックの値を示して
いる。同様に、M[j,k]は、Mの下からj番目のkビットブ
ロックの値を示し、N[0, k]は、Nの最下位のkビットブ
ロックの値を示している。又、ステップ8の演算は、値
は正になるようにとるものとする。該アルゴリズム４に
おいて、k =1とするとNが奇数であるため、MはUに一致
し、アルゴリズム２が得られる。この方法に基づいて、
剰余乗算を行うものは、特開平7-20778, 武仲正彦他
「剰余計算装置，テーブル作成装置および乗算剰余計算
装置」において、その実現方法が示されている。アルゴ
リズム４は、若干複雑であるので、そのしくみを簡単に
説明する。アルゴリズム４の内容を説明するに先立っ
て、モンゴメリ法の原理について簡単に理解しておく必
要がある。モンゴメリ法は、原理的には、MとWを未知数
とした不定方程式(Diophantine equation)：（式１） A*B + M*N = W*R を解くことと同値である。まず、Nは奇数であり、R=2^n
と互いに素であるから、（式１）は整数の解を持つこと
に注意する。Nが偶数の場合、（式１）は必ずしも整数
解(integer solution)を持たないので、Nが奇数である
という仮定は、本質的なものである。（式１）の意味を
理解する為に、これを図で表現したものが、図３であ
る。R=2^nであるから、（式１）の右辺の下位nビットは
全て0であるので、（式１）を満たすMを求めるというこ
とは、ABとMNの下位nビットが0になるようにMを選ぶと
いう操作に他ならない。この上位半分のビット列が、求
める、A*B*R^(-1) mod Nの値となる。但し、2nビットの
数と2nビットの数の和は2n+1ビットになり、最上位のビ
ットは1になる可能性がある（以下、このビットをOVビ
ットと呼ぶことにする）。これは、アルゴリズム４のス
テップ１１において解消される。（式１）の両辺をR=2^
nで割った余り(residue)を取ると、（式２） (A*B mod 2^n) + (M*N mod 2^n) = 0 を得る。（式２）を変形して、（式３） M = -A*B*N^(-1) mod 2^n を得る。法が2のべき（power）なので、このMはビット
ブロック毎に順に求めることができる。これがアルゴリ
ズム４の意味である。アルゴリズム４においては、ル
ープの回数が1/kになっているので、粗く言って、ルー
プ部は約k倍の速度となる。但し、この場合、kを大きく
すると、乗算テーブルのサイズが大きくなるという問題
がある。サイズが大きくなると、RAM領域が圧迫され
る。一般に、ICカード等のデバイスでは、RAM領域が小
さいので、kを大きくとることはできないことが多い。
これが第一の問題である。第二の高速化方法は、第一の
高速化方法と合わせて用いることができる。アルゴリズ
ム４において、Mの値は、Tの下位kビットが決まった時
点で決定される（ステップ８）ことに注目すると、決定
直後にステップ９の処理が実行可能であることがわか
る。従って、nビットの加算器(adder)を二つ搭載し、ス
テップ６とステップ９を並列に実行するような回路を構
成すれば、速度はほぼ2倍になる。但し、この方法を取
ると、加算器が二倍必要となる。これは演算器の量が二
倍になることを意味するだけでなく、実行時の消費電力
(power consumption)もまた二倍になることを意味して
いる。これが第二の問題である。第三の高速化方法は、
第一、第二の高速化方法とは独立である。よく知られて
いるように、アルゴリズム４のステップ１１のような比
較処理は、実際に減算を実行し、負フラグ(negative fl
ag)を調べることによって達成される。これは、1024ビ
ット等の大きな数同士の減算処理であるので、処理時間
が無視できない場合がある。一方、計算機では、R = 2^
nとの比較は容易である。そこで、最後の比較減算処理
部（ステップ１１）を、次のように変更したアルゴリズ
ム５を用いることにより高速化を実現することができ
る。この変更による高速化については、既に、特開平10
-21057,中田邦彦「データ処理装置及びマイクロコンピ
ュータ」, Kunihiko Nakada: DATA PROCESSOR AND MICR
OPROCESSOR, United States Patent, US005961578Aに記
載がある。 [アルゴリズム５] For j = 0 to 2^k 1 step +1{ ステップ１ TBL(j, B) = j*B ステップ２ TBL(j, N) = j*N ステップ３ } W = 0 ステップ４ For j = 0 to n/k-1 step +1{ ステップ５ T = W + TBL(A[j,k], B) ステップ６ U = T mod 2^k ステップ７ M[j,k] = -U*N[0,k]^(-1) mod 2^k ステップ８ W = T + TBL(M[j,k], N) ステップ９ W = W/2^k ステップ１０ } If W >= R then W = W N ステップ１１ Output W ステップ１２ここで、ステップ１１における比較処理は、図３におけ
るOVビットが1であるかどうかを判定するだけで実現で
きる。これは減算処理よりもずっと軽い処理であり、高
速である。但し、アルゴリズム５で得られる値は、A*B*
R^(-1) mod N 自身ではなくA*B*R^(-1) mod N + Nであ
る場合がある。しかし、ビット長はnビットのままであ
るので、べき乗剰余演算では、最終的に値の修正を行う
だけでよい。本発明は、この高速化方法とは無関係であ
り、第一、第二の高速化方式と関係するものである。上
記の問題１，２は、楕円曲線暗号(elliptic curve cryp
tosystem)で現れる標数２のガロア体(Galois field of
characteristic 2)においても生ずるものである。ガロ
ア体の概念を簡単に説明しておく。ガロア体そのもの
は、純粋に数学の概念であるが、適当な同型写像(homom
orphism)を構成することによって、具体的な演算に翻訳
することができる。ガロア体を計算機上に実現する方法
のうち、最も簡単なものは、多項式の剰余演算(modular
arithmetic operation)を用いるものである。まず、係
数(coefficient)が、0か1であるような多項式全体を考
え、この集合をPOLYとする。POLYの元(element)で、勝
手にn次の既約多項式(irreducible polynomial of degr
ee n)F(X)を選び、POLYの二つの元、A(X), B(X)に対
し、その和（差と同義になる）を、通常の多項式の和に
おいて、係数の和は排他的論理和(exclusive OR)の意味
で行うものと定義し、積をA(X)*B(X) mod F(X)によって
定義する。代数学においてよく知られているように、F
(X)の倍元(multiple)でないPOLYの元は、前記の積演算
に対する逆(inverse)を持つ。POLYにおいて、A(X)とB
(X)が同値(equivalent)であるとは、これらの差A(X)-B
(X)がF(X)の倍元になっていることと定義し、POLYをこ
の同値関係(equivalent relation)によって同値類(equi
valent class)に分けたものをGF(2^n)と書く。これが、
標数２のガロア体である。GF(2^n)の代数構造(algebrai
c structure)は、n次の既約多項式F(X)の選び方に依ら
ないことはよく知られている。実装上は、3項(3 terms)
及び5項からなる多項式を選ぶことが多い。この多項式F
(X)を、reduction polynomialと呼ぶことがある。ガロ
ア体の積演算は、通常の積演算に現れる加算処理を排他
的論理和の意味で行うことによって実現される。前記加
算器において、キャリーの伝播を行わなければ、ビット
毎の排他的論理和(bitwise exclusive OR)を行うことと
同値になり、ガロア体の積演算が実現される。モンゴメ
リ法においても、本質的に同様の操作を行っている。通
常のモンゴメリ法において必要だった「モジュラスNは
奇数」であるという条件は、F(X)の定数項が1であると
いう条件に置換えられ、これは、F(X)が既約であるとい
うことから、自動的に従うものであり、モンゴメリ法の
適用に問題はない。通常のモンゴメリ法におけるR=2^n
は、R(X) = X^nに置換える。また、モンゴメリ法におい
て重要であったMの決定の処理もまた、キャリー伝播を
行わないことによって実現することができる。従っ
て、ガロア体におけるモンゴメリ法の処理は、以下のよ
うに実現することができる。 [アルゴリズム６] W(X) = 0 ステップ１ For j = 0 to n-1 step +1{ ステップ２ T(X) = W(X) + A[j]*B(X) ステップ３ If T(X)mod X is 1 then W(X) = (T(X) + F(X))/X ステップ４ Else W = T/X ステップ５ } ここで、A[j]は、A(X)のｊ次の項の係数である。また、
和は係数毎の排他的論理和の意味で用いている。アル
ゴリズム６において、通常のモンゴメリ法において存在
した減算処理が必要ないのは、ガロア体においては、次
数を除いて大小関係が存在せず、A(X), B(X)の次数がn
次未満であれば(GF(2^n)では、これは常に成り立つ)、
上記アルゴリズム６において構成されるW(X)は、n次未
満となるので、そのままF(X)に対する剰余となるためで
ある。従って、通常の剰余乗算においてモンゴメリ法を
行う際に生ずる第三の問題は、ガロア体においては存在
しない。このアルゴリズム６で、正しい答が得られるこ
とを数値例によって確認しておく。簡単の為、GF(2^4)
において、F(X) = X^4 + X + 1とする。A(X) = X^3 + X
^2 + 1, B(X) = X^3 + X + 1に対し、GF(2^4)におけるA
とBの積、すなわち、A(X)*B(X) mod F(X)を計算する。
この場合、A[0] = 1, A[1] = 0, A[2] = 1, A[3] = 1で
あることに注意して、アルゴリズム６の処理を順に実行
する。（1） j = 0 のとき W(X) = 0, A[0] = 1であるから、ステップ3では、 W(X) = 0 + 1*(X^3 + X + 1) = X^3 + X + 1 となり、この定数項が1であるから、ステップ４を実行
し、 W (X) = (X^3 + X + 1 + X^4 + X + 1)/X = (X^4 + X^3 + 2*X + 2)/X = (X^4 + X^3)/X = X^3 + X^2 となる。ここで、2倍を0としていることに注意する。（2） j = 1のとき W(X) = X^3 + X^2, A[1] = 0であるから、ステップ３で
は、 W(X) = X^3 + X^2 + 0*(X^3 + X + 1) = X^3 + X^2 となる。この定数項は0であるので、ステップ５を実行
し、 W(X) = (X^3 + X^2)/X = X^2 + X となる。（3） j = 2のとき W(X) = X^2 + X, A[2] = 1であるから、ステップ３で
は、 W(X) = X^2 + X + 1*(X^3 + X + 1) = X^3 + X^2 + 2*X + 1 = X^3 + X^2 + 1 となる。この定数項は、1であるので、ステップ４を実
行し、 W(X) = (X^3 + X^2 + 1 + X^4 + X + 1)/X = (X^4 + X^3 + X^2 + X + 2)/X = (X^4 + X^3 + X^2 + X)/X = X^3 + X^2 + X + 1 となる。（4） j = 3のとき W(X) = X^3 + X^2 + X + 1, A[3] = 1であるから、ステ
ップ３では、 W(X) = X^3 + X^2 + X + 1 + 1*(X^3 + X + 1) = 2*X^3 + X^2 + 2*X + 2 = X^2 となる。定数項は、0であるので、ステップ５を実行
し、 W(X) = X^2/X = X となる。上記の計算で得られたW(X)は、A(X)*B(X)*R(X)
^(-1) mod F(X)になっているはずであるので、R(X)^(-
1) mof F(X)を除去する為にR(X)=X^4を乗ずると、 X*X^4 mod (X^4 + X + 1) = X*(X+1) = X^2 + X となる。この結果が正しいかどうかを確認する為に、A
(X)*B(X) mod F(X)を通常の方法で計算する。 A(X)*B(X) = (X^3 + X^2 + 1)*(X^3 + X + 1) = X^6 + X^5 + X^4 + 3*X^3 + X^2 + X + 1 = X^6 + X^5 + X^4 + X^3 + X^2 + X + 1 であるから、 A(X)*B(X) mod F(X) = X^6 + X^5 + X^4 + X^3 + X^2 + X + 1 mod (X^4 + X
+ 1) = (X^2 + X + 1)*(X^4 + X + 1) + X^2 + X mod (X^4 +
X + 1) = X^2 + X となり、確かに、アルゴリズム６で計算した値にR(X)を
乗じたものと同じ多項式が得られる。上記のガロア体に
おける計算においては、２を0に置換える操作をしてい
るが、これは、和の計算を排他的論理和で実行している
ことによる。このアルゴリズムをkビット毎に処理する
ように修正することは、通常の剰余乗算の場合と同様に
して実現することができる。GF(2^n)におけるモンゴメ
リ法のアルゴリズムを示しておく。 [アルゴリズム７] For j = 0 to 2^k 1 step +1{ ステップ１ TBL(H[j](X), B(X)) = H[j](X)*B(X) ステップ２ TBL(H[j](X), F(X)) = H[j](X)*F(X) ステップ３ } W(X) = 0 ステップ４ For j = 0 to n/k-1 step +1{ ステップ５ T(X) = W(X) + TBL(A[j,k](X), B(X)) ステップ６ U(X) = T(X) mod X^k ステップ７ M[j,k](X) = U(X)*F[0,k](X)^(-1) mod X^k ステップ８ W(X) = T(X) + TBL(M[j,k](X), F(X)) ステップ９ W(X) = W(X)/X^k ステップ１０ } ここで、アルゴリズム７における多項式H[j](X)は、ｊ
の二進数展開が、j = (j[k-1]j[k-2]...j[0])であると
きに、H[j](X) = j[k-1]*X^(k-1) + j[k-2]*X^(k-2) +
...+ j[0]を対応させて得られるものである。又、A[j,
k](X)は、多項式A(X)をビット列表現した場合のｊ番目
のkビットブロックに対応する多項式であり、F[0,k]
(X), M[j,k](X)も、これに準ずる。さらに、ステップ
６、９の和は、各項毎の排他的論理和を意味する。数値
例は省略する。GSM widely used in Europe
(Global Systems for Mobile communications) standard mobile phones such as mobile phones and IC cards (s
mart card), user authentication (user authenticity
n) and electronic commerce can be performed. Generally, when used as electric money, it takes the form of an IC card,
In the case of an M mobile phone (GSM mobile radiotelephone system), a form called SIM (Subscriber Identification Module) is used. Both SIMs and IC cards are semiconductor chips with terminals attached to a plastic plate, and the device itself is the same semiconductor chip in both cases, so the IC card will be described below. An IC card holds private information that cannot be rewritten without permission, and a cryptographic key (cryptographic k) that is secret information.
ey) for data encryption (encryption) and ciphertext (ciph
er text) is a device for performing decryption. The IC card itself does not have a power supply, and when it is inserted into a reader / writer for IC cards, it receives power supply and becomes operable. When it becomes operable, it receives a command transmitted from the reader / writer and performs processing such as data transfer according to the command. A general explanation of IC cards can be found in "IC Card" by Junichi Mizusawa, edited by The Institute of Electronics, Information and Communication Engineers, Ohmsha. As shown in FIG. 1, the configuration of the IC card is such that the IC card chip 1 is placed on the card 101.
It is equipped with 02. As shown in the figure, an IC card generally has a supply voltage terminal Vcc, a ground terminal GND, a reset terminal RST, an input / output terminal I / O, and a clock terminal CLK at the positions defined in the ISO7816 standard, and these terminals are connected through these terminals. , Power supply from the reader / writer and data communication with the reader / writer (W. Rankl and Effing: SMARTCARD H
ANDBOOK, John Wiley & Sons, 1997, pp.41). The configuration of the IC card chip is basically the same as that of a normal microcomputer. As shown in FIG. 2, the configuration is such that a central processing unit (CPU) 20
1, a memory device 204, an input / output (I / O) port 207, and a coprocessor 202 (there may be no coprocessor). The CPU 201 uses logical operations (logical o
The storage device 204 is a device that stores programs and data, and is a device that performs peration and arithmetic operation. The input / output port is a device that communicates with the reader / writer. The co-processor is a device that performs high-speed cryptographic processing itself or computations required for cryptographic processing. For example, a special computing device for performing a remainder operation of RSA cryptography or DES (Data Encryption Standard) processing. There is a cryptographic device for performing. Many IC card processors do not have a co-processor. The data bus 203 is a bus that connects each device. The storage device 204 is a ROM (Read Only Memory) or a RAM (Random Access).
Memory), EEPROM (Electrical Erasable Programmable
Read Only Memory). ROM is a memory that cannot be changed, and is a memory that mainly stores programs. RAM is a memory that can be rewritten freely, but if the power supply is interrupted, the stored contents will be lost. When the IC card is removed from the reader / writer, the power supply is interrupted and the RAM contents are no longer retained. E
EPROM is a memory that can retain its contents even when power supply is interrupted. This memory needs to be rewritten and is used to store the data to be held even if the IC card is removed from the reader / writer. For example, the usage frequency of the prepaid card is rewritten each time it is used, and since it is necessary to retain the data even if it is removed from the reader / writer, it is retained in the EEPROM. In order to realize user authentication and electronic money payment, public key cryptography technology is required. Public key cryptography is used by embedding secret information in public information, and since different keys are used on the transmission side and the reception side, it is also called asymmetric key cryptography. As a public key cryptosystem that is currently widely used,
There is RSA encryption. The RSA cryptography is based on the fact that it is easy to generate a product of large prime numbers, but it is difficult to factor a given composite number. In RSA cryptography, for a public modulus N and a given cipher text Y, a secret expo
nent) X is used to calculate Y ^ X mod N (Y ^ X means Y to the Xth power) to obtain plain text M
And vice versa. This operation is called modular exponentiation. Here, X, Y, N, as of 2001, since a very large number of 1024 bits to 2048 bits is used,
How to execute “Y ^ X mod N” at high speed has been a problem in the fields of applied mathematics and engineering. In particular,
In a device such as an IC card in which the capacity of a storage device is limited and the capacity of a CPU is low, the power-residue calculation is a very large task, and its high-speed calculation is a very important issue. Various power-residue calculation algorithms are known, but for example, the following ones are widely used. This is an addition chain
n method). In this algorithm, n corresponds to the bit length of X, and (X [n-1] X [n-2] ... X [1] X [0]) is the binary representation of X. . This algorithm is roughly the squared modular multiplication "A ^ 2 mod N" (step 4) and the modular multiplication "A = A * B.
mod N ”(step 5) is executed in combination and X [n-
If the number of 1's in 1], X [n-2], ..., X [1], X [0] is H (X), then the squared modular multiplication "A ^ 2 mod N" is repeated n times. The remainder multiplication "A = A * B mod N" is repeated H (X) times. It is confirmed by a numerical example that the correct result is obtained by the above algorithm 1. In terms of confirming the algorithm, it is sufficient to materialize the index X, so Y
And N will be used as they are. Calculate S = Y ^ 45 mod N according to Algorithm 1. Where the index 45
Can be written in binary as 45 = 101101 (binary). Therefore, when the exponent is expressed using the symbol of Algorithm 1, X [5] = 1, X [4] = 0, X [3] = 1, X [2] = 1, X [1] = 0,
X [0] = 1. The number of bits n is 6. The initial value of S is 1. (1) Since the initial value of S is 1 when j = 5, even if this is squared and the remainder is divided by N, it is 1 (step 4). Since X [5] = 1, step 5 is executed and S = 1 * Y mod N = Y mod N is obtained. (2) When j = 4, the value of S at this time is initially S = Y mod N, so if you take the square of this and take the remainder, you get S = Y ^ 2 mod N. Since X [4] = 0, step 5 is not executed and S
= Y ^ 2 mod N remains. (3) When j = 3, the value of S at this time is S = Y ^ 2 mod N at first.Therefore, if this is squared and divided by N, the remainder is S = Y ^ 4 mod N. Become. Since X [3] = 1, step 5 is executed and S = Y ^ 4 * Y mod N = Y ^ 5 mod N. (4) When j = 2, the value of S at this time is initially S = Y ^ 5 mod N, so if you take the remainder by squaring this and dividing by N, S = Y ^ 10 mod N
Becomes Since X [2] = 1, step 5 is executed and S = Y ^ 10 * Y mod N = Y ^ 11 mod N. (5) When j = 1, the value of S at this time is initially S = Y ^ 11 mod N, so
Square this and take the remainder divided by N, S = Y ^ 22 mod
N. Since X [1] = 0, step 5 is not executed and S = Y ^ 22 mod N remains. (6) When j = 0, the value of S at this time is S = Y ^ 22 mod N at first.Therefore, if this is squared and the remainder is divided by N, S = Y ^ 44 mod N
Becomes Since X [0] = 1, step 5 is executed and S = Y ^ 44 * Y mod N = Y ^ 45 mod N, and the desired result is obtained. As described above, in the algorithm 1, the power-residue calculation is decomposed into the modular multiplication (including the square) and executed, so that an arithmetic device having an arithmetic function of “A * B mod N” may be used. However, A, B, and N are all very large values. For example, assuming that the data length is 1024 bits, which is the current mainstream, the intermediate result A * B is 2048 bits.
There is the problem of a large number of bits. Furthermore, the value obtained by dividing A * B by N is the final result, so division that handles a large value of 2048 bits ÷ 1024 bits must be executed. Here, the multiplication can be performed in parallel by a microprocessor or the like by dividing the multiplier and the multiplicand, but it is difficult to parallelize the division, which has been a factor that impedes speeding up. Such modular multiplication "A * B mod N"
Algorithm for executing "A * B * R ^ (-1) mod N" without dividing by N to solve the problem of division in 2
It has been known. Here, R is 2 ^ n (n is a bit length of N, for example), and is a positive integer that satisfies R> N. Further, in the following, it is assumed that N is an odd number. This assumption is R
Elliptic curve cryptography (elliptic curl) on SA and prime field
ve cryptography) is a valid assumption. In fact
In RSA, N is a product of large prime numbers, and in prime field,
Modular arithmetic op modulo large prime
N is an odd number as well, since N. Algorithm 2 below is based on the mathematician Peter Montgomery (Pete
r Montgomery). The details of the argument leading to Algorithm 2 are omitted here, but for example, the paper PL Montgomery, “Modular Multiplication wit written by Montgomery himself.
hout Trial Division ”, Math. Comp., vol. 44, 1985,
pp.519-521, or the standard handbook in cryptography A. Menezes, PCvan Oorschot, S.
A. Vanstone, “Handbook of Applied Cryptography”,
CRC-press, 1997, pp. 602-603. Algorithm 2 requires that A * B * R ^ (-1) mod N
Therefore, in order to execute the modular exponentiation operation in the RSA cryptography using the device that performs this operation, it is necessary to use a modified algorithm such as the following algorithm 3. [Algorithm 3] input Y, X = (X [n-1] X [n-2] ... X [1] X [0]), N output Y ^ X mod NA = R mod N Step 1 B = Y * R mod N Step 2 For j = n-1 to 0 step 1 {Step 3 A = (A ^ 2) * R ^ (-1) mod N Step 4 If X [j] = 1 then A = A * B * R ^ (-1) mod N step 5} A = A * R ^ (-1) mod N step 6 output A step 7 In algorithm 3 during steps 1-5, a multiplier (mul
The tiplier) and the multiplicand are multiplied by R mod N. Hereinafter, this data format will be referred to as Montgomery format. The above Montgomery multiplication "ABR ^ (-
1) In "mod N", the Montgomery form is kept unchanged. Actually, Mont (A) = A * R mod
If you write like N, Mont (A) * Mont (B) * R ^ (-1) mod N = A * R * B * R * R ^ (-1) mod N = A * B * R mod N = Mont (A * B). Therefore, for example, "A * B * R ^ (-1) mod N", "(A
^ 2) * R ^ (-1) mod N ”,“ A * R ^ (-1) mod N ”If there is a computing unit, the procedure of Algorithm 3 above can be performed using the CPU and these computing units. It can run fast.
"(A ^ 2) * R ^ (-1) mod N" and "A * R ^ (-1) mod N" are B = in "A * B * R ^ (-1) mod N", respectively. Since A and B = 1, it can be seen that the modular exponentiation operation can be realized by configuring an arithmetic unit "A * B * R ^ (-1) mod N". Next, in step 2, "A * B * R ^ (-1)
Here is a brief summary of what has been done so far on how to speed up the algorithm for calculating "mod N". For the sake of explanation, Algorithm 2 is shown again. Known typical speed-up methods can be classified as follows. First speed-up method: A method that reduces the number of loop iterations including steps 3, 4, and 5 Second speed-up method: A method that parallelizes the processing of steps 3, 4, and 5 Third speed-up method: Replacing the processing in step 6 with a simpler method A typical one of the first speed-up methods is to execute addition processing collectively for a plurality of bits. For example, a method for collectively executing k-bit processing is shown in Algorithm 4 below. [Algorithm 4] For j = 0 to 2 ^ k 1 step +1 {Step 1 TBL (j, B) = j * B Step 2 TBL (j, N) = j * N Step 3} W = 0 Step 4 For j = 0 to n / k-1 step +1 {Step 5 T = W + TBL (A [j, k], B) Step 6 U = T mod 2 ^ k Step 7 M [j, k] = -U * N [0, k] ^ (-1) mod 2 ^ k Step 8 W = T + TBL (M [j, k], N) Step 9 W = W / 2 ^ k Step 10} If W> = N then W = WN Step 11 Output W Step 12 Here, for simplicity, k is assumed to be a divisor of n. For example, when n = 1024, k = 2, 4, 8, etc. may be selected. A [j,
k] indicates the value of the j-th k-bit block from the bottom of A. Similarly, M [j, k] indicates the value of the j-th k-bit block from the bottom of M, and N [0, k] indicates the value of the lowest k-bit block of N. In addition, the calculation in step 8 is performed so that the value becomes positive. In the algorithm 4, if k = 1, then N is an odd number, so that M matches U and algorithm 2 is obtained. Based on this method,
A method of performing modular multiplication is disclosed in Japanese Patent Laid-Open No. 7-20778, Masahiko Takenaka et al., "Residual Calculator, Table Creating Device and Multiplicative Residual Calculator". Algorithm 4 is a little complicated, so its mechanism will be briefly described. Before explaining the contents of the algorithm 4, it is necessary to briefly understand the principle of the Montgomery method. The Montgomery method is, in principle, equivalent to solving the indefinite equation (Diophantine equation) in which M and W are unknowns: (Equation 1) A * B + M * N = W * R. First, N is an odd number and R = 2 ^ n
Note that (Equation 1) has an integer solution, since The assumption that N is odd is essential, as (Equation 1) does not necessarily have an integer solution when N is even. In order to understand the meaning of (Equation 1), FIG. 3 is a diagram expressing this. Since R = 2 ^ n, the lower n bits of the right side of (Equation 1) are all 0. Therefore, finding M that satisfies (Equation 1) means that the lower n bits of AB and MN are 0. It is nothing but the operation of choosing M. This upper half bit string is the value of A * B * R ^ (-1) mod N to be obtained. However, the sum of the number of 2n bits and the number of 2n bits becomes 2n + 1 bits, and the most significant bit may be 1 (hereinafter, this bit is referred to as an OV bit). This is resolved in step 11 of Algorithm 4. R = 2 ^ on both sides of (Equation 1)
Taking the remainder divided by n gives (Equation 2) (A * B mod 2 ^ n) + (M * N mod 2 ^ n) = 0. (Equation 2) is transformed to obtain (Equation 3) M = -A * B * N ^ (-1) mod 2 ^ n. Since the modulus is power of 2, this M can be obtained in order for each bit block. This is the meaning of Algorithm 4. In Algorithm 4, since the number of loops is 1 / k, roughly speaking, the speed of the loop portion is about k times. However, in this case, if k is increased, there is a problem that the size of the multiplication table is increased. As the size increases, the RAM area gets under pressure. Generally, in devices such as IC cards, since the RAM area is small, it is often impossible to make k large.
This is the first problem. The second speed-up method can be used together with the first speed-up method. Note that in Algorithm 4, the value of M is determined when the lower k bits of T are determined (step 8), and it can be seen that the process of step 9 can be executed immediately after the determination. Therefore, if two n-bit adders are installed and a circuit is configured to execute step 6 and step 9 in parallel, the speed is almost doubled. However, this method requires twice the number of adders. Not only does this mean that the number of computing units is doubled, but it also reduces the power consumption during execution.
(power consumption) also means doubling. This is the second problem. The third speed-up method is
It is independent of the first and second speed-up methods. As is well known, the comparison process, such as step 11 of Algorithm 4, actually performs the subtraction and the negative flag (negative fl
Achieved by examining ag). Since this is a subtraction process between large numbers such as 1024 bits, the processing time may not be ignored. On the other hand, in the computer, R = 2 ^
Comparison with n is easy. Therefore, the last comparison / subtraction processing unit (step 11) can be speeded up by using the following modified algorithm 5. Regarding speeding up by this change, Japanese Patent Laid-Open No.
-21057, Kunihiko Nakada "Data Processor and Microcomputer", Kunihiko Nakada: DATA PROCESSOR AND MICR
OPROCESSOR, United States Patent, US005961578A. [Algorithm 5] For j = 0 to 2 ^ k 1 step +1 {Step 1 TBL (j, B) = j * B Step 2 TBL (j, N) = j * N Step 3} W = 0 Step 4 For j = 0 to n / k-1 step +1 {Step 5 T = W + TBL (A [j, k], B) Step 6 U = T mod 2 ^ k Step 7 M [j, k] = -U * N [0, k] ^ (-1) mod 2 ^ k Step 8 W = T + TBL (M [j, k], N) Step 9 W = W / 2 ^ k Step 10} If W> = R then W = WN Step 11 Output W Step 12 Here, the comparison processing in Step 11 can be realized only by determining whether or not the OV bit in FIG. 3 is 1. This is much lighter and faster than the subtraction process. However, the value obtained by Algorithm 5 is A * B *
It may be A * B * R ^ (-1) mod N + N rather than R ^ (-1) mod N itself. However, since the bit length remains n bits, in the modular exponentiation operation, it is only necessary to finally modify the value. The present invention is irrelevant to this speed-up method and is related to the first and second speed-up methods. The problems 1 and 2 above are elliptic curve cryptography.
to system) Galois field of characteristic 2 (Galois field of
It also occurs in characteristic 2). Let me briefly explain the concept of the Galois field. The Galois field itself is a purely mathematical concept, but a suitable isomorphism (homom
Orphism) can be translated into a specific operation. The simplest method to implement a Galois field on a computer is modular polynomial
arithmetic operation) is used. First, consider an entire polynomial whose coefficient is 0 or 1, and let this set be POLY. It is an element of POLY and is an irreducible polynomial of degr
ee n) F (X), and for the two elements of POLY, A (X) and B (X), the sum (which is synonymous with the difference) is the sum of the coefficients in the normal polynomial sum. It is defined to be performed in the meaning of exclusive OR, and the product is defined by A (X) * B (X) mod F (X). As is well known in algebra, F
Elements of POLY that are not multiples of (X) have the inverse of the product operation described above. In POLY, A (X) and B
The fact that (X) is equivalent means that these differences A (X) -B
(X) is defined as a multiple of F (X), and POLY is defined by this equivalence relation.
Write GF (2 ^ n) that is divided into valent class). This is,
It is a Galois field with characteristic 2. Algebraic structure of GF (2 ^ n) (algebrai
It is well known that (c structure) does not depend on how to select the irreducible polynomial F (X) of degree n. In terms of implementation, 3 terms
And a polynomial consisting of 5 terms is often chosen. This polynomial F
(X) is sometimes called reduction polynomial. The Galois field product operation is realized by performing addition processing that appears in a normal product operation in the sense of exclusive OR. In the adder, if carry propagation is not performed, the value becomes the same as that of performing bitwise exclusive OR, and the Galois field product operation is realized. In the Montgomery method, essentially the same operation is performed. The condition that "the modulus N is an odd number", which was necessary in the ordinary Montgomery method, is replaced by the condition that the constant term of F (X) is 1, which means that F (X) is irreducible. Therefore, it follows automatically and there is no problem in applying the Montgomery method. R = 2 ^ n in normal Montgomery method
Is replaced by R (X) = X ^ n. Further, the process of determining M, which is important in the Montgomery method, can also be realized by not carrying carry. Therefore, the processing of the Montgomery method in the Galois field can be realized as follows. [Algorithm 6] W (X) = 0 Step 1 For j = 0 to n-1 step +1 {Step 2 T (X) = W (X) + A [j] * B (X) Step 3 If T ( X) mod X is 1 then W (X) = (T (X) + F (X)) / X Step 4 Else W = T / X Step 5} where A [j] is A (X) It is the coefficient of the j-th term. Also,
The sum is used to mean an exclusive OR for each coefficient. The algorithm 6 does not require the subtraction process that is used in the ordinary Montgomery method because the Galois field does not have a size relation except the order, and the order of A (X) and B (X) is n.
If less than (for GF (2 ^ n) this is always true),
This is because W (X) configured in the above algorithm 6 is less than the nth order, and is therefore the remainder to F (X) as it is. Therefore, the third problem that occurs when performing the Montgomery method in ordinary modular multiplication does not exist in the Galois field. It is confirmed by a numerical example that a correct answer can be obtained by this algorithm 6. GF (2 ^ 4) for simplicity
In, F (X) = X ^ 4 + X + 1. A (X) = X ^ 3 + X
^ 2 + 1, B (X) = X ^ 3 + X + 1 for A in GF (2 ^ 4)
And B are calculated, that is, A (X) * B (X) mod F (X) is calculated.
In this case, note that A [0] = 1, A [1] = 0, A [2] = 1, A [3] = 1, and the processing of Algorithm 6 is sequentially executed. (1) W (X) = 0, A [0] = 1 when j = 0, so in step 3, W (X) = 0 + 1 * (X ^ 3 + X + 1) = X ^ 3 + X + 1 and this constant term is 1, so execute step 4 and write W (X) = (X ^ 3 + X + 1 + X ^ 4 + X + 1) / X = (X ^ 4 + X ^ 3 + 2 * X + 2) / X = (X ^ 4 + X ^ 3) / X = X ^ 3 + X ^ 2. Note that the double is 0. (2) W (X) = X ^ 3 + X ^ 2, A [1] = 0 when j = 1, so in step 3, W (X) = X ^ 3 + X ^ 2 + 0 * (X ^ 3 + X + 1) = X ^ 3 + X ^ 2. Since this constant term is 0, step 5 is executed and W (X) = (X ^ 3 + X ^ 2) / X = X ^ 2 + X. (3) W (X) = X ^ 2 + X, A [2] = 1 when j = 2, so in step 3, W (X) = X ^ 2 + X + 1 * (X ^ 3 + X + 1) = X ^ 3 + X ^ 2 + 2 * X + 1 = X ^ 3 + X ^ 2 + 1. Since this constant term is 1, execute step 4, and W (X) = (X ^ 3 + X ^ 2 + 1 + X ^ 4 + X + 1) / X = (X ^ 4 + X ^ 3 + X ^ 2 + X + 2) / X = (X ^ 4 + X ^ 3 + X ^ 2 + X) / X = X ^ 3 + X ^ 2 + X + 1. (4) When j = 3, W (X) = X ^ 3 + X ^ 2 + X + 1, A [3] = 1, so in step 3, W (X) = X ^ 3 + X ^ 2 + X + 1 + 1 * (X ^ 3 + X + 1) = 2 * X ^ 3 + X ^ 2 + 2 * X + 2 = X ^ 2. Since the constant term is 0, step 5 is executed and W (X) = X ^ 2 / X = X. W (X) obtained by the above calculation is A (X) * B (X) * R (X)
It should be ^ (-1) mod F (X), so R (X) ^ (-
1) Multiplying R (X) = X ^ 4 to remove mof F (X), X * X ^ 4 mod (X ^ 4 + X + 1) = X * (X + 1) = X ^ 2 + X. To confirm whether this result is correct, A
(X) * B (X) mod F (X) is calculated in the usual way. A (X) * B (X) = (X ^ 3 + X ^ 2 + 1) * (X ^ 3 + X + 1) = X ^ 6 + X ^ 5 + X ^ 4 + 3 * X ^ 3 + X ^ 2 + X + 1 = X ^ 6 + X ^ 5 + X ^ 4 + X ^ 3 + X ^ 2 + X + 1, so A (X) * B (X) mod F (X) = X ^ 6 + X ^ 5 + X ^ 4 + X ^ 3 + X ^ 2 + X + 1 mod (X ^ 4 + X
+ 1) = (X ^ 2 + X + 1) * (X ^ 4 + X + 1) + X ^ 2 + X mod (X ^ 4 +
X + 1) = X ^ 2 + X, and certainly the same polynomial as the value calculated by Algorithm 6 multiplied by R (X) is obtained. In the above calculation in the Galois field, the operation of replacing 2 with 0 is performed, but this is because the calculation of the sum is executed by the exclusive OR. Modification of this algorithm to process every k bits can be realized in the same manner as in the case of normal modular multiplication. The algorithm of Montgomery method in GF (2 ^ n) is shown. [Algorithm 7] For j = 0 to 2 ^ k 1 step +1 {Step 1 TBL (H [j] (X), B (X)) = H [j] (X) * B (X) Step 2 TBL (H [j] (X), F (X)) = H [j] (X) * F (X) Step 3} W (X) = 0 Step 4 For j = 0 to n / k-1 step + 1 {Step 5 T (X) = W (X) + TBL (A [j, k] (X), B (X)) Step 6 U (X) = T (X) mod X ^ k Step 7 M [ j, k] (X) = U (X) * F [0, k] (X) ^ (-1) mod X ^ k Step 8 W (X) = T (X) + TBL (M [j, k ] (X), F (X)) Step 9 W (X) = W (X) / X ^ k Step 10} where polynomial H [j] (X) in Algorithm 7 is j
H [j] (X) = j [k-1] * X when the binary expansion of is j = (j [k-1] j [k-2] ... j [0]) ^ (k-1) + j [k-2] * X ^ (k-2) +
... + j [0] are associated with each other. Also, A [j,
k] (X) is a polynomial corresponding to the j-th k-bit block when the polynomial A (X) is expressed as a bit string, and F [0, k]
The same applies to (X) and M [j, k] (X). Furthermore, the sum of steps 6 and 9 means the exclusive OR of each term. Numerical examples are omitted.

【０００３】[0003]

【発明が解決しようとしている課題】本発明は、大きな
数に対する剰余乗算（modular multiplication）及びべ
き乗剰余演算(modular exponentiation)を実行する為の
マイクロコンピュータ並びにその実行方法に関し、モン
ゴメリ(Montgomery)の方法(P. L. Montgomery, ”Modul
ar Multiplication without Trial Division”, Math.
Comp., vol. 44, 1985, pp.519-521)に基づき、これを
該マイクロコンピュータ上に実装する際に上記SUMMARY OF THE INVENTION The present invention relates to a microcomputer for performing modular multiplication and modular exponentiation on a large number, and a method of executing the same. The method of Montgomery PL Montgomery, ”Modul
ar Multiplication without Trial Division ”, Math.
Comp., Vol. 44, 1985, pp.519-521), the above-mentioned when mounting this on the microcomputer.

【従来の技術】において説明した二つの課題、すなわ
ち、（第一の課題）同時に複数ビットを処理する場合に必要
な乗算テーブルが肥大化してしまうこと（第二の課題）A*B, M*Nの処理を同時に実行して二倍の
速度を実現する場合、演算器の量及び消費電力が二倍に
なってしまうことを解決し、又、ガロア体GF(2^n)の積
演算に対して同様の問題を解決しようとするものであ
る。本発明の前記ならびにその他の目的と新規な特徴につい
ては、本明細書の記述及び添付図面から明らかになるで
あろう。2. Description of the Related Art Two problems explained in the prior art, namely, (first problem) enlargement of multiplication table necessary for simultaneously processing a plurality of bits (second problem) A * B, M * When the processing of N is executed simultaneously and the double speed is realized, it solves the problem that the amount of the arithmetic unit and the power consumption are doubled, and the product operation of Galois field GF (2 ^ n) is solved. On the other hand, it tries to solve the same problem. The above and other objects and novel features of the present invention will be apparent from the description of this specification and the accompanying drawings.

【０００４】[0004]

【課題を解決するための手段】本願において開示される
発明の概要を説明すれば、下記の通りである。下記モン
ゴメリ法のアルゴリズム４（再掲）を考える。 [アルゴリズム４（再掲）] For j = 0 to 2^k 1 step +1{ ステップ１ TBL(j, B) = j*B ステップ２ TBL(j, N) = j*N ステップ３ } W = 0 ステップ４ For j = 0 to n/k-1 step +1{ ステップ５ T = W + TBL(A[j,k], B) ステップ６ U = T mod 2^k ステップ７ M[j,k] = -U*N[0,k]^(-1) mod 2^k ステップ８ W = T + TBL(M[j,k], N) ステップ９ W = W/2^k ステップ１０ } If W >= N then W = W N ステップ１１ Output W ステップ１２先にThe outline of the invention disclosed in the present application is as follows. Consider Algorithm 4 (repost) of the Montgomery method below. [Algorithm 4 (repost)] For j = 0 to 2 ^ k 1 step +1 {Step 1 TBL (j, B) = j * B Step 2 TBL (j, N) = j * N Step 3} W = 0 Step 4 For j = 0 to n / k-1 step +1 {Step 5 T = W + TBL (A [j, k], B) Step 6 U = T mod 2 ^ k Step 7 M [j, k] = -U * N [0, k] ^ (-1) mod 2 ^ k Step 8 W = T + TBL (M [j, k], N) Step 9 W = W / 2 ^ k Step 10} If W > = N then W = WN Step 11 Output W Step 12 First

【従来の技術】において述べたように、アルゴリズム４
における処理ステップ６及び９は、大きな数の加算処理
であり、ハードウエアの大部分は、この加算器であると
考えてよい。ステップ６とステップ９の処理は、A*B の
処理と M*N の処理のkビットブロックに対する処理であ
るので、これを合成し、A*B + M*Nのkビットブロックに
対する加算処理を行ってもよいことがわかる。各ループ
一回に対応する加算処理は、「kビットの数×B」及び
「kビットの数×N」という形をしているので、必要とな
る新たなテーブルは、次のような形になることがわか
る。 TBL( j, t ) = ｊ*B + ｔ*N(j, t = 0, 1, 2, ..., 2^k
1) このテーブルの選択には、二つのパラメータj, tが必要
である。これらのパラメータは、Wに加えたときに、そ
の和の下位kビットが0になるように選べばよい。すなわ
ち、第jステップの処理に際し、テーブルとして、 TBL( A[j,k], M[j,k] ) = A[j,k],*B + M[j,k]*N を選べばよい。M[j,k]は、A[j,k]にも依存して決まる値
である。下記のアルゴリズム８は、テーブルをあらかじ
め並べ替えておき、Mの計算を行わずに済むようにした
ものである。 [アルゴリズム８] For j = 0 to 2^k-1 step 1{ ステップ１ For s = 0 to 2^k-1 step 1{ ステップ２ U = (s + j*B[0, k]) mod 2^k ステップ３ M =−U*N[0,k]^(-1) mod 2^k ステップ４ TBL[ j , s ] = j*B + M*N ステップ５ } } W = 0 ステップ６ For i = 0 to n/k-1 step 1｛ステップ７ W = W + TBL(A[i, k], W[0, k]) ステップ８ W = W/2^k ステップ９ } If W >= N then W = W N ステップ１０ Output W ステップ１１上記アルゴリズム８は、本発明の本質を表している。ア
ルゴリズム４において二個所に現れた加算処理、すなわ
ち、アルゴリズム４におけるステップ６及び９の処理
は、アルゴリズム８においては、一つの加算処理になっ
ている。もし、アルゴリズム４におけるステップ６及び
９の処理を順に計算していた場合、アルゴリズム７を採
用することにより、計算速度が約2倍となる。又、多く
コプロセッサが行っているようにアルゴリズム４におけ
るステップ６及び９の処理を並列に実行している場合
は、加算器が二個必要だが、アルゴリズム７を採用した
場合は、加算器の数を半分にすることができ、さらに、
このことの直接の結果として、消費電力が減少する。こ
のアルゴリズム７において、キャリーの伝播を行わなけ
れば、GF(2^n)の積演算が実現することは、2. Description of the Prior Art Algorithm 4
The processing steps 6 and 9 in 1. are large number addition processes, and most of the hardware can be considered to be this adder. The processing of steps 6 and 9 is processing for k * bit blocks of A * B processing and M * N processing. Therefore, these are combined and addition processing is performed for k * bit blocks of A * B + M * N. I know I can go. Since the addition process corresponding to each loop is of the form "k-bit number x B" and "k-bit number x N", the required new table has the following form. You can see. TBL (j, t) = j * B + t * N (j, t = 0, 1, 2, ..., 2 ^ k
1) Two parameters, j and t, are needed to select this table. These parameters can be selected so that when added to W, the lower k bits of the sum will be zero. In other words, if you select TBL (A [j, k], M [j, k]) = A [j, k], * B + M [j, k] * N as a table in the processing of the j-th step, Good. M [j, k] is a value that also depends on A [j, k]. The following algorithm 8 is one in which the tables are rearranged in advance so that the calculation of M is unnecessary. [Algorithm 8] For j = 0 to 2 ^ k-1 step 1 {Step 1 For s = 0 to 2 ^ k-1 step 1 {Step 2 U = (s + j * B [0, k]) mod 2 ^ k step 3 M = −U * N [0, k] ^ (-1) mod 2 ^ k step 4 TBL [j, s] = j * B + M * N step 5}} W = 0 step 6 For i = 0 to n / k-1 step 1 {Step 7 W = W + TBL (A [i, k], W [0, k]) Step 8 W = W / 2 ^ k Step 9} If W> = N then W = WN Step 10 Output W Step 11 The above algorithm 8 represents the essence of the present invention. The addition process that appears in two places in Algorithm 4, that is, the processes of steps 6 and 9 in Algorithm 4 are one addition process in Algorithm 8. If the processes of steps 6 and 9 in the algorithm 4 are sequentially calculated, the calculation speed is approximately doubled by adopting the algorithm 7. Also, when the processes of steps 6 and 9 in algorithm 4 are executed in parallel as many coprocessors do, two adders are required, but when algorithm 7 is adopted, the number of adders is increased. Can be halved, and
As a direct result of this, the power consumption is reduced. In this algorithm 7, if carry propagation is not performed, the product operation of GF (2 ^ n) is realized as

【従来の技術】にて説明した通りである。従って、GF(2
^n)の積演算回路に対しても、上記の通常の剰余乗算に
おいて生ずる本発明の効果が現れる。This is as described in "Prior Art". Therefore, GF (2
The effect of the present invention that occurs in the above normal modular multiplication also appears for the product operation circuit of (^ n).

【０００５】[0005]

【発明の実施の形態】以下に前記アルゴリズム８を再掲
する。以下、このアルゴリズムに従って、回路を構成す
る方法を述べる。 [アルゴリズム８（再掲）] For j = 0 to 2^k-1 step 1{ ステップ１ For s = 0 to 2^k-1 step 1{ ステップ２ U = (s + j*B[0, k]) mod 2^k ステップ３ M =−U*N[0,k]^(-1) mod 2^k ステップ４ TBL[ j , s ] = j*B + M*N ステップ５ } } W = 0 ステップ６ For i = 0 to n/k-1 step 1｛ステップ７ W = W + TBL(A[i, k], W[0, k]) ステップ８ W = W/2^k ステップ９ } If W >= N then W = W N ステップ１０ Output W ステップ１１アルゴリズム７の処理は、大きく分けて三つの部分から
構成される。ずなわち、ステップ１からステップ５まで
の「テーブル生成部」、ステップ６からステップ９まで
の「加算処理部」、及び、ステップ10，11の「値修正
部」の三つである。まず、一般性の高い「加算処理部」
を説明し、次に「値修正部」、最後に「テーブル生成
部」を説明し、これらを統合した回路構成について説明
する。加算器の構成法は種々のものが知られているが、
ここでは代表的なものについて説明する。加算器は、1
ビット毎の加算器を並べて構成される。今、二つのmビ
ットの数S及びTの加算を行う回路を構成する。以下の説
明において、それぞれの二進数表示S = (S[m-1]S[m-
2]...S[1]S[0]), T = (T[m-1]T[m-2]...T[1]T[0])及び
その和Z = (Z[m]Z[m-1]...Z[1]Z[0])を用いる。加算処
理をビット単位に分解して考えたとき、ｊ番目のビット
の加算に必要な値は、S[j], T[j]及び、直前の計算で生
じたキャリーC[j-1]である。但し、C[-1] = 0及び、Z
[m] = C[m-1]と定めておく。図４は、1ビットの加算を
行う全加算器(full adder)の例である。j番目の1ビット
全加算器は、S[j], T[j]及び、前段のキャリー信号C[j-
1]を入力として、Zの第ｊビットの値Z[j]及びキャリー
信号C[j]を出力する回路として捉えることができる。図
４の回路は、論理式：（式４） Z[j] = (S[j] AND T[j] AND C[j-1]) OR (S
[j] AND NOT(T[j]) AND NOT(C[j-1])) OR (NOT(S[j]) A
ND T[j] AND NOT(C[j-1])) OR (NOT(S[j]) AND NOT(T
[j]) AND C[j-1]), （式５） C[j] = NOT((S[j] AND NOT(T[j]) AND NOT(C
[j-1])) OR (NOT(S[j]) AND T[j] AND NOT(C[j-1])) OR
(NOT(S[j]) AND NOT(T[j]) AND C[j-1]) OR (NOT(S
[j]) AND NOT(T[j]) AND NOT(C[j-1])) を回路化したものである。図４の401, 402, 403, 404,
405は、AND回路であり、406, 407はOR回路、409, 410,
411, 412, 413, 414, 415, 416, 417は、インバータ（N
OT回路）である。これらが、それぞれ、（式４）、（式
５）のAND, OR,NOTに対応する。AND，OR，NOTはぞれぞ
れ、いくつかのトランジスタから構成される論理ゲート
(logic gate)である。ANDは、入力の全てが1のときのみ
1を出力し、それ以外では0を出力する論理ゲートであ
り、ORは、入力のどれか1つでも1であれば出力が1にな
り、全て0のときのみ出力が0となる論理ゲートである。
NOT（インバータ）は、入力0に対しては1を出力し、入
力1に対しては、0を出力する論理ゲートである。1ビッ
トの全加算器を並べれば、複数ビットの全加算器を作る
ことができる。図５に、4ビットの全加算器を示す。50
1，502，503，504は、図４に示した1ビットの全加算器
であって、（式４）、（式５）に従って計算がなされ
る。最下位のビットでは、キャリーを0にセットしてお
く。又、最後のキャリーはZ[4]となる。ここでは、4ビ
ットの場合を示したが、1ビットの全加算器を増やすこ
とによって、何ビットの加算でもできることは明らかで
ある。ここに示した全加算器の構成方法は、「桁上げ伝
播加算器」(ripple carry adder)と呼ばれる最も素朴な
ものであり、桁上げ伝播時間(carry propagation time)
の比率が大きく、最も加算時間のかかる回路である。よ
り高速に加算を行う為のハードウエアアルゴリズムは、
数多く存在するが、加算器の構成法は、本発明の本質と
は無関係であるので、これ以上の説明は省略する。次
に、値修正部の処理について説明する。値修正部では、
減算処理(subtraction)が用いられる。減算器(subtract
er)は、加算器と似た構造をしており、桁上げ伝播加算
器を利用して構成することができる。このため、通常の
計算機は、加算回路は持っているが、減算回路は持って
いない。図6は、全加算器を利用した4ビットの減算回路
（Z = S T）の構成例である。601，602，603，604は1ビ
ットの全加算器であるが、入力に際し、最初のキャリー
（ボロー）が1になっている点と、Tをインバータ605，6
06，607，608によって反転させているという点が異な
る。減算では、最後のキャリー（ボロー）が1になると
きは、結果が負になったことを意味している。この現象
はアンダーフロー(under flow)と呼ばれる。上記の減算
器の概念を用いて、値修正部の構成例を説明する。図７
は、値修正回路のブロック図(block diagram)である。n
+1ビットのレジスタ702には、値Wが、nビットのレジス
タ703には、モジュラスNの値が含まれており、両者が、
n+1ビットの減算器704によって、W-Nが計算される。こ
の値W-Nと、この演算の結果として発生するボローを選
択信号(selection signal)、及び、Wの値がセレクタ705
に入力され、選択信号が0であれば、セレクタ705は、W-
Nを選択して出力し、選択信号が1であれば、Wを選択し
て、これを出力する。ここでは、W，Nは、専用のレジス
タとして説明したが、これをCPU処理で利用するRAM上に
配置することもできることは言うまでもない。次に、テ
ーブル生成部の回路について説明する。簡単の為、k=2
の場合のみ示す。k=2の場合のテーブル生成部で生成さ
れる値を列挙すると、図8の表のようになる。この構成
例では、入力値A，Bに依存しないモジュラスの値Nの倍
数、N, 2*N， 3*Nについては、あらかじめ計算して、こ
れをレジスタ909，910，911に保持しておき、これを用
いる。また、入力値Bは、入力値そのものであるから、
計算する必要はない。Bは、レジスタ906に保持されてい
る。0及び、上記の４つの値B, N, 2*N, 3*N以外の11個
の値は、毎回計算しなければならない。ここでは、加算
器のみ利用して、前記11個の値を計算する構成を示し
た。レジスタの大きさは、最大で（k+1=）3ビット大き
く取らなければならない。各レジスタは、データバス92
1に接続され、前記データバス921は、加算処理部にデー
タを供給する信号線として利用される。904は、n+2ビッ
トの数の加算を行う加算器であり、n+3ビットの和の値
をセレクタ903に送る。制御装置905は、クロック信号か
ら決定された計算開始信号を受けて内部のカウンタを二
進数表示で、0001に初期化し、セレクタ901を経由し
て、Bの値を加算器904の入力（両方）として加算器904
を動作させる。加算器904で計算が始まる直前に、前記
加算器904から制御装置905に制御信号を送る。制御装置
905は、前記制御信号を受けて、その内部のカウンタを
インクリメントし、二進数表示で、0010という値を得、
この下位2ビット00と上位2ビットをそれぞれ、j*B + t*
Nにおけるj、ｔとして、セレクタ903が、加算器904の答
を書き込むレジスタ907の位置を決定し、前記加算結果
を書き込む。次に、再び計算開始信号が制御装置905に
送信され、制御装置905は、セレクタ901に制御信号を送
り、Bと2*Bを読み出して、加算器904に入力し、先と同
じ動作をしてレジスタ908に値3*Bを書き込む。ここまで
の動作で、テーブルを合成する為の値、B, 2*B, 3*B,
N, 2*N, 3*Nが全て揃ったことになる。以下同様に制御
装置のカウンタの値から、読み出すレジスタと書き込む
レジスタを決定し、同様の動作を繰り返す。但し、カウ
ンタの下位2ビットが00のときは、加算動作は、スキッ
プするものとする。こうして、全ての値が揃った後は、
レジスタ906, 907, 908, 909, 910, 911, 912, 913, 91
4, 915, 916, 917, 918, 919, 920は、読み出し専用の
レジスタとして利用される。次に、これらを組み合せ
て、A*B*R^(-1) mod Nを計算し、結果をAレジスタに書
き込む回路のブロック図を図１０に示す。但し、値調整
部については、煩雑さを避ける為省略した。一般に、値
調整部で必要な減算処理は、加算器1015の入力値を変更
することによって実現する。図１０は、本発明における
重要な構成を全て含んでいる。以下、変数名は、アルゴ
リズム８で用いたものと同じものを利用する。図10にお
いて、まず、Aレジスタ1003に格納されたAの値を下位か
ら順に読み出す。Aレジスタの構成は、図１１に示され
ている。この際、制御装置1007は、読み出す2ビットブ
ロックの番号iを9ビットレジスタ1006から受け取って、
該当する2ビットブロックの値を読み出し、これをレジ
スタ1008に送信する。レジスタ1009には、Wの下位2ビッ
トが格納されている。1008，1009の値が、値選択及び信
号制御装置1011に送信され、1011は、該当するデータを
データバス1001を介して1002の値テーブルから読み出
し、値を1027ビットと1026ビットの値の加算を行う加算
器1015に送る。但し、前記加算器1015には、専用のレジ
スタ1013，1014が接続されており、テーブルから読み出
された値は、一時的に1027ビットのレジスタ1014に保持
される。1013は、1026ビットのレジスタであり、アルゴ
リズム７に現れるWが格納される。動作開始時には0にク
リアされている。値テーブル1002に含まれるj*B + t*N
という形をしたデータとWの和が加算器1015で加算され
る。この加算結果は、下位2ビットが二進数表示で00に
なっているので、これを2ビット右シフタ1016で切り落
とし、制御装置1017に入力される。一方、加算器1015
は、計算開始と同時に、値1を加算器1018に送り、加算
器は、9ビットレジスタ1019の値と、計算開始信号
「１」を合計し、その結果を9ビットレジスタ1019に送
る。1019には、最初0が格納されている。この和は、制
御装置1017に送られ、制御装置1017は、前記の和が、51
2に達していなければ、1016の出力結果をそのままレジ
スタ1013に書き込み、1013の値は、1011にてWの下位2ビ
ットのみが取り出されて、レジスタ1009に書き込まれ
る。又、この際に、1017から、1011を経由して1007に信
号が送られ、加算器1005によりiの値がインクリメント
され、Aレジスタの次の2ビットブロックが読み出され、
先に延べたのと同じ過程を経てテーブルの値との加算が
行われる。1019の値が、512に達していれば、選択的透
過回路1012に信号を送り、1013にあるWの値をAレジスタ
に送る。この時点でのAレジスタの値は、A*B*R^(-1) mo
d N又は、A*B*R^(-1) mod N＋Nである。後者では、Aか
らNを減算する処理が必要になる。減算処理部は省略す
る。これは、本発明請求項１，7の実施例の一つであ
る。上記実施例は、本発明の要旨を逸脱しない範囲にお
いて種々変更可能である。例えば、以下のような変更が
考えられる。（１）加算器1015は、ここでは、1027ビットを同時に
加算する構成になっているが、これを複数のブロックに
分割して、これらをブロック毎に順に加算していくこと
により実現することもできる。この場合、ハード量は減
るが、速度は落ちる。（２） 2ビット左シフタ1016を用いないで、直接1013の
レジスタに下位2ビットより上位のビットのみ接続する
こともできる。（３）値A, B, Nのビット数を変更する。例えば、2048
ビットや、512ビット、768ビット等に変更することもで
きる。（４） kの値を2でない値、例えば、1，3，4等の値に変
更することもできる。但し、nがkの倍数に成っていない
場合は、修正する為の論理回路が必要である。（５）テーブルにおいて、0をなくし、代わりに、Wに
加える値が0のときは、Wの値を加算器に入力せず、直接
シフタに入力するような論理回路を追加する。上記の変更は、本質的に本発明の趣旨である「加算用テ
ーブルの合成」と無関係に行うことができる。次に、上
記のような変更とは異なり、若干の注意を要するものに
ついて説明する。再び、アルゴリズム８を示す。 [アルゴリズム８（再掲）] For j = 0 to 2^k-1 step 1{ ステップ１ For s = 0 to 2^k-1 step 1{ ステップ２ U = (s + j*B[0, k]) mod 2^k ステップ３ M =−U*N[0,k]^(-1) mod 2^k ステップ４ TBL[ j , s ] = j*B + M*N ステップ５ } } W = 0 ステップ６ For i = 0 to n/k-1 step 1｛ステップ７ W = W + TBL(A[i, k], W[0, k]) ステップ８ W = W/2^k ステップ９ } If W >= N then W = W N ステップ１０ Output W ステップ１１ここで、テーブルTBL[j, s]の値は、最大で、(2^k-1)*B
+ (2^k-1)*Nとなる。この値は、最大で、n+k+1ビット
となるので、kの値に応じて、TBL[j, s]のビット長を増
やさなければならない。データ数は、4^k個あるので、k
に伴うデータの増加は、k*4^kビットに達する。一般にk
は大きな値ではないので、本発明を実現する上で大きな
問題になることはないが、望ましい現象ではない。この
問題を避ける方法が存在する。ステップ８及び９に着目
する。ステップ８における、TBL(A[i, k], W[0, k])の
値は、ステップ８の和の下位kビットが、必ず0になるよ
うにが選ばれている。そして、ステップ９にて、そのk
ビットを右シフトによって切り落としている。これは、
下位kビットの計算が、全く不要であることを示してい
る。従って、テーブルの値そのものも、下位kビットを
切り落としたもの（kビット右シフトしたもの）を用意
するだけでよいことがわかる。具体的には、値「j*B +
t*N」を用いる代わりに、これをkビットシフトした値
「(j*B + t*N)>>k」を新しいテーブルと定義して、アル
ゴリズム８と同様の計算を行う。但し、Wの値は、加算
処理の前にkビット右シフトしておく。すなわち、以下
のアルゴリズムに従って計算を行う。 [アルゴリズム９] For j = 0 to 2^k-1 step 1{ ステップ１ For s = 0 to 2^k-1 step 1{ ステップ２ U = (s + j*B[0, k]) mod 2^k ステップ３ M =−U*N[0,k]^(-1) mod 2^k ステップ４ NTBL[ j , s ] = (j*B + M*N)>>k ステップ５ } } W = 0 ステップ６ For i = 0 to n/k-1 step 1｛ステップ７ W = W/2^k ステップ８ W = W + NTBL(A[i, k], W[0, k]) ステップ９ } If W >= N then W = W N ステップ１０ Output W ステップ１１ここで、ステップ８は、実装上は、シフタを利用する必
要はない。右シフトは、常にkビットだけ行われるの
で、これを直接kビットシフトに対応するように配線す
る。下位kビットの和の計算は行わない。但し、Wの下位
kビットのうち、少なくとも１つが1になる場合、すなわ
ち、下位kビットのそれぞれの論理和(logicalOR)が1で
ある場合、これをキャリーC[-1]として加算器に入力す
る。実際、前記論理和が0の場合は、当該kビット全てが
0であることを示しており、これに対応するMの下位kビ
ットも0でなければならないが、前記論理和が1である場
合、Mとの和の下位kビットが全て0になるようにMを選ぶ
と、必ずキャリーが発生するからである。これらを考慮
した加算処理の回路の実施例を図15に示す。本実施例
は、k=2の場合である。図15に示す回路は、NTBLの値を
格納するNTBLレジスタ1504と、Wの値を格納するWレジス
タ1503が加算器1502に接続されているものである。Wレ
ジスタの下位2ビットは、加算器1502には接続されてお
らず、代わりに、値選択及び信号制御装置に接続されて
おり、この2ビットは、NTBLの値を選択するのに用いら
れる。さらに、前記2ビットはそれぞれ、論理和を計算
するORゲート1501に接続されている。ORゲート1501の計
算値は、キャリー信号C[-1]として加算器1502に入力さ
れる。前記キャリー信号C[-1]は、加算を行う際のキャ
リーの初期値として利用される。これは本発明4,7の実
施例の一つである。次に、ガロア体GF(2^n)に対し、以
下のアルゴリズム７（再掲）に従って動作する本発明の
実施例を説明する。 [アルゴリズム７（再掲）] For j = 0 to 2^k 1 step +1{ ステップ１ TBL(H[j](X), B(X)) = H[j](X)*B(X) ステップ２ TBL(H[j](X), F(X)) = H[j](X)*F(X) ステップ３ } W(X) = 0 ステップ４ For j = 0 to n/k-1 step +1{ ステップ５ T(X) = W(X) + TBL(A[j,k](X), B(X)) ステップ６ U(X) = T(X) mod X^k ステップ７ M[j,k](X) = U(X)*F[0,k](X)^(-1) mod X^k ステップ８ W(X) = W(X) + TBL(M[j,k](X), F(X)) ステップ９ W(X) = W(X)/X^k ステップ１０ } ガロア体GF(2^n)における演算については、BEST MODE FOR CARRYING OUT THE INVENTION The above-mentioned algorithm 8 will be described below again. Hereinafter, a method of forming a circuit according to this algorithm will be described. [Algorithm 8 (repost)] For j = 0 to 2 ^ k-1 step 1 {Step 1 For s = 0 to 2 ^ k-1 step 1 {Step 2 U = (s + j * B [0, k] ) mod 2 ^ k step 3 M = −U * N [0, k] ^ (-1) mod 2 ^ k step 4 TBL [j, s] = j * B + M * N step 5}} W = 0 Step 6 For i = 0 to n / k-1 step 1 {Step 7 W = W + TBL (A [i, k], W [0, k]) Step 8 W = W / 2 ^ k Step 9} If W> = N then W = WN Step 10 Output W Step 11 The processing of Algorithm 7 is roughly divided into three parts. In other words, there are three: the "table generation unit" from Step 1 to Step 5, the "addition processing unit" from Step 6 to Step 9, and the "value correction unit" from Steps 10 and 11. First, the "addition processing unit", which has high generality
And then the “value correction unit” and finally the “table generation unit”, and the circuit configuration in which these are integrated will be described. There are various known adder construction methods.
Here, a typical one will be described. Adder is 1
It is configured by arranging adders for each bit. Now, a circuit for adding two m-bit numbers S and T is constructed. In the following description, each binary number S = (S [m-1] S [m-
2] ... S [1] S [0]), T = (T [m-1] T [m-2] ... T [1] T [0]) and their sum Z = (Z [ m] Z [m-1] ... Z [1] Z [0]) is used. When the addition process is broken down into bit units, the values required for the addition of the jth bit are S [j], T [j] and the carry C [j-1] generated in the previous calculation. is there. However, C [-1] = 0 and Z
[m] = C [m-1]. FIG. 4 is an example of a full adder that adds 1 bit. The j-th 1-bit full adder has S [j], T [j] and the carry signal C [j-
[1] can be regarded as a circuit which outputs the value z [j] of the j-th bit of Z and the carry signal C [j]. The circuit of FIG. 4 has a logical formula: (Formula 4) Z [j] = (S [j] AND T [j] AND C [j-1]) OR (S
[j] AND NOT (T [j]) AND NOT (C [j-1])) OR (NOT (S [j]) A
ND T [j] AND NOT (C [j-1])) OR (NOT (S [j]) AND NOT (T
[j]) AND C [j-1]), (Equation 5) C [j] = NOT ((S [j] AND NOT (T [j]) AND NOT (C
[j-1])) OR (NOT (S [j]) AND T [j] AND NOT (C [j-1])) OR
(NOT (S [j]) AND NOT (T [j]) AND C [j-1]) OR (NOT (S
[j]) AND NOT (T [j]) AND NOT (C [j-1])). 401, 402, 403, 404 of FIG.
405 is an AND circuit, 406, 407 are OR circuits, 409, 410,
411, 412, 413, 414, 415, 416, 417 are inverters (N
OT circuit). These correspond to AND, OR, and NOT in (Equation 4) and (Equation 5), respectively. Each of AND, OR, and NOT is a logic gate composed of several transistors.
(logic gate). AND is only when all inputs are 1
It is a logic gate that outputs 1 and outputs 0 otherwise.OR is a logic gate that outputs 1 if any one of them is 1 and outputs 0 only when all 0s are input. is there.
NOT (inverter) is a logic gate that outputs 1 for an input 0 and outputs 0 for an input 1. By arranging 1-bit full adders, a multi-bit full adder can be created. FIG. 5 shows a 4-bit full adder. 50
Reference numerals 1, 502, 503, and 504 are 1-bit full adders shown in FIG. 4, and are calculated according to (Equation 4) and (Equation 5). Carry is set to 0 in the least significant bit. The last carry is Z [4]. Although the case of 4 bits is shown here, it is obvious that any number of bits can be added by increasing the number of 1-bit full adders. The configuration method of the full adder shown here is the simplest one called "carry propagation adder", and carry propagation time (carry propagation time)
Is a circuit with a large ratio and takes the longest addition time. The hardware algorithm for faster addition is
Although there are many, the method of constructing the adder has nothing to do with the essence of the present invention, and therefore, further explanation is omitted. Next, the processing of the value correction unit will be described. In the value correction section,
A subtraction process is used. Subtractor
er) has a structure similar to that of an adder, and can be configured by using a carry propagation adder. Therefore, an ordinary computer has an adder circuit but not a subtractor circuit. FIG. 6 is a configuration example of a 4-bit subtraction circuit (Z = ST) using a full adder. Although 601, 602, 603, and 604 are 1-bit full adders, the first carry (borrow) is 1 at the time of input, and T is the inverter 605, 6
It is different in that it is inverted depending on 06, 607, 608. In subtraction, when the last carry (borrow) becomes 1, it means that the result is negative. This phenomenon is called underflow. A configuration example of the value correction unit will be described using the concept of the subtractor described above. Figure 7
FIG. 4 is a block diagram of the value correction circuit. n
The + 1-bit register 702 contains the value W and the n-bit register 703 contains the value of the modulus N, both of which
WN is calculated by the n + 1-bit subtractor 704. This value WN, a borrow selection signal (selection signal) generated as a result of this operation, and the value of W are selected by the selector 705.
If the selection signal is 0, the selector 705
If N is selected and output, and the selection signal is 1, W is selected and output. Here, W and N have been described as dedicated registers, but it goes without saying that they can be arranged in the RAM used for CPU processing. Next, the circuit of the table generation unit will be described. K = 2 for simplicity
Only shown in case of. When the values generated by the table generation unit when k = 2 are listed, the table shown in FIG. 8 is obtained. In this configuration example, the multiples N, 2 * N, 3 * N of the modulus value N independent of the input values A, B are calculated in advance and stored in the registers 909, 910, 911. , Use this. Since the input value B is the input value itself,
No need to calculate. B is held in the register 906. Eleven values other than 0 and the above four values B, N, 2 * N, 3 * N must be calculated each time. Here, the configuration has been shown in which the 11 values are calculated using only the adder. The size of the register must be at least (k + 1 =) 3 bits larger. Each register has a data bus 92
The data bus 921 is connected to 1 and is used as a signal line for supplying data to the addition processing unit. Reference numeral 904 denotes an adder that adds the number of n + 2 bits, and sends the value of the sum of n + 3 bits to the selector 903. Upon receiving the calculation start signal determined from the clock signal, the control device 905 initializes the internal counter to 0001 in binary notation, and inputs the value of B to the adder 904 (both) via the selector 901. As adder 904
To operate. Immediately before the calculation is started in the adder 904, a control signal is sent from the adder 904 to the control device 905. Control device
905 receives the control signal, increments a counter therein, and obtains a value of 0010 in binary notation,
These lower 2 bits 00 and upper 2 bits are respectively j * B + t *
As j and t in N, the selector 903 determines the position of the register 907 in which the answer of the adder 904 is written, and writes the addition result. Next, the calculation start signal is transmitted again to the control device 905, and the control device 905 sends a control signal to the selector 901, reads B and 2 * B, inputs them to the adder 904, and performs the same operation as before. Write the value 3 * B to register 908. By the operation up to here, the value for composing the table, B, 2 * B, 3 * B,
N, 2 * N, 3 * N are all available. Similarly, the register to be read and the register to be written are determined from the value of the counter of the control device, and the same operation is repeated. However, when the lower 2 bits of the counter are 00, the addition operation is skipped. In this way, after all the values are set,
Register 906, 907, 908, 909, 910, 911, 912, 913, 91
4, 915, 916, 917, 918, 919, 920 are used as read-only registers. Next, FIG. 10 shows a block diagram of a circuit in which these are combined to calculate A * B * R ^ (-1) mod N and the result is written in the A register. However, the value adjusting section is omitted to avoid complication. In general, the subtraction processing required by the value adjusting unit is realized by changing the input value of the adder 1015. FIG. 10 includes all the important components of the present invention. Hereinafter, the same variable names as those used in Algorithm 8 are used. In FIG. 10, first, the value of A stored in the A register 1003 is sequentially read from the lower order. The structure of the A register is shown in FIG. At this time, the control device 1007 receives the number i of the 2-bit block to be read from the 9-bit register 1006,
The value of the corresponding 2-bit block is read out and transmitted to the register 1008. The lower 2 bits of W are stored in the register 1009. The values of 1008 and 1009 are transmitted to the value selection and signal control device 1011, and 1011 reads the corresponding data from the value table of 1002 via the data bus 1001, and adds the values of 1027 bit and 1026 bit. Send to adder 1015. However, dedicated registers 1013 and 1014 are connected to the adder 1015, and the value read from the table is temporarily held in the 1027-bit register 1014. Reference numeral 1013 is a 1026-bit register in which W appearing in Algorithm 7 is stored. It is cleared to 0 at the start of operation. J * B + t * N in value table 1002
The sum of the data of the form and W is added by the adder 1015. Since the lower 2 bits of this addition result are 00 in binary notation, they are cut off by the 2-bit right shifter 1016 and input to the control device 1017. Meanwhile, the adder 1015
Simultaneously with the calculation start, the value 1 is sent to the adder 1018, and the adder sums the value of the 9-bit register 1019 and the calculation start signal “1” and sends the result to the 9-bit register 1019. Initially 0 is stored in 1019. This sum is sent to the control device 1017, and the control device 1017 determines that the sum is 51
If it has not reached 2, the output result of 1016 is written to the register 1013 as it is, and as for the value of 1013, only the lower 2 bits of W are extracted at 1011 and written to the register 1009. Further, at this time, a signal is sent from 1017 to 1007 via 1011, the value of i is incremented by the adder 1005, and the next 2-bit block of the A register is read,
The addition with the value in the table is performed through the same process as described above. If the value of 1019 has reached 512, it sends a signal to the selective transmission circuit 1012 and sends the value of W in 1013 to the A register. The value of the A register at this point is A * B * R ^ (-1) mo
d N or A * B * R ^ (-1) mod N + N. The latter requires the process of subtracting N from A. The subtraction processing unit is omitted. This is one of the embodiments of claims 1 and 7 of the present invention. The above embodiment can be variously modified without departing from the scope of the present invention. For example, the following changes are possible. (1) The adder 1015 is configured to add 1027 bits at the same time here, but it can also be realized by dividing this into a plurality of blocks and sequentially adding these for each block. it can. In this case, the amount of hardware decreases, but the speed decreases. (2) Without using the 2-bit left shifter 1016, only the bits higher than the lower 2 bits can be directly connected to the register of 1013. (3) Change the number of bits of the values A, B, N. For example, 2048
It can be changed to bits, 512 bits, 768 bits, etc. (4) The value of k can be changed to a value other than 2, for example, 1, 3, 4 or the like. However, if n is not a multiple of k, a logic circuit for correction is required. (5) In the table, eliminate 0, and instead, when the value added to W is 0, add a logic circuit that directly inputs the value of W to the shifter without inputting it to the adder. The above changes can be made independently of "synthesis of addition tables" which is essentially the gist of the present invention. Next, different from the above-mentioned changes, those requiring some attention will be described. Again, Algorithm 8 is shown. [Algorithm 8 (repost)] For j = 0 to 2 ^ k-1 step 1 {Step 1 For s = 0 to 2 ^ k-1 step 1 {Step 2 U = (s + j * B [0, k] ) mod 2 ^ k step 3 M = −U * N [0, k] ^ (-1) mod 2 ^ k step 4 TBL [j, s] = j * B + M * N step 5}} W = 0 Step 6 For i = 0 to n / k-1 step 1 {Step 7 W = W + TBL (A [i, k], W [0, k]) Step 8 W = W / 2 ^ k Step 9} If W> = N then W = WN Step 10 Output W Step 11 Here, the maximum value of the table TBL [j, s] is (2 ^ k-1) * B.
+ (2 ^ k-1) * N. Since this value has a maximum of n + k + 1 bits, the bit length of TBL [j, s] must be increased according to the value of k. Since the number of data is 4 ^ k, k
The amount of data increase with is up to k * 4 ^ k bits. Generally k
Is not a large value, so it does not pose a big problem in implementing the present invention, but it is not a desirable phenomenon. There are ways to avoid this problem. Attention is paid to steps 8 and 9. The value of TBL (A [i, k], W [0, k]) in step 8 is selected so that the lower k bits of the sum in step 8 are always 0. Then, in step 9, that k
Bits are cut off by right shifting. this is,
It shows that the calculation of the lower k bits is completely unnecessary. Therefore, it is understood that the table values themselves need only be prepared by cutting off the lower k bits (shifted by k bits to the right). Specifically, the value “j * B +
Instead of using "t * N", a value "(j * B + t * N) >>k" obtained by shifting this by k bits is defined as a new table, and the same calculation as algorithm 8 is performed. However, the value of W is right-shifted by k bits before the addition processing. That is, the calculation is performed according to the following algorithm. [Algorithm 9] For j = 0 to 2 ^ k-1 step 1 {Step 1 For s = 0 to 2 ^ k-1 step 1 {Step 2 U = (s + j * B [0, k]) mod 2 ^ k Step 3 M = −U * N [0, k] ^ (-1) mod 2 ^ k Step 4 NTBL [j, s] = (j * B + M * N) >> k Step 5}} W = 0 Step 6 For i = 0 to n / k-1 step 1 {Step 7 W = W / 2 ^ k Step 8 W = W + NTBL (A [i, k], W [0, k]) Step 9 } If W> = N then W = WN Step 10 Output W Step 11 Here, step 8 does not need to use a shifter in terms of implementation. The right shift is always performed by k bits, so this is wired so as to directly correspond to the k bit shift. The sum of the lower k bits is not calculated. However, lower than W
When at least one of the k bits becomes 1, that is, when the logical OR of each of the lower k bits is 1, this is input to the adder as a carry C [-1]. In fact, if the OR is 0, all k bits are
It is shown that 0, and the corresponding lower k bits of M must also be 0. However, when the logical sum is 1, all the lower k bits of the sum with M are 0. This is because a carry always occurs when M is selected. FIG. 15 shows an embodiment of a circuit for addition processing in consideration of these. In this embodiment, k = 2. In the circuit shown in FIG. 15, an NTBL register 1504 for storing the value of NTBL and a W register 1503 for storing the value of W are connected to the adder 1502. The lower two bits of the W register are not connected to the adder 1502, but instead to the value selection and signal controller, which two bits are used to select the value of NTBL. Further, each of the two bits is connected to an OR gate 1501 that calculates a logical sum. The calculated value of the OR gate 1501 is input to the adder 1502 as the carry signal C [-1]. The carry signal C [-1] is used as an initial value of carry when performing addition. This is one of the embodiments of the present inventions 4 and 7. Next, an embodiment of the present invention which operates according to the following algorithm 7 (reprinted) for the Galois field GF (2 ^ n) will be described. [Algorithm 7 (repost)] For j = 0 to 2 ^ k 1 step +1 {Step 1 TBL (H [j] (X), B (X)) = H [j] (X) * B (X) Step 2 TBL (H [j] (X), F (X)) = H [j] (X) * F (X) Step 3} W (X) = 0 Step 4 For j = 0 to n / k- 1 step +1 {step 5 T (X) = W (X) + TBL (A [j, k] (X), B (X)) step 6 U (X) = T (X) mod X ^ k step 7 M [j, k] (X) = U (X) * F [0, k] (X) ^ (-1) mod X ^ k Step 8 W (X) = W (X) + TBL (M [ j, k] (X), F (X)) Step 9 W (X) = W (X) / X ^ k Step 10} For operations in the Galois field GF (2 ^ n),

【従来の技術】において説明した通りである。従って、
先に説明した本発明の第一の実施形態における回路にお
ける加算処理を全てビット毎の排他的論理和に変更する
ことによって実現される。積演算の処理は、シフト処理
と加算処理で構成されているので、加算処理の部分をビ
ット毎の排他的論理和の処理に置換えるだけでよい。勿
論、テーブルの構成においても、和を排他的論理和に置
換えることは言うまでもない。但し、通常のモンゴメリ
法における法(modulus)Nがnビットだったのに対し、n次
の多項式を使う為、n+1ビットの表現を必要とするとい
う点が異なる。又、ガロア体GF(2^n)の場合は、最後の
値修正処理は不要である。その他の構成は、通常のモン
ゴメリ法の回路と全く同一の構成でよい。これを図１２
に示す。但し、ここでは、k=2とし, reduction polynom
ial F(X)の次数(degree)は、257であるものとする。以
下、変数名は、アルゴリズム７で用いたものと同じもの
を利用する。図12において、まず、Aレジスタ1203に格
納されたAの値を下位から順に読み出す。Aレジスタの構
成は、図１１に示されているものと同一である。この
際、制御装置1207は、読み出す2ビットブロックの番号i
を7ビットレジスタ1206から受け取って、該当する2ビッ
トブロックの値を読み出し、これをレジスタ1208に送信
する。レジスタ1209には、Wの下位2ビットが格納されて
いる。1208，1209の値が、値選択及び信号制御装置1211
に送信され、1211は、該当するデータをデータバス1201
を介して1202の値テーブル（テーブル生成については後
に説明する）から読み出し、値を258ビットと256ビット
の値の排他的論理和を行う排他的論理和計算器1215に送
る。但し、前記排他的論理和計算器1215には、専用のレ
ジスタ1213，1214が接続されており、テーブルから読み
出された値は、一時的に258ビットのレジスタ1214に保
持される。1213は、258ビットのレジスタであり、アル
ゴリズム７に現れるW(X)が格納される。動作開始時には
0にクリアされている。値テーブル1202に含まれる
「（一次以下の多項式）*B(X) + （一次以下の多項式）
*F(X)という形をしたデータ（図１３参照）とW(X)の排
他的論理和が排他的論理和計算器1215で排他的論理和の
計算がなされる。この結果は、下位2ビットが二進数表
示で00になっているので、これを2ビット右シフタ1216
で切り落とし、制御装置1217に入力される。一方、排他
的論理和計算器1215は、計算開始と同時に、値1を加算
器1218に送り、加算器は、7ビットレジスタ1219の値
と、計算開始信号「１」を合計し、その結果を7ビット
レジスタ1219に送る。1219には、最初0が格納されてい
る。この和は、制御装置1217に送られ、制御装置1217
は、前記の和が、128に達していなければ、1216の出力
結果をそのままレジスタ1213に書き込み、1213の値は、
1211にてW(X)の下位2ビットのみが取り出されて、レジ
スタ1209に書き込まれる。又、この際に、1217から、12
11を経由して1207に信号が送られ、排他的論理和計算器
1205によりiの値がインクリメントされ、Aレジスタの次
の2ビットブロックが読み出され、先に延べたのと同じ
過程を経てテーブルの値との排他的論理和の計算が行わ
れる。1219の値が、128に達していれば、選択的透過回
路1212に信号を送り、1213にあるW(X)の値をAレジスタ
に送る。この時点でのAレジスタの値は、A(X)*B(X)*R
(X)^(-1) mod F(X)である。テーブル生成回路も、加算
処理を排他的論理和処理に置換えることにより、実現さ
れる。これを図１４に示す。この構成例では、入力値A
(X)，B(X)に依存しないreduction polynomial F(X)の倍
元、F(X), XF(X)， (X+1)F(X)については、あらかじめ
計算して、これをレジスタ1409，1410，1411に保持して
おき、これを用いる。また、入力値B(X)は、入力値その
ものであるから、計算する必要はない。B(X)は、レジス
タ1406に保持されている。0及び、上記の４つの値B(X),
F(X), XF(X)，(X+1)F(X)以外の11個の値は、毎回計算
しなければならない。ここでは、排他的論理和計算器の
みを利用して、前記11個の値を計算する構成を示した。
レジスタの大きさは、最大で（k=）2ビット大きく取ら
なければならない。各レジスタは、データバス1421に接
続され、前記データバス1421は、排他的論理和処理部に
データを供給する信号線として利用される。1404は、25
8ビットの数の加算を行う加算器であり、258ビットの和
の値（GF(2^n)では桁上がりは生じないので、ビット数
は増えないことに注意）をセレクタ1403に送る。制御装
置1405は、クロック信号から決定された計算開始信号を
受けて内部のカウンタを二進数表示で、0001に初期化
し、セレクタ1401を経由して、B(X)の値を加算器1404の
入力（両方）として排他的論理和計算器1404を動作させ
る。排他的論理和計算器1404で計算が始まる直前に、前
記排他的論理和計算器1404から制御装置1405に制御信号
を送る。制御装置1405は、前記制御信号を受けて、その
内部のカウンタをインクリメントし、二進数表示で、00
10という値を得、この下位2ビット00と上位2ビットをそ
れぞれ、j(X)*B(X) + t(X)*F(X)におけるj(X)、ｔ(X)と
して、セレクタ1403が、排他的論理和計算器1404の答を
書き込むレジスタ1407の位置を決定し、前記排他的論理
和の結果を書き込む。次に、再び計算開始信号が制御装
置1405に送信され、制御装置1405は、セレクタ1401に制
御信号を送り、B(X)とXB(X)を読み出して、排他的論理
和計算器1404に入力し、先と同じ動作をしてレジスタ14
08に値(X+1)B(X)を書き込む。ここまでの動作で、テー
ブルを合成する為の値、B(X), XB(X), (X+1)B(X), F
(X), XF(X), (X+1)F(X)が全て揃ったことになる。以下
同様に制御装置のカウンタの値から、読み出すレジスタ
と書き込むレジスタを決定し、同様の動作を繰り返す。
但し、カウンタの下位2ビットが00のときは、加算動作
は、スキップするものとする。こうして、全ての値が揃
った後は、レジスタ1406, 1407, 1408, 1409, 1410, 14
11, 1412, 1413, 1414, 1415, 1416, 1417, 1418, 141
9, 1420は、読み出し専用のレジスタとして利用され
る。これは、本発明請求項2，7の実施例の一つである。
上記実施例は、本発明の要旨を逸脱しない範囲において
種々変更可能である。例えば、以下のような変更が考え
られる。（１）排他的論理和計算器1215は、ここでは、258ビッ
トを同時に加算する構成になっているが、これを複数の
ブロックに分割して、これらをブロック毎に順に加算し
ていくことにより実現することもできる。この場合、ハ
ード量は減るが、速度は落ちる。（２） 2ビット左シフタ1216を用いないで、直接1213の
レジスタに下位2ビットより上位のビットのみ接続する
こともできる。（３）値A(X), B(X), F(X)のビット数を変更する。（４） kの値を2でない値、例えば、1，3，4等の値に変
更することもできる。但し、nがkの倍数に成っていない
場合は、修正する為の論理回路が必要である。（５）テーブルにおいて、0をなくし、代わりに、W(X)
に加える値が0のときは、W(X)の値を排他的論理和計算
器に入力せず、直接シフタに入力するような論理回路を
追加する。上記の変更は、本質的に本発明の趣旨である「値テーブ
ルの合成」と無関係に行うことができる。ガロア体にお
いても、通常のモンゴメリ法の処理の場合と同じように
して、テーブルサイズをkビット減らすことができる。
キャリー及び、最後の値調整の為の減算処理を除いて考
え方はアルゴリズム９と全く同一である。以下に、その
アルゴリズムを示す。 [アルゴリズム１０] For j = 0 to 2^k 1 step +1{ ステップ１ NTBL(H[j](X), B(X)) = H[j](X)*B(X)>>k ステップ２ NTBL(H[j](X), F(X)) = H[j](X)*F(X)>>k ステップ３ } W(X) = 0 ステップ４ For j = 0 to n/k-1 step +1{ ステップ５ T(X) = W(X) + TBL(A[j,k](X), B(X)) ステップ６ U(X) = T(X) mod X^k ステップ７ M[j,k](X) = U(X)*F[0,k](X)^(-1) mod X^k ステップ８ W(X) = W(X) + NTBL(M[j,k](X), F(X)) ステップ９ W(X) = W(X)/X^k ステップ１０ } ここで、上記M[j,k](X)の決定については、処理中のW
(X)の下位のkビットをそのままとればよく、キャリーは
発生しない点が、通常のモンゴメリ法と異なる（但し、
数学的表現は、ほぼ同じ）。図１５に対応するガロア体
の計算回路を図１６に示す。図１５における回路との違
いは、加算器が排他的論理和計算器1601になっている点
と、キャリー信号を生成するOR回路がないという点であ
る。これは、本発明請求項４，７の実施例の一つであ
る。ここまでは、剰余乗算に対する本発明の実施例と、
ガロア体G(2^n)に対する本発明の実施例を別々に示した
が、両者を融合することができる。RSA暗号と楕円曲線
暗号で必要とされるデータ長は著しく異なる。一般に、
RSA暗号でのNのビット数は、十分な安全性を有する為に
は、1024ビットから2048ビット程度が必要である。一
方、RSA暗号と同等の強度を有する楕円曲線暗号でのデ
ータ長は、160ビットから256ビット程度である。従っ
て、RSAの回路の一部分をGF(2^n)の演算器として利用す
るという構成が可能となる。GF(2^n)の和演算が、通常
の加算処理において、キャリーを全て0とおいたものと
一致していることは、以下のようにすればわかる。例え
ば、図４の回路に対応する論理式のうち、第ｊビットを
表現するもの：（式４（再掲）） Z[j] = (S[j] AND T[j] AND C[j-1])
OR (S[j] AND NOT(T[j]) AND NOT(C[j-1])) OR (NOT(S
[j]) AND T[j] AND NOT(C[j-1])) OR (NOT(S[j])AND NO
T(T[j]) AND C[j-1]), において、C[j-1] = 0とおけば、（式６） Z[j] = (S[j] AND T[j] AND 0) OR (S[j] AND
NOT(T[j]) AND NOT(0))OR (NOT(S[j]) AND T[j] AND N
OT(0)) OR (NOT(S[j]) AND NOT(T[j]) AND 0) = (S[j] AND NOT(T[j])) OR (NOT(S[j]) AND T[j]) = S[j] EXOR T[j] を得る。ここで、S[j] EXOR T[j]は、S[j]とT[j]の排他
的論理和である。すなわち、キャリーを全て0にすれ
ば、それは、ビット毎の排他的論理和であることがわか
る。ビット毎の排他的論理和は、GF(2^n)の元Aの多項式
表現：A(X) = A[n-1]*X^(n-1) + A[n-2]*X^(n-2) + …
+ A[1]*X + A[0]を、ビット列：(A[n-1]A[n-2]…A[1]A
[0])に対応させたときの、GF(2^n)における和の操作に
一致していることがわかる。従って、加算器において、
キャリーを0とおく回路を追加して、GF(2^n)の加算（排
他的論理和）を実現することができる。この実施例を図
１７に示す。図１７は、図４の加算器にセレクタを付け
たものである。加算器は連鎖的に接続されており、キャ
リーC[j]は、次の加算器に伝播する。それぞれのセレク
タは、前加算器からのキャリーと、値0を制御信号が1で
あるか0であるかに応じてキャリーか、0を選択する。例
えば、制御信号が1のとき、キャリーを選択し、0のとき
は、値0を選択することにすれば、前者では通常の加算
器、後者ではGF(2^n)の加算器（排他的論理和計算器）
として機能する。これを組み合せて、図１５，１６の回
路を共用する加算器の構成を図18に示す。図18における
キャリー制御機能付き加算器は、図17におけるものと同
じであり、切り替え信号によって、剰余乗算用の加算器
か、ガロア体GF(2^n)の積演算用の加算器として機能す
るかが決定される。これは、本発明請求項６，７の実施
例の一つである。本発明の趣旨は、kの値とは無関係で
あるが、実装上は、kをいくらでも大きく取るわけには
いかない。特にICカード用のマイクロコンピュータにお
いては、RAMのサイズは、数キロバイトから十数キロバ
イト程度であり、kの値は大きく制約される。又、他の
方法との比較においても、本発明のアドバンテージが最
も高くなるのは、k=2の場合であることが、発明者によ
って見出された。従って、これを請求項に含め、請求項
７とした。また、べき乗剰余演算、又は、ガロア体GF(2
^n)の積演算を実現する為のハードウエア構成は、上記
の各種実施の形態に限定されず、適宜変更可能であるこ
とは言うまでもない。As described in "Prior Art". Therefore,
This is realized by changing all the addition processing in the circuit in the first embodiment of the present invention described above to exclusive OR for each bit. Since the process of the product operation is composed of the shift process and the addition process, it is only necessary to replace the part of the addition process with the process of the exclusive OR for each bit. Needless to say, the sum may be replaced with an exclusive OR in the table configuration. However, unlike the normal Montgomery method, which has a modulus N of n bits, a polynomial of degree n is used, and therefore an expression of n + 1 bits is required. Further, in the case of the Galois field GF (2 ^ n), the final value correction processing is unnecessary. Other configurations may be exactly the same as those of the circuit of the normal Montgomery method. This is shown in FIG.
Shown in. However, here, k = 2 and reduction polynom
The degree of ial F (X) shall be 257. Hereinafter, the same variable names as those used in Algorithm 7 are used. In FIG. 12, first, the value of A stored in the A register 1203 is sequentially read from the lower order. The configuration of the A register is the same as that shown in FIG. At this time, the control device 1207 controls the number i of the 2-bit block to be read.
Is read from the 7-bit register 1206, the value of the corresponding 2-bit block is read, and this is transmitted to the register 1208. The lower 2 bits of W are stored in the register 1209. The values of 1208 and 1209 are the value selection and signal control device 1211.
1211 sends the appropriate data to the data bus 1201
Via the value table of 1202 (the table generation will be described later), and sends the value to the exclusive OR calculator 1215 which performs the exclusive OR of the values of 258 bits and 256 bits. However, dedicated registers 1213 and 1214 are connected to the exclusive OR calculator 1215, and the value read from the table is temporarily held in the 258-bit register 1214. Reference numeral 1213 is a 258-bit register in which W (X) appearing in Algorithm 7 is stored. At the start of operation
It is cleared to 0. Included in the value table 1202 "(polynomial of less than or equal to 1) * B (X) + (polynomial of less than or equal to 1)
The exclusive OR calculator 1215 calculates the exclusive OR of the data in the form of * F (X) (see FIG. 13) and W (X). In this result, the lower 2 bits are 00 in binary notation.
Is cut off and input to the control device 1217. On the other hand, the exclusive OR calculator 1215 sends the value 1 to the adder 1218 at the same time as the calculation is started, and the adder sums the value of the 7-bit register 1219 and the calculation start signal “1” and outputs the result. Send to 7-bit register 1219. In 1219, 0 is initially stored. This sum is sent to the controller 1217 and the controller 1217
If the above sum does not reach 128, the output result of 1216 is directly written to the register 1213, and the value of 1213 is
At 1211, only the lower 2 bits of W (X) are extracted and written to the register 1209. Also, at this time, from 1217 to 12
Signaled to 1207 via 11 and exclusive OR calculator
The value of i is incremented by 1205, the next 2-bit block of the A register is read, and the exclusive OR with the value of the table is calculated through the same process as the above. If the value of 1219 reaches 128, it sends a signal to the selective transmission circuit 1212 and sends the value of W (X) in 1213 to the A register. The value of the A register at this point is A (X) * B (X) * R
(X) ^ (-1) mod F (X). The table generation circuit is also realized by replacing the addition process with the exclusive OR process. This is shown in FIG. In this configuration example, input value A
For F (X), XF (X), and (X + 1) F (X), which are multiples of reduction polynomial F (X) that do not depend on (X) and B (X), they are calculated in advance. It is held in registers 1409, 1410, 1411 and used. The input value B (X) does not need to be calculated because it is the input value itself. B (X) is held in the register 1406. 0 and the above four values B (X),
Eleven values other than F (X), XF (X), and (X + 1) F (X) must be calculated each time. Here, the configuration for calculating the 11 values using only the exclusive OR calculator has been shown.
The size of the register must be at least (k =) 2 bits larger. Each register is connected to a data bus 1421, and the data bus 1421 is used as a signal line for supplying data to the exclusive OR processing unit. 1404 is 25
It is an adder that adds 8-bit numbers, and sends the value of the sum of 258 bits (note that the number of bits does not increase because no carry occurs in GF (2 ^ n)) to the selector 1403. The control device 1405 receives the calculation start signal determined from the clock signal, initializes the internal counter in binary notation to 0001, and inputs the value of B (X) to the adder 1404 via the selector 1401. The exclusive OR calculator 1404 is operated as (both). Immediately before the calculation starts in the exclusive OR calculator 1404, a control signal is sent from the exclusive OR calculator 1404 to the control device 1405. Upon receiving the control signal, the control device 1405 increments the internal counter, and displays 00 in binary.
A value of 10 is obtained, and the lower 2 bits 00 and the upper 2 bits are set as j (X) and t (X) in j (X) * B (X) + t (X) * F (X), respectively. 1403 determines the position of the register 1407 in which the result of the exclusive OR calculator 1404 is written, and writes the result of the exclusive OR. Next, the calculation start signal is again transmitted to the control device 1405, and the control device 1405 sends a control signal to the selector 1401, reads B (X) and XB (X), and inputs them to the exclusive OR calculator 1404. And register 14
Write the value (X + 1) B (X) to 08. By the operation up to here, the value for composing the table, B (X), XB (X), (X + 1) B (X), F
(X), XF (X), (X + 1) F (X) are all complete. Similarly, the register to be read and the register to be written are determined from the value of the counter of the control device, and the same operation is repeated.
However, when the lower 2 bits of the counter are 00, the addition operation is skipped. In this way, after all values are set, registers 1406, 1407, 1408, 1409, 1410, 14
11, 1412, 1413, 1414, 1415, 1416, 1417, 1418, 141
9 and 1420 are used as read-only registers. This is one of the embodiments of claims 2 and 7 of the present invention.
The above embodiment can be variously modified without departing from the scope of the present invention. For example, the following changes are possible. (1) The exclusive OR calculator 1215 is configured to add 258 bits at the same time here, but by dividing this into a plurality of blocks and sequentially adding these for each block It can also be realized. In this case, the amount of hardware decreases, but the speed decreases. (2) Without using the 2-bit left shifter 1216, only the bits higher than the lower 2 bits can be directly connected to the register 1213. (3) Change the number of bits of the values A (X), B (X), F (X). (4) The value of k can be changed to a value other than 2, for example, 1, 3, 4 or the like. However, if n is not a multiple of k, a logic circuit for correction is required. (5) In the table, eliminate 0 and use W (X) instead
When the value added to is 0, the value of W (X) is not input to the exclusive OR calculator, but a logic circuit is directly input to the shifter. The above changes can be made independently of the "value table composition" that is essentially the subject matter of the present invention. Also in the Galois field, the table size can be reduced by k bits in the same manner as in the case of the normal Montgomery method processing.
The concept is exactly the same as Algorithm 9 except for carry and subtraction processing for the final value adjustment. The algorithm is shown below. [Algorithm 10] For j = 0 to 2 ^ k 1 step +1 {Step 1 NTBL (H [j] (X), B (X)) = H [j] (X) * B (X) >> k Step 2 NTBL (H [j] (X), F (X)) = H [j] (X) * F (X) >> k Step 3} W (X) = 0 Step 4 For j = 0 to n / k-1 step +1 {Step 5 T (X) = W (X) + TBL (A [j, k] (X), B (X)) Step 6 U (X) = T (X) mod X ^ k Step 7 M [j, k] (X) = U (X) * F [0, k] (X) ^ (-1) mod X ^ k Step 8 W (X) = W (X) + NTBL (M [j, k] (X), F (X)) Step 9 W (X) = W (X) / X ^ k Step 10} Here, regarding the determination of M [j, k] (X) above. Is processing W
Unlike the normal Montgomery method, the carry bit does not occur as long as the lower k bits of (X) are retained (however,
Mathematical expression is almost the same). A Galois field calculation circuit corresponding to FIG. 15 is shown in FIG. The difference from the circuit in FIG. 15 is that the adder is an exclusive OR calculator 1601 and there is no OR circuit that generates a carry signal. This is one of the embodiments of claims 4 and 7 of the present invention. So far, an embodiment of the present invention for modular multiplication, and
Although the examples of the present invention for the Galois field G (2 ^ n) are shown separately, they can be combined. The data length required by the RSA encryption and the elliptic curve encryption are significantly different. In general,
The number of N bits in the RSA encryption needs to be about 1024 bits to 2048 bits in order to have sufficient security. On the other hand, the data length in elliptic curve cryptography having the same strength as the RSA cryptography is about 160 to 256 bits. Therefore, it is possible to use a part of the RSA circuit as a GF (2 ^ n) calculator. It can be seen that the sum operation of GF (2 ^ n) is the same as the carry operation with all carry values set to 0 in the normal addition process, as follows. For example, among the logical expressions corresponding to the circuit of FIG. 4, those expressing the j-th bit: (Equation 4 (repost)) Z [j] = (S [j] AND T [j] AND C [j-1 ])
OR (S [j] AND NOT (T [j]) AND NOT (C [j-1])) OR (NOT (S
[j]) AND T [j] AND NOT (C [j-1])) OR (NOT (S [j]) AND NO
In T (T [j]) AND C [j-1]), if C [j-1] = 0, then (Equation 6) Z [j] = (S [j] AND T [j] AND 0) OR (S [j] AND
NOT (T [j]) AND NOT (0)) OR (NOT (S [j]) AND T [j] AND N
OT (0)) OR (NOT (S [j]) AND NOT (T [j]) AND 0) = (S [j] AND NOT (T [j])) OR (NOT (S [j]) AND T [j]) = S [j] EXOR T [j] is obtained. Here, S [j] EXOR T [j] is the exclusive OR of S [j] and T [j]. That is, if all the carry is set to 0, it is understood that it is an exclusive OR for each bit. The bitwise exclusive OR is the polynomial expression of the element A of GF (2 ^ n): A (X) = A [n-1] * X ^ (n-1) + A [n-2] * X ^ (n-2) +…
+ A [1] * X + A [0] as the bit string: (A [n-1] A [n-2]… A [1] A
It can be seen that it corresponds to the operation of the sum in GF (2 ^ n) when corresponding to [0]). Therefore, in the adder,
It is possible to add GF (2 ^ n) (exclusive OR) by adding a circuit that sets the carry to 0. This embodiment is shown in FIG. FIG. 17 shows the adder of FIG. 4 with a selector. The adders are connected in a chain, and the carry C [j] propagates to the next adder. Each selector selects carry or 0 depending on the carry from the pre-adder and the value 0 depending on whether the control signal is 1 or 0. For example, if the control signal is 1, the carry is selected, and if the control signal is 0, the value 0 is selected. In the former case, a normal adder and in the latter case, a GF (2 ^ n) adder (exclusive OR calculator)
Function as. FIG. 18 shows the configuration of an adder which combines these circuits and shares the circuits of FIGS. The adder with carry control function in FIG. 18 is the same as that in FIG. 17, and functions as an adder for modular multiplication or an adder for product operation of Galois field GF (2 ^ n) depending on the switching signal. Is decided. This is one of the embodiments of claims 6 and 7 of the present invention. The gist of the present invention is irrelevant to the value of k, but in implementation, it is not possible to take k as large as possible. In particular, in a microcomputer for an IC card, the size of RAM is about several kilobytes to ten and several kilobytes, and the value of k is greatly restricted. The inventors have found that the advantage of the present invention is highest when k = 2 in comparison with other methods. Therefore, this is included in the claim, and the claim is defined as claim 7. Also, the modular exponentiation operation or Galois field GF (2
It is needless to say that the hardware configuration for implementing the product operation of ^ n) is not limited to the above-mentioned various embodiments and can be changed appropriately.

【０００６】[0006]

【発明の効果】本願において開示される発明のうち代表
的なものによって得られる効果を簡単に説明すれば、下
記の通りである。すなわち、高速にべき乗剰余演算「Y^
X mod N」、及びガロア体GF(2^n)上の楕円曲線暗号の処
理を実現することができる。また、べき乗剰余演算、又
は、ガロア体GF(2^n)上の積演算の為に上記の専用ハー
ドウエアの実現において、その論理回路規模を最小限に
することができる。更に、上記専用ハードウエアをICカ
ード用マイクロコンピュータと同一の半導体チップに搭
載し、べき乗剰余演算「Y^X mod N」を適用した暗号化・
復号化'(encryption/decryption)、又はガロア体GF(2^
n)上の楕円曲線暗号の暗号化・復号化為のマイクロコン
ピュータを低コストで使いやすく実現することができ
る。又、ICカードだけでなく、GSM等のモバイル端末等
の、低消費電力で暗号処理をすることが必要な装置に対
して同様の効果が見込まれることは言うまでもない。The effects obtained by the typical ones of the inventions disclosed in the present application will be briefly described as follows. That is, the modular exponentiation operation "Y ^
X mod N ”and the processing of the elliptic curve cryptography on the Galois field GF (2 ^ n) can be realized. Further, in the implementation of the above dedicated hardware for the modular exponentiation operation or the product operation on the Galois field GF (2 ^ n), the logic circuit scale can be minimized. Furthermore, the above-mentioned dedicated hardware is mounted on the same semiconductor chip as the IC card microcomputer, and encryption using the modular exponentiation operation "Y ^ X mod N" is performed.
Decryption '(encryption / decryption), or Galois field GF (2 ^
n) A microcomputer for encryption / decryption of the above elliptic curve cryptography can be realized at low cost and with ease of use. Needless to say, the same effect can be expected not only for IC cards but also for devices that require cryptographic processing with low power consumption, such as mobile terminals such as GSM.

[Brief description of drawings]

【図１】ICカードの概観及び、端子。[Figure 1] Overview of IC card and terminals.

【図２】マイクロコンピュータの構成。FIG. 2 is a configuration of a microcomputer.

【図３】モンゴメリ法の原理。FIG. 3 Principle of Montgomery method.

【図４】1ビットの全加算器の例。FIG. 4 shows an example of a 1-bit full adder.

【図５】4ビットの全加算器。FIG. 5 is a 4-bit full adder.

【図６】全加算器を利用した4ビットの減算回路。FIG. 6 is a 4-bit subtraction circuit using a full adder.

【図７】値修正回路。FIG. 7 is a value correction circuit.

【図８】k=2の場合の値テーブル。FIG. 8 is a value table when k = 2.

【図９】k=2の場合の値テーブル生成回路。FIG. 9 is a value table generation circuit when k = 2.

【図１０】本発明の実施例。FIG. 10 is an embodiment of the present invention.

【図１１】Aレジスタ。FIG. 11 is an A register.

【図１２】ガロア体GF(2^n)に対する本発明の実施例。FIG. 12 is an example of the present invention for Galois field GF (2 ^ n).

【図１３】ガロア体GF(2^n)に対するk=2の場合の値テー
ブル。FIG. 13 is a value table for Galois field GF (2 ^ n) when k = 2.

【図１４】ガロア体GF(2^n)に対するk=2の場合の値テー
ブル生成回路。FIG. 14 is a value table generation circuit for Galois field GF (2 ^ n) when k = 2.

【図１５】モンゴメリ剰余乗算における加算処理装置の
実施例。FIG. 15 is an example of an addition processing device in Montgomery modular multiplication.

【図１６】ガロア体GF(2^n)に対するモンゴメリ法にお
ける加算処理装置の実施例。FIG. 16 is an embodiment of an addition processing device in the Montgomery method for Galois field GF (2 ^ n).

【図１７】キャリー制御機能付き加算器。FIG. 17 is an adder with a carry control function.

【図１８】剰余乗算用加算器とガロア体GF(2^n)の積演
算用加算器の共用回路。FIG. 18 is a shared circuit of a modular multiplication adder and a Galois field GF (2 ^ n) product calculation adder.

───────────────────────────────────────────────────── フロントページの続き (72)発明者渡邊高志東京都国分寺市東恋ケ窪一丁目280番地株式会社日立製作所中央研究所内 (72)発明者中田邦彦東京都小平市上水本町五丁目20番１号株式会社日立製作所半導体グループ内 (72)発明者塚本卓東京都小平市上水本町５丁目22番１号株式会社日立超エル・エス・アイ・システムズ内 (72)発明者小林誠治東京都小平市上水本町五丁目20番１号株式会社日立製作所半導体グループ内Ｆターム(参考） 5J104 AA21 NA17 ─────────────────────────────────────────────────── ─── Continued front page (72) Inventor Takashi Watanabe 1-280, Higashi Koikekubo, Kokubunji, Tokyo Central Research Laboratory, Hitachi, Ltd. (72) Inventor Kunihiko Nakata 5-20-1 Kamimizuhonmachi, Kodaira-shi, Tokyo Stock Ceremony Company within Hitachi Semiconductor Group (72) Inventor Takashi Tsukamoto 5-22-1 Kamimizuhonmachi, Kodaira-shi, Tokyo Stock Ceremony Company Hitachi Cho-LS System Within (72) Inventor Seiji Kobayashi 5-20-1 Kamimizuhonmachi, Kodaira-shi, Tokyo Stock Ceremony Company within Hitachi Semiconductor Group F-term (reference) 5J104 AA21 NA17

Claims

[Claims]

1. A modular multiplication A * B * R ^ (-1) m for non-negative integers A, B and an n-bit odd number N.
od N or A * B * R ^ which is congruent with it N
(-1) modN + s * N (s is a non-negative integer, R = 2 ^ n, or less, X ^ Y
Represents the power of X to the power of Y), which is the process required when the value W of () is obtained by the method of Montgomery, where the integer M and the integer W that are equivalent to the remainder multiplication are the unknowns. Diophantine equation A * B +
In the process of obtaining W by sequentially obtaining M from k bits (k is a divisor of n) from the lower bits in M * N = W * R, in A * B of each k-bit block of A. Partial multiplication result j * B (j = 0, 1, 2, ..., 2 ^ k 1) and M * N multiplication result t * N (t = 0, 1, 2 , ...,
2 ^ k 1) and the combined table value TBL (j, t) = j * B + t
* N (j, t = 0, 1, 2, ..., 2 ^ k 1) is prepared, and (1) W + TBL (j, t) of the table value TBL (j, t) is used. A first process for selecting j, t such that the lower k bits are all 0, (2) W + TBL (j, t) for j, t defined in the first process
A second process for obtaining the value of (3) a value obtained by dividing the value of W + TBL (j, t) in the second process by 2 ^ k, or k bits equivalent to 2 ^ k When the right-shifted value is obtained, and this is the third process for updating the value with the next W, the first, second, and third processes are repeatedly executed in that order n / k times, The remainder multiplication result A * B * R ^ (-1)
A data processing device characterized by obtaining mod N + s * N.

2. The data processing apparatus according to claim 1, further comprising means for setting the value of a carry signal generated in the addition processing appearing in the first and second processings to 0.

3. The value TBL (j, t) = j * B + t * N in the table.
Instead of (j, t = 0, 1, 2, ..., 2 ^ k 1), the table value NTBL (j, t) = j * B + t * N (j, t = 0, 1, 2 , ..., 2 ^ k
1) is prepared as a value obtained by shifting TBL (j, t) to the right or in the lower direction by k bits, and the processing for performing the first processing on the original TBL is processing (A), Instead of the second process, when the lower k bits of W are all 0, the value obtained by shifting W to the right or lower by k bits is C (W, k).
If at least one of the lower k bits of W is 1 or more, W is shifted to the right or in the lower direction by k bits, and 1 is added to C (W, k). Then, the process of obtaining the value of C (W, k) + NTBL (j, t) for j, t determined in the process (A) is defined as process (B), and C (W, k) ) + NTBL (j, t)
When the process of updating the value of W as the next W is the process (C), the above processes (A), (B) and (C) are repeatedly executed in that order n / k times to obtain modular multiplication (modular multipli).
cation) A * B * R ^ (-1) mod N or congruent with mod N
The data processing apparatus according to claim 1, wherein a (congruent) A * B * R ^ (-1) mod N + s * N (s is a non-negative integer, R = 2 ^ n) is obtained.

4. The data processing apparatus according to claim 3, further comprising means for setting the value of the carry signal generated in the addition process appearing in the middle of the processes (A) and (B) to 0.

5. The data processing apparatus according to claim 1, wherein k is 2.

6. The data processing apparatus according to claim 3, wherein k is 2.

7. A polynomial A (X) having only 0 and 1 as coefficients,
B (X) and also an irreducible polynomial of degree n with only 0 and 1 in its coefficient and whose constant term is 1.
ial) F (X), modulo multiplication A (X) * B (X) * R (X) ^ (-1) mod F
(X) (however, R (X) = X ^ n) is a process required when the value of Montgomery's method is obtained. M (X), W (X) equivalent to the remainder multiplication are Indefinite equation with unknown polynomial
A (X) * B (X) + M (X) * F (X) = W (X) * R (X) In the process of obtaining the W (X), the M (X) is set to a low degree (degree). K terms
The W is obtained by sequentially obtaining for each (term) (k is a divisor of n).
(X), which is a partial multiplication result j (X) * B (X) of A (X) * B (X) for each k-term block of A (X) (or B (X)). (j (X)
Is a polynomial of order k-1 or less) and M (X) * F for each k-term block of M (X)
The table value TBL (j (X), t (X)) = j (combined with the multiplication result of (X) t (X) * F (X) (t (X) is a polynomial of degree k-1 or less) X) * B (X) +
Prepare t (X) * F (X) (j (X), t (X) is a polynomial of degree less than or equal to k-1), and (1) the table value TBL (j (X), t (X)) Using W
(X) + TBL (j (X), t (X)) The first process that selects j (X), t (X) so that all the coefficients less than or equal to the k-1th degree become 0, (2) For j (X), t (X) determined in the first processing, W (X) + TBL (j (X), t
(3) A second process for obtaining the value of (X), (3) A value obtained by dividing the value of W (X) + TBL (j (X), t (X)) in the second process by X ^ k Alternatively, a value obtained by right-shifting the value of W (X) + TBL (j (X), t (X)) by k bits is obtained, and this value is set as the next W (X) to be the third processing for updating the value. At this time, the remainder multiplication result A (X) * B (X) * R (X) ^ (-1) is obtained by repeatedly executing the first, second and third processes in that order n / k times. A data processing device for obtaining mod F (X), wherein the sum and product here are sums and products in Galois field GF (2 ^ n).

8. The TBL (j (X), t (X)) = j (X) * B (X) + t (X)
* F (X) (j (X), t (X) is a polynomial of degree less than or equal to k-1), instead of the table value NTBL (j (X), t (X)), TBL (j (X (X ), t (X)) values below the k-1 order are set to 0, and this is divided by X ^ k to prepare this, and the process (A) is performed on the original TBL. Processing (A '), C (W (X), k) is replaced by the processing (A'), and the part less than the k-1 order of W (X) is set to 0. As a polynomial divided by k, C (W (X), k) + NTBL (j (X), t (X) for j (X), t (X) determined in the process (A ') ) Processing the calculation of the sum value (B '), C (W (X), k) + NT
When the process of updating the value of BL (j (X), t (X)) as the next W (X) is the process (C '), the process (A'), (B ') and (C') By repeating n / k times in that order, the remainder multiplication (m
odular multiplication) A (X) * B (X) * R (X) ^ (-1) mod F
(X) or A (X) * congruent therewith in law F (X) *
B (X) * R (X) ^ (-1) mod F (X) + S (X) * F (X) (S (X) is any 0
8. The data processing device according to claim 7, wherein a polynomial having only 1 and 1 as a coefficient) is obtained, and the sum here is a sum in a Galois field GF (2 ^ n).