JP2008512797A - Deterministic finite automaton (DFA) processing

Info

Publication number: JP2008512797A
Application number: JP2007531386A
Authority: JP
Inventors: ブーチャード・グレッグ・エイ; カールソン・ディビッド・エイ; ケスラー・リチャード・イー; フセイン・ムハマンド・アール
Original assignee: カビウム・ネットワークス
Priority date: 2004-09-10
Filing date: 2005-09-08
Publication date: 2008-04-24
Also published as: WO2006031659A2; EP1790148B1; US20060069872A1; WO2006031659A3; EP1790148A2; US8392590B2

Abstract

A processor 110 is provided that scans a deterministic finite automaton (DFA) graph in real time using incoming packet data. The processor has at least one processor core 120 and a DFA module 134 that operates asynchronously to the at least one processor core 120. Using packet data stored in cache-coherent memory 130, 108, the DFA module scans at least one DFA graph stored in non-cache memory 118.
[Selected figure] FIG. 1B

Description

Related applications

This application claims the benefit of U.S. Provisional Application No. 60/609,211, filed on September 10, 2004, and U.S. Provisional Application No. 60/669,672, filed on April 8, 2005. The entire contents of each of the above applications are incorporated herein by reference.

開放型システム間相互接続（ＯＳＩ）参照モデルは、伝送媒体を介する通信に使用される７つのネットワークプロトコル層（Ｌ１−Ｌ７）を規定している。上位層（Ｌ４−Ｌ７）はエンド・ツー・エンド通信を表し、下位層（Ｌ１−Ｌ３）は局所的な通信を表す。 The Open Systems Interconnection (OSI) reference model defines seven network protocol layers (L1-L7) that are used for communication over transmission media. The upper layer (L4-L7) represents end-to-end communication, and the lower layer (L1-L3) represents local communication.

Networking application-aware systems (systems in which applications automatically control the network) need to process, filter, and switch a range of L3 through L7 network protocol layers, for example L7 network protocol layers such as Hypertext Transfer Protocol (HTTP) and Simple Mail Transfer Protocol (SMTP), and L4 network protocol layers such as Transmission Control Protocol (TCP). In addition to processing the network protocol layers, networking application-aware systems must simultaneously secure these protocols at wire speed with access-based and content-based security through the L4 to L7 network protocol layers, including firewall, virtual private network (VPN), Secure Sockets Layer (SSL), intrusion detection system (IDS), Internet Protocol Security (IPsec), antivirus (AV), and anti-spam functionality.

Network processors are available for high-throughput L2 and L3 network protocol processing, that is, performing packet processing to forward packets at wire speed. Typically, a general-purpose processor is used to process L4 to L7 network protocols, which require more intelligent processing. For example, Transmission Control Protocol (TCP), an L4 network protocol, requires several compute-intensive tasks, including computing a checksum over the entire payload in the packet, managing TCP segment buffers, and maintaining multiple timers at all times on a per-connection basis. Although a general-purpose processor can perform these compute-intensive tasks, it does not provide sufficient performance to process the data so that it can be forwarded at wire speed.

Furthermore, content-aware applications that examine the content of packets (applications that automatically recognize content) need to search a data stream for expressions that contain both fixed strings and character classes repeated a variable number of times. Several search algorithms are used to perform this task in software. One such algorithm is the deterministic finite automaton (DFA). However, the DFA search algorithm has limitations, such as exponential growth of the graph size with repeated patterns and false matches in the data stream.

Because of these limitations, content processing applications require a significant amount of post-processing of the results generated by pattern search. Post-processing requires qualifying the matched pattern with other connection state information, such as the type of connection, and with certain values in a protocol header included in the packet. It also requires certain other types of compute-intensive qualification. For example, a pattern match may be valid only if it is within a certain position range within the data stream, or if it is followed by another pattern and is within a certain range from the previous pattern, or if it is at or after a specific offset from the previous pattern. For example, regular expression matching combines different operators and single characters, allowing complex expressions to be constructed.

The present invention is directed to increasing the speed at which a processor can process content processing applications. The processor includes at least one processor core and a deterministic finite automaton (DFA) module that operates asynchronously to the at least one processor core. The DFA module scans at least one DFA graph stored in a first memory using packet data stored in a second memory.

The DFA module can include a first memory controller, at least one DFA thread engine, and instruction input logic. The processor core can submit DFA instructions to the DFA module through an instruction queue of the instruction input logic. A DFA instruction can specify which packet data stored in the second memory to use and which DFA graph stored in the first memory to scan. The DFA module can schedule the DFA instruction to a DFA thread engine. The DFA thread engine can fetch the packet data stored in the second memory and issue memory access instructions in response to the fetched packet data.

例えば、第１メモリはノンキャッシュメモリであってもよく、第２メモリはキャッシュコヒーレントメモリであってもよい。ＤＦＡスレッドエンジンは、キャッシュコヒーレントメモリに格納されているパケットデータを一度に１バイトずつ逐次フェッチする。次に、ＤＦＡスレッドエンジンは、キャッシュコヒーレントメモリから受け取るパケットデータのバイト毎にノンキャッシュメモリロード命令を発行して、ノンキャッシュメモリに格納されているＤＦＡグラフの次の状態を走査する。ＤＦＡスレッドエンジンはまた、キャッシュコヒーレントメモリに中間および最終結果を書き込む。 For example, the first memory may be a non-cache memory and the second memory may be a cache coherent memory. The DFA thread engine sequentially fetches packet data stored in the cache coherent memory one byte at a time. Next, the DFA thread engine issues a non-cache memory load instruction for each byte of packet data received from the cache coherent memory, and scans the next state of the DFA graph stored in the non-cache memory. The DFA thread engine also writes intermediate and final results to cache coherent memory.

本発明の上述および他の目的、特徴および利点は、添付図面に示す本発明の好ましい実施形態に関する以下のより詳細な説明から明らかとなるであろう。添付図面では、同一参照符号は異なる図面においても同一部分を指す。図面は、必ずしも縮尺通りではなく、本発明の原理を示すことに重点を置いている。 The foregoing and other objects, features and advantages of the invention will become apparent from the following more detailed description of the preferred embodiment of the invention as illustrated in the accompanying drawings. In the accompanying drawings, the same reference numerals denote the same parts in different drawings. The drawings are not necessarily to scale, emphasis instead being placed on illustrating the principles of the invention.

以下、本発明の好ましい実施形態について説明する。 Hereinafter, preferred embodiments of the present invention will be described.

FIG. 1A is a block diagram illustrating a security system 100 that includes a network service processor 110 according to the principles of the present invention. The security system 100 is a stand-alone system that can switch packets received at one Ethernet (registered trademark) port (GigE) to another Ethernet port (GigE) and perform a plurality of security functions on the received packets before forwarding them. For example, the security system 100 can be used to perform security processing on packets received on a wide area network before forwarding the processed packets to a local area network.

ネットワークサービスプロセッサ１１０は、ハードウェアパケット処理、バッファリング、ワークスケジューリング、順序づけ、同期化、およびキャッシュコヒーレンスサポートを提供することにより、全てのパケット処理タスクの速度を高める。ネットワークサービスプロセッサ１１０は、受信されたパケットにカプセル化されている開放型システム間相互接続ネットワークのＬ２からＬ７層までのプロトコルを処理する。 The network service processor 110 speeds up all packet processing tasks by providing hardware packet processing, buffering, work scheduling, ordering, synchronization, and cache coherence support. The network service processor 110 processes the protocols from the L2 to L7 layers of the open intersystem interconnection network encapsulated in the received packet.

The network service processor 110 receives packets from the Ethernet (registered trademark) ports (GigE) through the physical interfaces (PHY) 104a, 104b, performs L7 to L2 network protocol processing on the received packets, and forwards the processed packets through the physical interfaces 104a, 104b or through the PCI bus 106. The network protocol processing can include processing of network security protocols such as firewall, application firewall, virtual private network (VPN) including IP Security (IPsec) and/or Secure Sockets Layer (SSL), intrusion detection system (IDS), and antivirus (AV).

ネットワークサービスプロセッサ１１０内のダイナミックランダムアクセスメモリ（ＤＲＡＭ）コントローラ１３３（図１Ｂ）は、ネットワークサービスプロセッサ１１０に結合される外部ＤＲＡＭ１０８へのアクセスを制御する。ＤＲＡＭ１０８は、ＰＨＹインタフェース１０４ａ、１０４ｂまたはＰＣＩ拡張（ＰＣＩ−Ｘ）インタフェース１０６から受信されるデータパケットを、ネットワークサービスプロセッサ１１０によって処理するために格納する。 A dynamic random access memory (DRAM) controller 133 (FIG. 1B) within network service processor 110 controls access to external DRAM 108 that is coupled to network service processor 110. The DRAM 108 stores data packets received from the PHY interfaces 104 a, 104 b or the PCI expansion (PCI-X) interface 106 for processing by the network service processor 110.

ネットワークサービスプロセッサ１１０内の低遅延メモリコントローラ３６０（図３Ｂ）は、低遅延メモリ（ＬＬＭ）１１８を制御する。ＬＬＭ１１８は、侵入検知システム（ＩＤＳ）またはアンチウィルス（ＡＶ）アプリケーションに必要とされる正規表現一致を含む、高速検索を可能にするインターネットサービス／セキュリティアプリケーションに使用されることができる。 A low latency memory controller 360 (FIG. 3B) within the network service processor 110 controls the low latency memory (LLM) 118. The LLM 118 can be used in Internet service / security applications that enable fast searches, including regular expression matching required for intrusion detection system (IDS) or antivirus (AV) applications.

A regular expression is a common way of expressing a string-matching pattern. The atomic elements of a regular expression are the single characters to be matched. These can be combined with metacharacter operators that allow a user to express concatenation, alternation, the Kleene star, and so on. Concatenation is used to create multi-character matching patterns from single characters (or substrings), while alternation (|) is used to create patterns that can match any of two or more substrings. The Kleene star (*) allows a pattern to match zero or more occurrences of that pattern in a string. Combining different operators and single characters allows complex expressions to be constructed. For example, the expression (th(is|at)*) matches th, this, that, thisis, thisat, thatis, thatat, and so on.
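The example expression can be checked with a short script. This is only an illustrative sketch using Python's standard `re` module; the helper name `matches` and the rejected sample strings are assumptions added here, not part of the patent text.

```python
import re

# Pattern from the text: "th" followed by zero or more repetitions of "is" or "at".
PATTERN = re.compile(r"th(is|at)*")


def matches(candidate: str) -> bool:
    """Return True when the whole candidate string matches the pattern."""
    return PATTERN.fullmatch(candidate) is not None


if __name__ == "__main__":
    # The first seven strings are the ones listed in the text; the last two are
    # extra samples (assumed here) that should be rejected.
    for sample in ["th", "this", "that", "thisis", "thisat", "thatis", "thatat", "the", "thi"]:
        print(sample, matches(sample))
```

`fullmatch` is used so that the entire string has to fit the pattern; a search for the pattern anywhere inside a longer stream would use `search` instead.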

図１Ｂは、図１Ａに示すネットワークサービスプロセッサ１１０のブロック図である。ネットワークサービスプロセッサ１１０は、図１Ａに関連して説明したとおり、少なくとも１つのプロセッサコア１２０を使用して高いアプリケーション性能を提供する。 FIG. 1B is a block diagram of the network service processor 110 shown in FIG. 1A. Network service processor 110 uses at least one processor core 120 to provide high application performance as described in connection with FIG. 1A.

Packets are received for processing by any one of the GMX/SPX units 122a, 122b through an SPI-4.2 or RGMII interface. Packets can also be received by the PCI interface 124. The GMX/SPX units 122a, 122b preprocess a received packet by checking various fields in the L2 network protocol header included in the packet, and then forward the packet to the packet input unit 126.

The packet input unit 126 performs further preprocessing of the network protocol headers (L3 and L4) included in the received packet. The preprocessing includes checksum checks for Transmission Control Protocol (TCP) / User Datagram Protocol (UDP) (L4 network protocols).

A free pool allocator (FPA) 128 maintains pools of pointers to free memory in the level 2 cache memory 130 and DRAM 108. The packet input unit 126 uses one of the pools of pointers to store received packet data in the level 2 cache memory 130 and another pool of pointers to allocate work queue entries for the processor cores 120.

パケット入力ユニット１２６は次に、レベル２キャッシュ１３０またはＤＲＡＭ１０８内のバッファにパケットデータを書き込む。この書込みのフォーマットは、上位層のネットワークプロトコルによるその後の処理を容易にするために、プロセッサコア１２０のうちの少なくとも１つで実行される上位層ソフトウェアに好都合なフォーマットである。 The packet input unit 126 then writes the packet data to the level 2 cache 130 or a buffer in the DRAM 108. This writing format is convenient for higher layer software running on at least one of the processor cores 120 to facilitate subsequent processing by higher layer network protocols.

Ｉ／Ｏインタフェース１３６は、全体のプロトコルおよびアービトレーション（調停）を管理し、コヒーレントなＩ／Ｏ区分（partitioning）を提供する。Ｉ／Ｏインタフェース１３６は、Ｉ／Ｏブリッジ（ＩＯＢ）１３８およびフェッチ・アンド・アッドユニット（ＦＡＵ）１４０を有する。フェッチ・アンド・アッドユニット１４０内のレジスタを用いて、パケット出力ユニット１４６を介して処理されたパケットを転送するのに使用される出力待ち行列の長さを保持する。Ｉ／Ｏブリッジ１３８は、コヒーレントメモリバス１４４、Ｉ／Ｏバス１４２、パケット入力ユニット１２６、およびパケット出力ユニット１４６間で転送される情報を格納するためのバッファ待ち行列を含む。 The I / O interface 136 manages the overall protocol and arbitration and provides coherent I / O partitioning. The I / O interface 136 includes an I / O bridge (IOB) 138 and a fetch and add unit (FAU) 140. A register in the fetch and add unit 140 is used to hold the length of the output queue used to transfer the processed packet through the packet output unit 146. The I / O bridge 138 includes a buffer queue for storing information transferred between the coherent memory bus 144, the I / O bus 142, the packet input unit 126, and the packet output unit 146.

パケット順序／ワークモジュール（ＰＯＷ）１４８は、プロセッサコア１２０に対するワークをキューイングして（待ち行列に入れて）スケジューリングする。ワーク待ち行列エントリを待ち行列に追加することによって、ワークがキューイングされる（待ち行列に入れられる）。例えば、ワーク待ち行列エントリは、パケットの到着毎にパケット入力ユニット１２６によって追加される。プロセッサコア１２０のワークのスケジューリングには、タイマユニット１５０が使用される。 The packet order / work module (POW) 148 queues (queues) work for the processor core 120 and schedules it. A work is queued by adding a work queue entry to the queue. For example, a work queue entry is added by the packet input unit 126 for each packet arrival. The timer unit 150 is used for scheduling the work of the processor core 120.

The processor cores 120 request work from the POW module 148. The POW module 148 selects (that is, schedules) work for one of the processor cores 120 and returns to that processor core a pointer to the work queue entry that describes the work.
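The queue-then-schedule flow can be pictured with a toy model. The sketch below is not the POW hardware: the class name `WorkQueue`, the dictionary entries, and the strict first-in-first-out policy are all assumptions made for illustration, and a returned entry stands in for the pointer that the text describes.

```python
from collections import deque


class WorkQueue:
    """Toy stand-in for the POW module: a FIFO of work queue entries (assumed policy)."""

    def __init__(self):
        self._entries = deque()

    def add_work(self, entry: dict) -> None:
        """Queue one entry, as the packet input unit does for each arriving packet."""
        self._entries.append(entry)

    def get_work(self):
        """Serve a core's request: return the next entry, or None when nothing is queued."""
        return self._entries.popleft() if self._entries else None


if __name__ == "__main__":
    pow_queue = WorkQueue()
    pow_queue.add_work({"packet": 1})
    pow_queue.add_work({"packet": 2})
    print(pow_queue.get_work())
    print(pow_queue.get_work())
    print(pow_queue.get_work())
```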

Each processor core 120 includes an instruction cache 152, a level 1 (L1) data cache 154, and a cryptographic accelerator 156. In one embodiment, the network service processor 110 includes 16 superscalar RISC (reduced instruction set computer) type processor cores 120. In one embodiment, each superscalar RISC-type processor core 120 is an extension of the MIPS64 version 2 processor core.

The level 2 (L2) cache memory 130 and DRAM 108 are shared by all of the processor cores 120 and I/O coprocessor devices. Each processor core 120 is coupled to the level 2 cache memory 130 by a coherent memory bus 144. The coherent memory bus 144 is the communication channel for all memory and I/O transactions between the processor cores 120, the IOB 138, the L2 cache memory 130, and the L2 cache memory controller 131. In one embodiment, the coherent memory bus 144 is scalable to 16 processor cores 120, supports fully coherent L1 data caches 154 with write-through, is highly buffered, and can prioritize I/O.

The L2 cache memory controller 131 maintains memory reference coherence. It returns the latest copy of a block for every fill request, whether the block is stored in the level 2 cache memory 130, in external DRAM 108, or is in flight. The L2 cache memory controller 131 also stores a duplicate copy of the tags for the data cache 154 in each processor core 120. It compares the addresses of cache block store requests against the data cache tags and invalidates (both copies of) a data cache tag for a processor core 120 whenever a store instruction is from another processor core or from an I/O component via the I/O interface 136.

ＤＲＡＭコントローラ１３３は、最大１６メガバイトのＤＲＡＭをサポートする。ＤＲＡＭコントローラ１３３は、ＤＲＡＭ１０８との６４ビットまたは１２８ビットインタフェースをサポートする。ＤＲＡＭコントローラ１３３は、ＤＤＲ−Ｉ（ダブルデータレート）およびＤＤＲ−IIプロトコルをサポートする。 The DRAM controller 133 supports up to 16 megabytes of DRAM. The DRAM controller 133 supports a 64-bit or 128-bit interface with the DRAM 108. The DRAM controller 133 supports DDR-I (double data rate) and DDR-II protocols.

After a packet has been processed by the processor cores 120, the packet output unit (PKO) 146 reads the packet data from memory, performs L4 network protocol post-processing (for example, generates a TCP/UDP checksum), forwards the packet through the GMX/SPX units 122a, 122b, and frees the L2 cache 130 / DRAM 108 used by the packet.

The low-latency memory controller 360 (FIG. 3B) manages in-flight transactions (loads/stores) to and from the LLM 118. The low-latency memory (LLM) 118 is shared by all of the processor cores 120. The LLM 118 can be dynamic random access memory (DRAM), reduced latency dynamic random access memory (RLDRAM), synchronous random access memory (SRAM), fast cycle random access memory (FCRAM), or any other type of low-latency memory known in the art. RLDRAM provides a memory latency of 30 nanoseconds or less (that is, the time taken to satisfy a memory request initiated by the processor 120). Each processor core 120 is directly coupled to the LLM controller 360 by a low-latency memory bus 158. The low-latency memory bus 158 is the communication channel for content-aware application processing between the processor cores 120 and the LLM controller 360. The LLM controller 360 is coupled between the processor cores 120 and the LLM 118 to control access to the LLM 118.

The network service processor 110 also includes application-specific coprocessors that offload the processor cores 120 so that the network service processor achieves high throughput. The compression/decompression coprocessor 132 is dedicated to compressing and decompressing received packets. The deterministic finite automaton (DFA) module 134 includes dedicated DFA engines 370 (FIG. 3B) that accelerate the pattern and signature matching needed by antivirus (AV), intrusion detection system (IDS), and other content processing applications at up to 4 Gbps.

Content-aware application processing uses patterns/expressions (data) stored in the LLM 118. The patterns/expressions may be in the form of a deterministic finite automaton (DFA). A DFA is a state machine. The input to the DFA state machine is a string of bytes (1 byte = 8 bits); that is, the alphabet for the DFA is a byte. Each input byte causes the state machine to transition from one state to the next. The states and the transition function can be represented by a graph 200, as shown in FIG. 2A, where each graph node (nodes 0 to 3) is a state and the different graph arcs interconnecting the nodes represent state transitions for different input bytes. A state may have certain characters associated with it, such as "A...Z, a...z, 0...9". The current state of the state machine is a node identifier that selects a particular graph node. The number of nodes can range from a few nodes up to about 128,000 nodes for a small graph. Larger graphs can have up to 1,000,000 nodes or more.
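One way to picture a graph whose alphabet is a byte is as a table with one row per node and 256 next-node identifiers per row. The sketch below assumes that layout purely for illustration; the patent text here does not specify how nodes are encoded in the LLM, and the names `new_node`, `next_state`, and the two-node toy graph are hypothetical.

```python
from typing import List

ALPHABET_SIZE = 256  # one arc per possible input byte


def new_node(default_next: int = 0) -> List[int]:
    """A node is a row of 256 next-node identifiers, indexed by the input byte."""
    return [default_next] * ALPHABET_SIZE


def next_state(graph: List[List[int]], current: int, byte: int) -> int:
    """One transition: look up the arc for `byte` in the row of the current node."""
    return graph[current][byte]


# Toy graph (assumed): node 0 moves to node 1 on byte 'x'; node 1 always returns to node 0.
graph = [new_node(), new_node()]
graph[0][ord("x")] = 1

if __name__ == "__main__":
    print(next_state(graph, 0, ord("x")))
    print(next_state(graph, 0, ord("y")))
```

With this layout, the current state is just an integer index, which fits the description of the current state as a node identifier that selects a particular graph node.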

説明のための例では、ＤＦＡグラフ２００は、ターゲットの文字列表現「ａｂｃ」を探索するように設計されている。したがって、ＤＦＡグラフは、文字「ａｂｃ」の文字列に正確に一致する入力データを探索するために使用される。この表現は固定長表現であり、これより、ノード数、したがってグラフの深さは既知である（すなわち、一定である）。 In the illustrative example, the DFA graph 200 is designed to search for a target string representation “abc”. Thus, the DFA graph is used to search for input data that exactly matches the string of characters “abc”. This representation is a fixed length representation, from which the number of nodes and hence the depth of the graph is known (ie constant).

To build the DFA graph, the expression is parsed, and a compiler creates a root node (that is, node 0) and adds nodes 1 to 3 to the graph for the intended expression (that is, one additional node for each character of the target string). Continuing with this example, an exemplary input stream of characters contains the string “12abc3”. The input string is searched using the DFA graph to identify the target string expression “abc”.

The initial state of the DFA graph is node 0. Each character, or byte, is read sequentially, and the DFA remains at node 0 until the first character of the target string expression is read. For example, on detecting the first character “a” of the target string expression in the input stream, the arc labeled “a” is followed from node 0 to node 1. The next character of the input stream is then read. If it is anything other than the next character of the target string expression (that is, “b”), the arc labeled “not b” is followed from node 1 back to node 0. However, on detecting the character “b” as the next character in the input stream, the arc labeled “b” is followed from node 1 to node 2. The next character of the input stream is then read. If it is anything other than the next character of the target string expression (that is, “c”), the arc labeled “not c” is followed from node 2 back to node 0. At node 2, however, on detecting the character “c” in the input stream, the arc labeled “c” is followed from node 2 to node 3. Because the target string expression “abc” is a fixed-length expression, node 3 is a terminal node, and the result of the search is reported: that the expression “abc” was found, and the location of the expression in the input stream.
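The walk through the “abc” graph can be reproduced in a few lines. This is a minimal sketch that follows the arcs exactly as the text describes them (every non-matching byte returns to node 0); the table layout and the function names `build_abc_graph` and `search` are assumptions for illustration.

```python
from typing import List, Optional, Tuple

TERMINAL_NODE = 3  # node 3 reports that "abc" was found


def build_abc_graph() -> List[List[int]]:
    """Nodes 0..3; every arc defaults to node 0, like the 'not b' and 'not c' arcs."""
    graph = [[0] * 256 for _ in range(4)]
    graph[0][ord("a")] = 1
    graph[1][ord("b")] = 2
    graph[2][ord("c")] = 3
    return graph


def search(graph: List[List[int]], data: bytes) -> Optional[Tuple[int, int]]:
    """Walk the graph one byte at a time; return (start, end) offsets of the first match."""
    node = 0
    for offset, byte in enumerate(data):
        node = graph[node][byte]
        if node == TERMINAL_NODE:
            return (offset - 2, offset)  # "abc" is three bytes long
    return None


if __name__ == "__main__":
    print(search(build_abc_graph(), b"12abc3"))  # (2, 4)
```

Because the graph is built literally from the description, an input such as `aabc` is not reported: the second `a` at node 1 follows the “not b” arc back to node 0. A compiler that needs to catch such overlaps would typically add an arc on `a` that stays at node 1.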

さらに、コンパイラで１つまたは複数の意図する表現を解析し、意図する表現が要求する通りにグラフの適切なノードを生成することによって、さらに複雑なＤＦＡグラフを同様に生成することができる。したがって、単一のグラフを使用して、固定長、可変長および固定／可変長の組合せである複数の表現を探索することができる。 In addition, more complex DFA graphs can be similarly generated by analyzing one or more intended expressions in the compiler and generating the appropriate nodes of the graph as required by the intended expression. Thus, a single graph can be used to search for multiple representations that are fixed length, variable length and fixed / variable length combinations.

FIG. 3A is a block diagram illustrating a reduced instruction set computer (RISC) processor 120 according to the principles of the present invention. The processor (processor core) 120 includes an integer arithmetic unit 302, an instruction dispatch unit 304, an instruction fetch unit 306, a memory management unit (MMU) 308, a system interface 310, a low-latency interface 350, a load/store unit 314, a write buffer 316, and a security accelerator 156. The processor core 120 also includes an EJTAG interface 330 that allows debug operations to be performed. The system interface 310 controls access to external memory, that is, memory outside the processor 120 such as the external (L2) cache memory 130 or main memory 108.

The integer arithmetic unit 302 includes a multiplication unit 326, at least one register file (main register file) 328, and two holding registers 330a, 330b. The holding registers 330a, 330b are used to store data to be written to the LLM 118 and data that has been read from the LLM 118 using LLM load/store instructions. The holding registers 330a, 330b improve the efficiency of the instruction pipeline by allowing two outstanding loads before the pipeline stalls. Although two holding registers are shown, one or more holding registers may be used. The multiplication unit 326 can perform a direct multiplication of 64-bit registers. The instruction fetch unit 306 includes the instruction cache (ICache) 152. The load/store unit 314 includes the data cache 154. In one embodiment, the instruction cache 152 is 32 kilobytes, the data cache 154 is 8 kilobytes, and the write buffer 316 is 2 kilobytes. The memory management unit 308 includes a translation lookaside buffer (TLB) 340.

In one embodiment, the processor 120 includes a cryptographic acceleration module (security accelerator) 156 that provides cryptographic acceleration for the Triple Data Encryption Standard (3DES), the Advanced Encryption Standard (AES), the Secure Hash Algorithm (SHA-1), and Message Digest Algorithm 5 (MD5). The cryptographic acceleration module 156 communicates with the main register file 328 in the arithmetic unit 302 through transfers. The RSA and Diffie-Hellman (DH) algorithms are performed in the multiplication unit 326.

FIG. 3B is a block diagram showing the DFA module 134 of FIG. 3A. The DFA module 134 includes a low-latency DRAM controller 360, at least one DFA thread engine (DTE) 370 (16 are shown), and instruction input logic 380. The instruction input logic 380 includes a DFA instruction queue 382 and a doorbell 384. The DFA instruction queue 382 queues DFA instructions stored in L2/DRAM (130/108), and the doorbell indicates the number of DFA instructions stored in the DFA instruction queue 382. Software on the core 120 can issue a doorbell write for each individual DFA instruction, or it can accumulate multiple DFA instructions into a single doorbell write. Each DFA instruction contains the information that the DFA module 134 needs to start a DTE 370, read the input data, scan the DFA graph 200 stored in the LLM 118, and write the results to L2/DRAM (130/108). The format of the DFA instruction is described later with reference to FIG. 8A.

The DTE 370 can be used to perform pattern searches. In general, the DTE 370 scans the DFA graph 200 (FIG. 2), which resides in the LLM 118, using the input packet data in order to search for a particular expression in the packet data, which resides in L2/DRAM (130/108). For example, the network service processor can track up to 1000 TCP input streams simultaneously, with each input stream sent to a separate DTE, to search for a particular expression. Before the scan, software on the core 120 must first (i) preload the DFA graph into the LLM 118 via the LLM bus 158, (ii) preload the DFA instructions into L2/DRAM (130/108), and (iii) submit the DFA instructions to the DFA module 134 via the IOB 142. A DFA instruction indicates the DFA graph 200 to be scanned with the input packet data. The DFA module 134 then fetches and queues the DFA instructions and schedules each DFA instruction to one of the 16 available DTEs 370. All DTEs 370 are identical and equivalent, so any DFA instruction can be scheduled to any available DTE 370. On receiving an instruction, the DTE 370 simultaneously (a) fetches packet data from L2/DRAM (130/108) via the IOB 142, (b) for each byte of packet data, issues an LLM DRAM load instruction to move to the next state of the DFA graph for that byte, and (c) writes intermediate and final results back to L2/DRAM (130/108) via the IOB 142.

In general, the DTE 370 is a state machine that can be implemented in hardware, software, or a combination of hardware and software. In some embodiments, the DTE 370 is implemented in hardware using combinational logic. In other embodiments, each DTE 370 is implemented on a different processor. In still other embodiments, the DTEs 370 are implemented on a common processor. For example, each DTE 370 may be a separate task (that is, a sequence of instructions) running on a common processor adapted to provide a shared multitasking environment. Multitasking is a technique by which an operating system shares a single processor among multiple independent jobs (here, the DTEs 370). Additionally or alternatively, each DTE 370 may be a separate process thread running on a common processor adapted to provide multithreading capability. The difference between multithreading and multitasking is that threads generally share more of their environment with one another than tasks do under multitasking. For example, multiple threads may share a single address space and a set of global variables, while each thread is distinguished by the values of its program counter and stack pointer.

FIG. 4A shows the structure of a DFA instruction queue 400 stored in L2/DRAM (130/108). Each instruction queue is a linked list of chunks, or buffers, 402. Each chunk 402 contains at least three DFA instructions 404, which together make up the total chunk size 406. If another chunk (for example, 402') exists, a next chunk buffer pointer 408 immediately follows the last DFA instruction 404 in the chunk 402.

To insert a packet into the DFA instruction queue 400, software on the core 120 writes a DFA instruction 404 into the DFA instruction queue 400, allocating a chunk if necessary, and then writes to the DFA doorbell 384 the number of DFA instructions 404 added to the DFA instruction queue 400. The DFA module 134 reads the DFA instruction queue 400 (starting at the tail 410) and, on reaching the last instruction of a chunk (for example, 404/404'''), follows the next chunk buffer pointer 408 to the next chunk (for example, 402'/402''). When the DFA module 134 jumps between chunks 402 in this way, it releases the previous chunk (for example, 402/402') to the FPA 128 (FIG. 1B).

The DFA module 134 maintains the tail pointer 410 of the DFA instruction queue 400, and software on the core 120 maintains the head pointer 412 of the DFA instruction queue 400. The distance between the tail pointer 410 and the head pointer 412 equals both the size of the DFA instruction queue 400 and the outstanding doorbell count. The size of the DFA instruction queue 400 is limited only by the available memory and by the 20-bit outstanding doorbell counter for the DFA instruction queue 400.
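The queue handling described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the class and method names are hypothetical, the chunk capacity of three instructions is chosen for illustration, and freeing a chunk is modeled as a counter rather than a return to the FPA 128.

```python
from dataclasses import dataclass, field
from typing import List, Optional

CHUNK_CAPACITY = 3                 # assumed number of instructions per chunk
DOORBELL_MAX = (1 << 20) - 1       # the outstanding doorbell counter is 20 bits


@dataclass
class Chunk:
    instructions: List[str] = field(default_factory=list)
    next_chunk: Optional["Chunk"] = None   # stands in for the next chunk buffer pointer


class InstructionQueue:
    """Hypothetical model: software owns the head, the DFA module owns the tail."""

    def __init__(self) -> None:
        first = Chunk()
        self.head_chunk = first    # where software writes new instructions
        self.tail_chunk = first    # where the module reads instructions
        self.tail_index = 0
        self.doorbell_count = 0    # instructions announced but not yet fetched
        self.freed_chunks = 0

    def submit(self, instructions: List[str]) -> None:
        """Write instructions, allocating chunks as needed, then ring the doorbell once."""
        if self.doorbell_count + len(instructions) > DOORBELL_MAX:
            raise OverflowError("outstanding doorbell count exceeds 20 bits")
        for instruction in instructions:
            if len(self.head_chunk.instructions) == CHUNK_CAPACITY:
                new_chunk = Chunk()
                self.head_chunk.next_chunk = new_chunk
                self.head_chunk = new_chunk
            self.head_chunk.instructions.append(instruction)
        self.doorbell_count += len(instructions)   # single accumulated doorbell write

    def fetch(self) -> Optional[str]:
        """Read the next instruction from the tail, or None if nothing is outstanding."""
        if self.doorbell_count == 0:
            return None
        if self.tail_index == CHUNK_CAPACITY:
            # Follow the next-chunk pointer and release the chunk just finished.
            self.tail_chunk = self.tail_chunk.next_chunk
            self.tail_index = 0
            self.freed_chunks += 1
        instruction = self.tail_chunk.instructions[self.tail_index]
        self.tail_index += 1
        self.doorbell_count -= 1
        return instruction


queue = InstructionQueue()
queue.submit(["i0", "i1", "i2", "i3", "i4"])
fetched = [queue.fetch() for _ in range(5)]
```

In this model the doorbell count plays the role of the distance between the tail and head pointers: it rises when software submits instructions and falls as the module fetches them.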

FIG. 4B shows the format 450 of the next chunk buffer pointer. The next chunk buffer pointer is a 64-bit word that contains a 36-bit address (Addr) field 452. The Addr field 452 selects a valid L2/DRAM (130/108) byte location for the next chunk 400 containing the next DFA instruction 402. Although the Addr field 452 is a byte address, it is naturally aligned on a 128-byte cache block boundary by setting its seven least significant bits to zero.
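As a rough illustration of the alignment rule, the following sketch builds and decodes such a pointer. It assumes, purely for illustration, that the Addr field occupies the low 36 bits of the 64-bit word; the actual bit positions are defined by FIG. 4B, and the function names are hypothetical.

```python
ADDR_BITS = 36          # width of the Addr field
CACHE_BLOCK = 128       # bytes; alignment means the low 7 address bits are zero


def make_next_chunk_pointer(addr: int) -> int:
    """Return a 64-bit pointer word for a 128-byte-aligned chunk address."""
    if not 0 <= addr < (1 << ADDR_BITS):
        raise ValueError("address does not fit in 36 bits")
    if addr % CACHE_BLOCK != 0:
        raise ValueError("chunk address must be 128-byte aligned")
    return addr  # assumption: Addr sits in the low 36 bits, other bits zero


def next_chunk_address(word: int) -> int:
    """Extract the Addr field and force the seven low bits to zero."""
    return word & ((1 << ADDR_BITS) - 1) & ~(CACHE_BLOCK - 1)
```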

FIG. 5A shows the structure of a DFA graph 500 stored in the LLM 118. The DFA graph 500 contains N nodes, 510a through 510n. Each node 510 in the DFA graph 500 is simply an array of 256 next node pointers 512, one for each possible input byte value. Each next node pointer 512 contains a next node ID 514 that directly specifies the next node/state for the input byte.
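The node layout lends itself to a flat table in which the pointer for a given node and input byte is found by simple indexing. The sketch below is a software model only: the function names and the three-node example graph are hypothetical, and each table entry holds just a next node ID.

```python
from typing import List

POINTERS_PER_NODE = 256   # one next node pointer per possible input byte value


def build_graph(num_nodes: int, default_next: int = 0) -> List[int]:
    """Allocate a flat table of num_nodes * 256 next node IDs."""
    return [default_next] * (num_nodes * POINTERS_PER_NODE)


def set_transition(graph: List[int], node: int, byte: int, next_node: int) -> None:
    graph[node * POINTERS_PER_NODE + byte] = next_node


def walk(graph: List[int], start_node: int, data: bytes) -> List[int]:
    """Perform one table load per input byte and record the nodes reached."""
    node = start_node
    reached = []
    for byte in data:
        node = graph[node * POINTERS_PER_NODE + byte]
        reached.append(node)
    return reached


# Hypothetical three-node graph: 0 --'a'--> 1 --'b'--> 2, everything else to 0.
graph = build_graph(3)
set_transition(graph, 0, ord("a"), 1)
set_transition(graph, 1, ord("b"), 2)
path = walk(graph, 0, b"xab")
```

Because every node has an entry for every byte value, each input byte costs exactly one lookup regardless of the size of the graph.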

The DFA module 134 supports either an 18-bit next node pointer storage format 516 or a 36-bit next node pointer storage format 518. With 18-bit pointers, each node 510 requires 18*256 bits, or 512 bytes, of LLM 118 storage. Each next node pointer 516 consists of a 17-bit next node ID and one parity bit. The parity is even (that is, P is the XOR (exclusive OR) of all bits in the 17-bit next node ID 514). With 36-bit pointers, each node 510 requires 36*256 bits, or approximately 1 kilobyte, of LLM 118 storage. Replication therefore increases the storage capacity required. Each next node pointer 518 consists of a 20-bit next node ID, a 2-bit type value, a 7-bit SECDED ECC code, and seven unused bits that are set to zero. The DTE 370 uses the SECDED ECC code in the 36-bit pointer to correct all single-bit errors automatically and to detect all double-bit errors. The type value indicates the type of the next node, for example, 0 = normal, 1 = marked, 2 = terminal.
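The even-parity rule for the 18-bit format can be sketched as follows. Placing the parity bit in bit 17, above the 17-bit node ID, is an assumption made for illustration, and the function names are hypothetical.

```python
from typing import Tuple

NODE_ID_BITS = 17
NODE_ID_MASK = (1 << NODE_ID_BITS) - 1


def parity_of(node_id: int) -> int:
    """XOR of all bits of the 17-bit next node ID."""
    return bin(node_id & NODE_ID_MASK).count("1") & 1


def encode_pointer18(node_id: int) -> int:
    """Pack a node ID and its parity bit into an 18-bit pointer."""
    if not 0 <= node_id <= NODE_ID_MASK:
        raise ValueError("node ID does not fit in 17 bits")
    return (parity_of(node_id) << NODE_ID_BITS) | node_id   # assumed bit layout


def decode_pointer18(word: int) -> Tuple[int, bool]:
    """Return (node ID, parity_ok); a failed check corresponds to PERR."""
    node_id = word & NODE_ID_MASK
    stored_parity = (word >> NODE_ID_BITS) & 1
    return node_id, stored_parity == parity_of(node_id)
```

A single flipped bit in either the node ID or the parity bit makes the check fail, which is all that a lone parity bit can detect; correction requires the SECDED code of the 36-bit format.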

The DTE 370 supports the following three special node pointer conditions:
1. PERR - The next node pointer contains an error. The DTE 370 generates a result word indicating the failing LLM 118 location. The DTE 370 stops scanning the graph 500.
2. TERM - The next node is a terminal node, and scanning of the graph should stop. The DTE 370 generates a result word indicating the byte that led to the terminal node, the previous node ID, and the next node ID. The DTE 370 stops scanning the graph 500.
3. MARKED - This transition is marked for later analysis by software on the core 120. The DTE 370 generates a result word indicating the byte that led to the marked node, the previous node ID, and the next node ID. The DTE 370 continues scanning the graph 500.

In 18-bit mode, the DTE 370 determines the special TERM and MARKED conditions by comparing the next node ID. In this case, every transition that enters a marked node is marked. In 36-bit mode, the DTE 370 determines the special TERM and MARKED conditions directly from the type field in the next node pointer. In 36-bit mode, individual transitions, rather than only individual nodes, can be marked.

FIG. 5B shows all possible 17-bit node IDs and how they are classified in 18-bit mode. The terminal node IDs 502 are not backed by actual storage in the LLM 118. The normal nodes 504 and the marked nodes 506, however, are backed by actual LLM 118 storage. The DFA instruction 404 includes the number of terminal nodes 503, which is TSize (FIG. 8A) stored in IWORD3 (FIG. 8A), and the number of marked nodes 507, which is MSize (FIG. 8A), also stored in IWORD3.

While scanning the graph 500, the DTE 370 generates a result word whenever an exceptional condition occurs. A next node pointer that is MARKED, TERM, or PERR is such an exceptional condition. There are two further exceptional conditions: completion of the input data and exhaustion of the result space. Scanning the graph for one input byte may raise several exceptional conditions, but a single input byte produces at most one result word. For example, when the last input byte meets the input data completion condition, it generates a result word. The last input byte may also encounter a marked next node, but no second result word is generated. Scanning of the graph stops when any of the exceptional conditions PERR, TERM, input data completion, or result space exhaustion occurs (listed in priority order), and the DTE 370 reports the highest-priority condition. For example, referring to the graph of FIG. 2, the next node is a terminal node upon reaching node "c", and the DTE 370 stops scanning the graph.
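The rule of one result word per input byte can be modeled as a small priority resolver. This is a sketch, not the hardware logic: the enumeration and function names are hypothetical, and treating MARKED as the lowest-priority condition is an assumption drawn from the example in which input data completion is reported instead of a marked node.

```python
from enum import IntEnum
from typing import Optional, Set, Tuple


class Condition(IntEnum):
    # Lower value means higher priority, following the order given in the text.
    PERR = 0
    TERM = 1
    INPUT_DONE = 2
    RESULT_FULL = 3
    MARKED = 4        # assumed lowest priority; never stops the scan by itself


STOPPING = {Condition.PERR, Condition.TERM,
            Condition.INPUT_DONE, Condition.RESULT_FULL}


def resolve(conditions: Set[Condition]) -> Tuple[Optional[Condition], bool]:
    """Return (condition reported in the single result word, stop flag)."""
    if not conditions:
        return None, False
    stopping = [c for c in conditions if c in STOPPING]
    if stopping:
        return min(stopping), True      # report only the highest-priority condition
    return Condition.MARKED, False      # marked transition: report and keep scanning
```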

Each DFA instruction can specify how the data to be processed by the DTE 370 is stored in L2/DRAM (direct or gather). In either case, the DFA module 134 reads the bytes from L2/DRAM (130/108).

FIG. 6 shows an exemplary direct mode 600 for obtaining the data to be processed by the DTE 370. The DFA instruction 404 directly specifies the starting location and the number of bytes. The DTE 370 that processes the corresponding DFA instruction 404 reads contiguous bytes from L2/DRAM (130/108) and processes them.

FIG. 7A shows an exemplary gather mode 700 for obtaining the data to be processed by the DTE 370. The DFA instruction 404 directly specifies the starting location and size of a list of DFA gather pointers 710. Each entry in the DFA gather pointer 710 list specifies a starting location and a number of bytes for the DTE 370 to process. The total input byte stream for the DTE 370 is the concatenation of the bytes specified by each entry in the DFA gather pointer 710 list.

FIG. 7B shows the format of the 64-bit DFA gather pointer 710. The DFA gather pointer 710 contains a length 712 (in bytes) and an address field 714 (an L2/DRAM address). The DFA gather pointers 710 are naturally aligned on 64-bit boundaries, but the bytes in L2/DRAM that they point to may be at any location and need not be naturally aligned. In gather mode 700, the total number of bytes is the sum of the length fields of all the DFA gather pointers 710.
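The difference between the two input modes can be shown with a byte string standing in for L2/DRAM. The names below are hypothetical, and a gather pointer is modeled as a plain (length, address) pair rather than a packed 64-bit word.

```python
from typing import List, NamedTuple


class GatherPointer(NamedTuple):
    length: int     # number of bytes (field 712)
    address: int    # byte address in the modeled memory (field 714)


def read_direct(memory: bytes, start: int, length: int) -> bytes:
    """Direct mode: one contiguous run of bytes."""
    return memory[start:start + length]


def read_gather(memory: bytes, pointers: List[GatherPointer]) -> bytes:
    """Gather mode: concatenate the bytes named by each list entry, in order."""
    return b"".join(memory[p.address:p.address + p.length] for p in pointers)


memory = b"0123456789abcdef"
pointers = [GatherPointer(3, 10), GatherPointer(2, 0), GatherPointer(4, 5)]
stream = read_gather(memory, pointers)
```

The length of the gathered stream equals the sum of the length fields, and the segments need not be adjacent or aligned in memory.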

Referring again to FIG. 4A, each DFA instruction 404 provides the information that the DFA module 134 needs to (i) start a DTE 370, (ii) read the input data, (iii) scan the graph 200 in the LLM 118, and (iv) write the results. A DFA instruction 404 may include multiple instruction words, as in the exemplary DFA instruction format shown in FIG. 8A. Each DFA instruction 404 includes four independent words 455', 455'', 455''', 455'''' (collectively 455). Each of these words contains 64 bits, for a total of 32 bytes in the level-2 cache memory 130 or DRAM 108. Preferably, each DFA instruction 404 is naturally aligned on a 32-byte boundary. The DFA instruction 404 is processed by the individual DTE 370 to which it is scheduled. The DFA instruction 404 includes fields that identify both the location of the input bytes and the location of the results.

In operation, when the DFA instruction queue 382 holds a valid DFA instruction 404, the DFA module 134 reads the DFA instruction 404 and the input data from the level-2 cache memory 130 or DRAM 108 and writes results as they are generated (for example, byte by byte). The DFA module 134 can also optionally submit a work queue entry to be scheduled by the POW 148 (FIG. 1B) after it finishes, so the DFA instruction 404 may also include a field for a work queue pointer.

More specifically, the first DFA instruction word 455' includes a start node ID 460 that identifies the particular DFA graph to be used by its first node. The first word 455' also provides additional information, such as a replication field 462 that stores a replication value corresponding to the number of copies of the identified graph stored in the LLM 118. A type value 464 indicating the type of addressing used (18 or 36 bits) may also be provided. The exemplary 64-bit word also includes one or more reserved fields.

The second DFA instruction word 455'' includes a length field 470 that specifies the number of bytes to be processed by the DFA module 134 and an address field 474 that specifies the location, in the level-2 cache memory 130 or DRAM 108, of the packet data to be processed.

The third DFA instruction word 455''' includes a result address field 482 that specifies the address (for example, an address in the level-2 cache memory 130 or DRAM 108) to which any results are written, and a maximum results field 480 that stores a value indicating the maximum number of results allowed. In addition, because the DFA module 134 can optionally submit a work queue entry after it finishes, the DFA instruction 404 includes a work queue processing (WQP) field 490 for one or more work queue pointers.

FIG. 8B shows the result format 800 of a DFA instruction 404. The DFA result 800 has two or more 64-bit words in L2/DRAM (130/108). Each word is naturally aligned in L2/DRAM (130/108). The DFA module 134 writes these words to L2/DRAM (130/108) during and after processing of the DFA instruction 404. The structure is variable in length to accommodate DFA instructions 404 that hit a variable number of marked nodes, but the result length may be capped by the maximum results field of the DFA instruction.

As described above, a node type can be associated with any one or more of the nodes of a DFA graph by using the type field provided with the 36-bit pointer 518 (FIG. 5A). While scanning the graph, the DTE 370 generates a result word whenever an exceptional condition occurs. At least one exceptional condition is a terminal node. When the DTE 370 reaches a terminal node, the end of the DFA graph has been reached, and scanning by the DTE 370 stops. Another example of an exceptional condition is a marked node. In contrast to a terminal node, scanning of the graph does not necessarily stop when the DTE 370 reaches a marked node. A result identifying the particular marked node is, however, written to an output word for later analysis. Marked nodes can therefore be used to identify when the corresponding nodes in the graph are scanned.

WORD0 of the DFA result 800 may be written more than once by the DFA module 134. Only the last write to WORD0 contains the valid DFA result 800. Although the DFA module 134 may write WORD0 multiple times, only the last write can set bit 16. That is, bit 16 is not set by the DFA module 134 until the DFA module 134 has completed the DFA instruction 404. By writing zero to bit 16 of the result WORD0 before passing the DFA instruction 404 to the DFA module 134, software can poll bit 16 of WORD0 to determine when the DFA module 134 has completed the DFA instruction 404. When bit 16 of WORD0 of the DFA result is set, the entire result is present.
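The completion handshake can be sketched in software as below. The generator is a hypothetical stand-in for the DFA module writing WORD0; only the use of bit 16 as the completion flag comes from the text, and the values written to the other bits are placeholders.

```python
from typing import Iterator, List

DONE_BIT = 1 << 16   # bit 16 of WORD0 marks a completed DFA instruction


def simulated_module_writes(result: List[int], intermediate_writes: int) -> Iterator[None]:
    """Hypothetical DFA module: several WORD0 writes, only the last sets bit 16."""
    for count in range(1, intermediate_writes + 1):
        result[0] = count                       # intermediate write, bit 16 clear
        yield
    result[0] = (intermediate_writes + 1) | DONE_BIT   # final write sets bit 16
    yield


def poll_until_done(result: List[int], module: Iterator[None], max_polls: int = 1000) -> int:
    """Poll bit 16 of WORD0; return how many polls were needed."""
    for polls in range(1, max_polls + 1):
        next(module, None)                      # let the simulated module make progress
        if result[0] & DONE_BIT:
            return polls
    raise TimeoutError("DFA instruction did not complete")


result = [0xFFFFFFFF]          # stale contents left over from an earlier use
result[0] &= ~DONE_BIT         # software clears bit 16 before submitting
module = simulated_module_writes(result, intermediate_writes=2)
polls = poll_until_done(result, module)
```

Clearing bit 16 before submission matters because a stale flag from a previous result would otherwise make the first poll report completion immediately.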

In another example, shown in FIG. 2B, the graph of FIG. 2A is extended to find one or more occurrences of two different strings, "abcd" and "abce". Two additional nodes, nodes 4 and 5, are therefore added to the graph of FIG. 2A (for example, node 4 for "d" and node 5 for "e"), one node for the fourth character of each string. Because the first three characters of both strings are the same, nodes 4 and 5 are connected to node 3 as shown. Preferably, all occurrences of either string are identified in a single "pass" through the input string.

An exemplary input string, such as "xwabcd454abceabcdsfk", is run through the DFA, producing three "marked" transitions. A marked transition occurs at the end of each string segment located within the input string (for example, one at each position where a "d" or "e" is present). The three marked transitions thus indicate that three strings were found. The first and last marks indicate transitions from node 3 to node 4 and represent the presence and location of the string "abcd" within the input string (that is, DTE byte = 5, previous = 3, next = 4, and DTE byte = 17, previous = 3, next = 4). The middle mark indicates the transition from node 3 to node 5 and represents the presence of the string "abce" within the input string (that is, DTE byte = 13, previous = 3, next = 5). With 18-bit pointers, nodes 4 and 5 are marked; with 36-bit pointers, the arcs from node 3 to nodes 4 and 5 are marked. Thus, by using the DFA marking technique in combination with the DFA thread engines, the presence and location of multiple different strings within the same input string can be found in a single pass through the input string.
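A software model of this example might look as follows. It is a sketch under several assumptions: the transitions not spelled out in the text (every "a" leads to node 1 and every other unmatched byte returns to node 0) are a guess at FIG. 2B, the function names are hypothetical, and the offsets reported are zero-based indexes into the input, which may not follow the same counting convention as the DTE byte values quoted above.

```python
from typing import List, Set, Tuple


def build_abcd_abce_graph() -> Tuple[List[List[int]], Set[Tuple[int, int]]]:
    """Six nodes (0..5), each an array of 256 next node IDs, plus marked arcs."""
    table = [[0] * 256 for _ in range(6)]
    for node in range(6):
        table[node][ord("a")] = 1       # assumed: an 'a' always restarts a match
    table[1][ord("b")] = 2
    table[2][ord("c")] = 3
    table[3][ord("d")] = 4              # completes "abcd"
    table[3][ord("e")] = 5              # completes "abce"
    marked_arcs = {(3, 4), (3, 5)}      # per-transition marking, as in 36-bit mode
    return table, marked_arcs


def scan(table: List[List[int]], marked_arcs: Set[Tuple[int, int]],
         data: bytes) -> List[Tuple[int, int, int]]:
    """Return (offset, previous node, next node) for each marked transition."""
    node = 0
    results = []
    for offset, byte in enumerate(data):
        next_node = table[node][byte]
        if (node, next_node) in marked_arcs:
            results.append((offset, node, next_node))   # scanning continues
        node = next_node
    return results


table, marked_arcs = build_abcd_abce_graph()
marks = scan(table, marked_arcs, b"xwabcd454abceabcdsfk")
```

The scan makes a single pass over the input and records three marked transitions, two into node 4 and one into node 5, without stopping at any of them.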

This application is related to US Provisional Patent Application No. 60/609,211 filed on September 10, 2004, US Patent Application No. 11/024,002 filed on December 28, 2004, US Provisional Patent Application No. 60/669,603 entitled "Deterministic Finite Automata (DFA) Instruction" filed on April 8, 2005, and US Provisional Patent Application No. 60/669,655 entitled "Selective Replication of Data Structures" filed on April 8, 2005. The entire contents of the above applications are incorporated herein by reference.

Although the invention has been shown and described in detail with reference to preferred embodiments thereof, those skilled in the art will understand that various changes in form and detail may be made without departing from the scope of the invention as encompassed by the appended claims.

FIG. 1A is a block diagram showing a network service processing system including a network service processor according to the principles of the present invention.
FIG. 1B is a block diagram showing the network service processor shown in FIG. 1A.
FIG. 2A is a schematic diagram showing an exemplary DFA graph.
FIG. 2B is a schematic diagram showing an exemplary DFA graph.
FIG. 3A is a block diagram showing a reduced instruction set computer (RISC) processor according to the principles of the present invention.
FIG. 3B is a block diagram showing the DFA module of FIG. 3A.
FIG. 4A is a diagram showing the structure of a DFA instruction queue.
FIG. 4B is a diagram showing the format of the next chunk buffer pointer.
FIG. 5A is a diagram showing another embodiment of a typical DFA graph.
FIG. 5B is a diagram showing the different possible node IDs of the DFA graph of FIG. 5A.
FIG. 6 is a diagram showing an example of a direct mode for constructing the data to be processed by a DTE.
FIG. 7A is a diagram showing an example of a gather mode for constructing the data to be processed by a DTE.
FIG. 7B is a diagram showing the format of a DFA gather pointer.
FIG. 8A is a diagram showing the DFA instruction format.
FIG. 8B is a diagram showing the DFA result format.

Explanation of symbols

110 Processor
120 Processor core
400 DFA module

Claims

1. A network processor comprising:
at least one processor core; and
a deterministic finite automaton (DFA) module operating asynchronously with the at least one processor core, the DFA module scanning, in response to an instruction from the at least one processor core, a plurality of nodes of at least one DFA graph stored in non-cache memory using packet data stored in cache coherent memory.

2. The network processor according to claim 1, wherein the DFA module comprises:
a non-cache memory controller adapted to access a memory storing the DFA graph;
at least one DFA thread engine in communication with the non-cache memory controller; and
instruction input logic adapted to schedule instructions from the at least one processor core to the at least one DFA thread engine.

3. The network processor according to claim 2, further comprising an instruction queue, wherein the at least one processor core submits DFA instructions for the DFA module to the instruction queue.

4. The network processor according to claim 3, wherein the DFA module maintains a pointer to the instruction queue.

5. The network processor according to claim 3, wherein the DFA instruction indicates the packet data stored in cache coherent memory to be used and the at least one DFA graph stored in non-cache memory to be scanned.

6. The network processor according to claim 3, wherein the DFA module schedules DFA instructions to the at least one DFA thread engine.

7. The network processor according to claim 6, wherein the at least one DFA thread engine:
fetches packet data stored in the cache coherent memory;
issues a non-cache memory load instruction for each byte of packet data received from the cache coherent memory, to scan to the next state of the DFA graph stored in the non-cache memory; and
writes intermediate and final results to the cache coherent memory.

8. The network processor according to claim 7, further comprising a result word into which the intermediate and final results are written.

9. The network processor according to claim 8, wherein the result word includes an instruction completion field that, when set, indicates completion of a DFA instruction.

10. The network processor according to claim 1, wherein the DFA module comprises a plurality of DFA thread engines associated with the at least one processor core in a shared configuration, each DFA thread engine being adapted to scan at least one DFA graph stored in the non-cache memory.

11. The network processor according to claim 1, further comprising a node type identifier that identifies a node type of the DFA graph.

12. The network processor according to claim 11, wherein the node type identifier identifies a marked node, and scanning the marked node does not interrupt scanning of the graph.

13. A method of scanning a DFA graph using input packet data, comprising:
storing at least one DFA graph in a non-cache coherent memory;
storing a DFA instruction in a cache coherent memory, the DFA instruction indicating the packet data stored in the cache coherent memory to be used and the at least one DFA graph stored in the non-cache memory to be scanned; and
scanning the DFA graph with the stored packet data and writing intermediate and final results to the cache coherent memory.

14. The DFA graph scanning method according to claim 13, wherein at least one processor core submits a DFA instruction to a DFA module.

15. The DFA graph scanning method according to claim 14, wherein the DFA module schedules the DFA instruction to at least one DFA thread engine.

16. The DFA graph scanning method according to claim 15, wherein the at least one DFA thread engine:
fetches packet data stored in the cache coherent memory;
issues one non-cache memory load instruction for each byte of packet data fetched from the cache coherent memory;
scans to the next state of the DFA graph stored in the non-cache memory according to the fetched packet data byte; and
writes intermediate and final results to the cache coherent memory.

17. The DFA graph scanning method according to claim 16, further comprising providing a node type identifier that identifies an individual node type for each node of the DFA graph, wherein the intermediate and final results are determined from the node type identifier.

18. A network processor comprising:
means for storing at least one DFA graph in non-cache coherent memory;
means for storing a DFA instruction in a cache coherent memory, the DFA instruction indicating the packet data stored in the cache coherent memory to be used and the at least one DFA graph stored in the non-cache memory to be scanned; and
means for scanning the DFA graph using the stored packet data and writing intermediate and final results to the cache coherent memory.