JP3788822B2

JP3788822B2 - Computer system and failure recovery method in the system

Info

Publication number: JP3788822B2
Application number: JP15124996A
Authority: JP
Inventors: 美生増渕
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1996-06-12
Filing date: 1996-06-12
Publication date: 2006-06-21
Anticipated expiration: 2016-06-12
Also published as: JPH09330303A

Description

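[0001]
[Technical field of the invention]
The present invention relates to a computer system and a failure recovery method for it, and in particular to a computer system and failure recovery method improved so that failures caused by a permanent (hard) fault in the main memory can be recovered.
[0002]
[Prior art]
Computer systems generally use memory with parity to improve reliability against memory faults. In a memory subsystem with parity memory, a checksum of the data is computed when the data is read, and a memory error is detected by comparing that checksum with the parity bit. This prevents erroneous memory data from being used.
[0003]
Parity alone, however, cannot identify which bit of the memory data is wrong, so it cannot correct errors.
A memory subsystem that uses a redundant code such as an SEC-DED code, by contrast, can correct 1-bit errors and detect 2-bit errors. Even if one bit of data is permanently wrong, it is corrected automatically and processing continues. For computer systems that require high reliability, a memory subsystem using a redundant code such as an SEC-DED code is therefore preferable to parity memory.
[0004]
However, when a memory subsystem using a redundant code such as an SEC-DED code is introduced into an existing computer system that uses parity memory, the existing parity memory cannot be used as is, and the large-capacity main memory must be rebuilt to support the SEC-DED code. The introduction is therefore costly.
[0005]
Fault-tolerant computer systems, on the other hand, duplicate the memory to mask all memory faults. Because the same data is always held in two memories, when a data error is detected, processing can continue using the data in the other memory.
[0006]
This approach has drawbacks of its own: duplicating the memory greatly increases the amount of hardware, and special structures are needed, for example to switch the accessed memory when an error is detected.
[0007]
For this reason, a checkpoint restart method using a log memory that stores update history information of the main memory has recently been proposed as a way to recover from general failures with additional hardware and without duplicating the memory. In this method, the information needed to re-execute a process is saved in the main memory at each checkpoint. During the period from one checkpoint to the next, each time the main memory is updated by process execution, the pre-update data and related information are collected in the log memory as the update history information. When a failure occurs in the computer system, the contents of the log memory are used to restore the main memory to the checkpoint preceding the failure. The checkpoint restart method using a log memory can therefore restore the memory contents with little hardware and without duplicated memory.
[0008]
With this method, however, a failure in which the memory contents have been permanently altered may be unrecoverable even if it can be detected. If the value of the memory data was altered before the most recent checkpoint, returning to that checkpoint and resuming processing when the fault is detected simply causes the erroneous memory data to be read again, so recovery is impossible.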
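[0009]
[Problems to be solved by the invention]
As described above, introducing a memory subsystem using a redundant code such as an SEC-DED code into an existing computer system that uses parity memory has the drawback that the existing parity memory cannot be used as is, the large-capacity main memory must be rebuilt to support the SEC-DED code, and the introduction is costly.
[0010]
The checkpoint restart method using a log memory that stores update history information of the main memory is available as a way to recover from general failures with additional hardware and without duplicating the memory, but it sometimes cannot handle memory faults in which data is permanently altered.
[0011]
The present invention has been made in view of these points. Its object is to allow a memory subsystem with an error correction function to be built with additional hardware while existing resources such as parity memory remain in use as is, and thereby to provide a computer system that is highly reliable against memory faults.
[0012]
A further object of the invention is to provide a computer system and a failure recovery method that can continue processing even when a memory fault occurs that the checkpoint restart method using a log memory cannot recover from, achieving sufficient fault tolerance with little hardware and without duplicating the memory.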
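[0013]
[Means for solving the problems]
The present invention is a computer system having one or more CPUs and a main memory with parity that is connected to the CPUs via a bus and stores data to which parity has been added, the computer system comprising: a cache memory; a memory controller that is provided between the bus and the main memory and controls the main memory; a redundant code memory that has a plurality of storage areas, one for each address serving as the unit of read/write access to the main memory, each storage area holding a redundant code capable of correcting an error arising in part of the data stored at the corresponding address of the main memory; a control device that is provided between the bus and the redundant code memory and controls the redundant code memory, the control device monitoring bus transactions issued on the bus and, when a write-back from the cache memory to the main memory is executed, generating the redundant code for the data from the value of the data on the bus and storing that redundant code in the storage area of the redundant code memory corresponding to the write address of the data; and error correction means that, when a data error in data read from the main memory is detected by the memory controller's parity check, reconstructs the correct data from the read data and the redundant code stored in the redundant code memory for that read data.
[0014]
This computer system is provided with a redundant code memory and a control device that controls it. When monitoring of the bus transactions issued on the bus detects that a write-back from the cache memory to the main memory is being executed, the control device automatically generates the redundant code for the data from the value of the data on the bus and stores it in the corresponding storage area of the redundant code memory. When a data error in data read from the main memory is detected by the memory controller's parity check, the correct data is reconstructed from the read data and its redundant code. Because the correct data can be reconstructed from the erroneous data and its redundant code whenever an error is detected in main memory data, a memory subsystem with an error correction function can be built simply by adding the redundant code memory and the control device as additional hardware, and a computer system that is highly reliable against memory faults can be realized.
[0015]
It is preferable that the system further comprise a log memory that is connected to the control means and stores update history information of the main memory, and that the control means be configured as follows. Before a data write to the main memory by the CPU is executed, the control means reads the pre-update data of the main memory at the address to be written and the redundant code corresponding to that pre-update data from the main memory and the redundant code memory, respectively, and stores them in the log memory as the update history information. When a failure occurs that requires the main memory contents to be restored to the state before the failure, the control means writes the pre-update data and redundant code of each item of update history information in the log memory back to the main memory and the redundant code memory, respectively, restoring the main memory to the state before the failure and returning the redundant code memory to the state that matches the restored main memory contents.
[0016]
With this configuration, the main memory contents can be restored to the state before the failure using the log memory, and the correct data can be reconstructed using the redundant code even when a memory fault occurs. By reconstructing the correct data with the redundant code and then resuming processing from the checkpoint preceding the failure, processing can continue even after a memory fault that the checkpoint restart method alone could not recover from.
[0017]
Because the redundant code corresponding to the pre-update data is stored in the log memory together with the pre-update data, when a failure occurs that requires the main memory contents to be restored to the state before the failure, writing the pre-update data and redundant code of each item of update history information back to the main memory and the redundant code memory restores the main memory to the state before the failure and also returns the redundant code memory to the state that matches the restored main memory contents.
[0018]
Update history information containing the pre-update data must be stored in the log memory before the data write to the main memory is executed. In a system with a cache memory, this is easily achieved as follows: when the CPU writes data to the cache memory, the pre-update data of the main memory at the written address and its redundant code are read from the main memory and the redundant code memory, respectively, and stored in the log memory as update history information. The update history information is thus stored before the data write to the main memory takes place.
[0019]
In place of the redundant code memory described above, a vertical parity memory may be used. It has a plurality of storage areas, one for each unit data block of the main memory, a unit data block consisting of a plurality of data strings that are accessed consecutively. Each storage area holds vertical parity data computed over the same bit position of each of the data strings belonging to the corresponding unit data block. With this memory, the correct block data can be reproduced after a memory fault using nothing more than simple parity processing.
[0020]
Further, in place of the vertical parity memory, a block parity memory may be adopted. It has a plurality of storage areas, one for each data block group, a data block group being a set of unit data blocks each of which contains a plurality of data strings in the main memory. Each storage area holds block parity data consisting of vertical parity data computed over the same bit position of each of the unit data blocks belonging to the corresponding data block group. With this memory, the correct block data can be reproduced even after a memory fault that affects a wide range, such as an entire memory module.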
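[0021]
[Embodiments of the invention]
Embodiments of the present invention are described below with reference to the drawings.
FIG. 1 conceptually shows the configuration of a computer system according to a first embodiment of the invention. This computer system is a multiprocessor system that adopts a failure recovery scheme in which the information needed for failure recovery is stored in the main memory at each checkpoint and, when a failure occurs, the update history information of the main memory held in a log memory is used to restore the main memory contents to the checkpoint preceding the failure. As shown, it comprises a processor bus 10, CPUs 11-1 to 11-n, cache memories 12-1 to 12-n, a main memory 14, a redundant code memory (CM) 16, and a before-image buffer (BIB) 17.
[0022]
The cache memories 12-1 to 12-n are used as primary or secondary caches of the respective CPUs 11-1 to 11-n, which share the main memory 14. When a checkpoint is taken, the data in each of the cache memories 12-1 to 12-n that has not yet been reflected in the main memory 14 is written to the main memory 14.
[0023]
The main memory 14 is a memory with an error detection function, such as parity memory. A parity bit is added to the data string of each word, a word being the unit of data read or written in one memory access by a CPU.
[0024]
The redundant code memory (CM) 16 is provided to add an error correction function to the main memory 14, which has only an error detection function, and has as many entries as the main memory 14 has words. Each entry stores the error-correcting redundant code of the corresponding word of the main memory 14. For example, the error-correcting redundant code for word N of the main memory 14 is stored in the Nth entry of the redundant code memory 16.
[0025]
An error-correcting redundant code is written to the redundant code memory (CM) 16 in response to a bus transaction issued on the bus 10 to write data to the main memory 14. The redundant code is generated from the data output on the bus 10, and the entry of the redundant code memory (CM) 16 to which it is written is determined from the memory address output on the bus 10.
[0026]
The before-image buffer (BIB) 17 is used as a log memory that holds the update history information of the main memory 14 for the period from one checkpoint to the next. Each time data is written to the main memory 14, and before that write takes place, the address of the write location in the main memory 14, the pre-update data, and the error-correcting redundant code held in the redundant code memory 16 for that pre-update data are accumulated in the before-image buffer (BIB) 17 in stack form as update history information. The redundant code is stored together with the address and pre-update data so that, when the contents of the main memory 14 are restored from the update history information, the contents of the redundant code memory (CM) 16 can be restored along with them.
[0027]
When the main memory 14 has only the 1-bit error detection capability provided by parity, a 1-bit error-correcting code can be used as the redundant code stored in the redundant code memory 16.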
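[0028]
Next, the flow of data in the system of FIG. 1 is described in detail with reference to FIGS. 2 to 4.
First, the operation of writing an error-correcting redundant code to the redundant code memory (CM) 16 is described with reference to FIG. 2.
[0029]
As shown in FIG. 2, a main memory controller (MM controller) 13 is provided between the bus 10 and the main memory 14 and performs read/write control of the main memory 14. A BIB/CM controller 15 is provided in common between the bus 10 and both the redundant code memory (CM) 16 and the before-image buffer (BIB) 17, and performs read/write control of both.
[0030]
The following describes what happens when data from the CPU 11-1 is written to address N of the main memory 14.
The BIB/CM controller 15 monitors the transactions on the bus 10, that is, the commands, addresses, and data on the bus. When a transaction that writes data from the CPU 11-1 to the main memory 14 is issued on the bus 10, the BIB/CM controller 15 captures the memory address (N) and the data (Dold1) at that time. In practice, this bus transaction occurs when data is written back from the cache memory 12-1 to the main memory 14.
[0031]
Meanwhile, the MM controller 13 responds to the write transaction by writing the data (Dold1) to the location of word N specified by the address (N). The MM controller 13 generates the corresponding error detection bit (P) internally from the value of the data (Dold1), and the data (Dold1) is written to the main memory 14 with the error detection bit (P) attached.
[0032]
In the BIB/CM controller 15, a redundant code (Cold1) capable of correcting an error arising in part of the data (Dold1) is generated from the value of that data, for example by an ECC computation, and is written to entry (N) of the redundant code memory (CM) 16, which corresponds to the address (N).
[0033]
As noted above, the redundant code memory (CM) 16 has the same number of entries as the main memory 14 has words, and the words of the main memory 14 correspond one-to-one with the entries of the redundant code memory (CM) 16. Therefore, if the MM controller 13 detects a data error when the data (Dold1) is read from word N of the main memory 14, error-handling software or the like can reconstruct the correct data from the redundant code (Cold1) in entry (N) of the redundant code memory (CM) 16 and the data in which the error was detected, and thereby repair the data (Dold1) in word N of the main memory 14.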
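The write path and repair path of paragraphs [0030] to [0033] can be sketched as follows. This is a minimal illustration, not the circuit of the embodiment: the class and function names are hypothetical, words are assumed to be 32 bits wide with a single parity bit, and the redundant code is a simplified position-XOR code that corrects one flipped data bit on the assumption that the code memory entry itself is intact. The source only says the code is produced "by an ECC computation or the like".

```python
WORD_BITS = 32  # assumed word width

def parity(word: int) -> int:
    """Error detection bit P: 1 when the word has an odd number of 1 bits."""
    return bin(word).count("1") & 1

def redundant_code(word: int) -> int:
    """Simplified single-error-correcting code (an assumption for this sketch):
    XOR of (bit index + 1) over all set bits. Flipping data bit j changes
    the code by exactly j + 1, which identifies the bit."""
    code = 0
    for i in range(WORD_BITS):
        if (word >> i) & 1:
            code ^= i + 1
    return code

class MemorySystem:
    def __init__(self, words: int):
        self.main = [(0, 0)] * words  # main memory 14: (data, parity) per word
        self.cm = [0] * words         # redundant code memory 16: entry N for word N

    def write_back(self, addr: int, data: int) -> None:
        # The MM controller stores data plus parity; the BIB/CM controller
        # snoops the same transaction and stores the redundant code.
        self.main[addr] = (data, parity(data))
        self.cm[addr] = redundant_code(data)

    def read(self, addr: int) -> int:
        data, p = self.main[addr]
        if parity(data) != p:         # parity check fails: repair the word
            data = self.repair(addr)
        return data

    def repair(self, addr: int) -> int:
        bad, _ = self.main[addr]
        syndrome = redundant_code(bad) ^ self.cm[addr]
        # Syndrome 0 means the data is intact and only the parity bit was wrong.
        fixed = bad if syndrome == 0 else bad ^ (1 << (syndrome - 1))
        self.main[addr] = (fixed, parity(fixed))
        return fixed
```

For example, after `m = MemorySystem(8)` and `m.write_back(3, 0xDEADBEEF)`, flipping any single data bit of word 3 in `m.main` causes `m.read(3)` to detect the parity mismatch, locate the bit from the syndrome, and write the repaired word back.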
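[0034]
Next, the operation of writing update history information to the before-image buffer (BIB) 17 is described with reference to FIG. 3.
The example considered here is one in which the CPU 11-1 updates the data held at the location of word N of the main memory 14, specified by the address (N), from Dold1 to Dnew1.
[0035]
In this case, when Dnew1 is written to the cache memory 12-1, the BIB/CM controller 15 reads the data (Dold1) and its redundant code (Cold1) from the main memory 14 and the redundant code memory 16, respectively. Update history information consisting of the address (N), the pre-update data (Dold1), and the redundant code (Cold1) is then stored in the before-image buffer (BIB) 17.
[0036]
Next, the operation of restoring the contents of the main memory 14 using the update history information accumulated in the before-image buffer (BIB) 17 is described with reference to FIG. 4.
[0037]
When a failure occurs that requires the contents of the main memory 14 to be restored to the state before the failure, update history information is read from the before-image buffer (BIB) 17 one item at a time under the control of error-handling software or the like, and the pre-update data and redundant code are written back to the corresponding locations of the main memory 14 and the redundant code memory 16, respectively.
[0038]
For example, if the before-image buffer (BIB) 17 holds the four items of update history information shown in the figure, the fourth item (address N, pre-update data Dd, redundant code Cd) is written back first: the pre-update data Dd is written to address N of the main memory 14 and the redundant code Cd is written to entry N of the redundant code memory 16. The third item (address 2, pre-update data Dc, redundant code Cc) is written back next: the pre-update data Dc is written to address 2 of the main memory 14 and the redundant code Cc is written to entry 2 of the redundant code memory 16. The second and first items are then written back in the same way, in that order.
[0039]
In this way the main memory 14 is restored to the state before the failure, and the contents of the redundant code memory 16 are returned to the state that matches the restored contents of the main memory 14.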
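The stack-style logging and last-in, first-out write-back of paragraphs [0035] to [0039] can be sketched as below. The class and helper names are hypothetical, dictionaries stand in for the main memory 14 and the redundant code memory 16, and the redundant codes are treated as opaque values; none of this is taken from the embodiment's hardware.

```python
class BeforeImageBuffer:
    """Log memory: a stack of (address, pre-update data, pre-update code)."""

    def __init__(self):
        self.entries = []

    def log(self, addr, old_data, old_code):
        self.entries.append((addr, old_data, old_code))

    def rollback(self, main, cm):
        # Write entries back newest first, so the oldest before-image of an
        # address that was updated several times is the one that survives.
        while self.entries:
            addr, old_data, old_code = self.entries.pop()
            main[addr] = old_data
            cm[addr] = old_code

    def checkpoint(self):
        # A new checkpoint makes the accumulated history unnecessary.
        self.entries.clear()


main = {2: "Dc", 5: "Da"}   # stand-in for main memory 14
cm = {2: "Cc", 5: "Ca"}     # stand-in for redundant code memory 16
bib = BeforeImageBuffer()

def update(addr, new_data, new_code):
    bib.log(addr, main[addr], cm[addr])   # log before the write takes place
    main[addr], cm[addr] = new_data, new_code

update(5, "Da'", "Ca'")
update(2, "Dc'", "Cc'")
update(5, "Da''", "Ca''")   # same address updated a second time
bib.rollback(main, cm)      # main and cm return to their checkpoint contents
```

Logging the redundant code alongside the data is what lets the rollback leave the code memory consistent with the restored data without recomputing any codes.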
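[0040]
The description so far has dealt with word-unit memory access, but error-correcting redundant codes and update history information can be written in the same way when the main memory 14 is accessed in units of cache blocks. If a cache block consists of n words, the processing described above is simply repeated n times for each access.
[0041]
Next, a specific hardware configuration of the BIB/CM controller 15 is described with reference to FIG. 5.
As shown, the BIB/CM controller 15 consists of a bus interface control unit 101, a bus transaction response control unit 102, a bus transaction issue control unit 103, a buffer access controller 104, a state saving control unit 105, and a code memory controller 106.
[0042]
The bus interface control unit 101 is connected to the signal lines defined on the bus 10 and exchanges addresses, data, and status with the bus 10. As shown, the bus 10 defines an address/data bus (address/data) and a command line (command) used for data transfer, as well as status lines (shared, modified) used for cache control. The shared line indicates the status (shared clean) in which a copy of the memory data requested by a memory read transaction is held in a clean, shared state. The modified line indicates the status (modified) in which a copy of the memory data requested by a memory read transaction is held in a modified state.
[0043]
By monitoring the state of these signal lines on the bus 10 through the bus interface control unit 101, the BIB/CM controller 15 snoops cache status and bus transactions.
[0044]
The bus transaction response control unit 102 acts in response to specific bus transactions received through the bus interface control unit 101. For example, when a failure has occurred, it responds to a word write transaction issued on the bus 10 by any CPU by aborting that transaction.
[0045]
The bus transaction issue control unit 103 issues transactions such as memory reads and writes on the bus 10. For example, when the state of the signal lines on the bus 10, received through the bus interface control unit 101, shows that a write to a cache memory has taken place, it starts a transaction to read the pre-update data from the main memory 14.
[0046]
The state saving control unit 105 controls, among other things, the pointer value that specifies where update history information is saved in the before-image buffer (BIB) 17. Each time update history information is stored in the before-image buffer (BIB) 17, it increments the pointer value by 1. When the main memory 14 is being restored from the update history information in the before-image buffer (BIB) 17, it decrements the pointer value by 1 each time an item of update history information is read.
[0047]
The buffer access controller 104 controls data writes to and reads from the before-image buffer (BIB) 17 using the address lines (BIB address), data lines (BIB data), and read/write control lines (BIB RAS#, CAS#, WE#) provided between it and the before-image buffer (BIB) 17.
[0048]
The code memory controller 106 controls data writes to and reads from the redundant code memory (CM) 16 using the address lines (CM address), data lines (CM data), and read/write control lines (CM RAS#, CAS#, WE#) provided between it and the redundant code memory (CM) 16. For a write, the code memory controller 106 computes the redundant code from the data on the bus 10, received through the bus interface control unit 101, and writes it to the redundant code memory (CM) 16.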
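[0049]
Next, the specific operation of the system of FIG. 5 is described with reference to FIGS. 6 to 9.
The timing chart of FIG. 6 shows the sequence of operations performed when data is written back from any cache memory to the main memory 14.
[0050]
When data is written back from a cache memory to the main memory 14, that cache memory or its CPU issues a command (write-line) on the command line (command) indicating a cache line write-back, and outputs the memory address (A) on the address bus (address bus) and the write data (Dnew) on the data bus (data bus). If a cache block consists of four words, a burst transfer takes place and the data Dnew1 to Dnew4 are output consecutively on the data bus (data bus).
[0051]
The main memory controller 13 and the BIB/CM controller 15 act in response to this bus transaction.
The main memory controller 13 controls the address lines (MM address), data lines (MM data), and read/write control lines (MM RAS#, CAS#, WE#) provided between it and the main memory 14, and writes the data (Dnew1 to Dnew4) to the four consecutive locations of the main memory 14 starting at address (A).
[0052]
In the BIB/CM controller 15, meanwhile, the code memory controller 106 operates. It first computes the redundant codes (Cnew1 to Cnew4) from the data (Dnew1 to Dnew4) on the bus 10. It then outputs the redundant codes (Cnew1 to Cnew4) on the data lines (CM data) and outputs a row address (Ar) and column addresses (Ac1 to Ac4), generated from the address (A) received from the bus 10, on the address lines (CM address), so that the redundant codes (Cnew1 to Cnew4) are written to the entries of the redundant code memory 16 corresponding to the address (A).
[0053]
In this way, when data is written back from a cache memory to the main memory 14, the code memory controller 106 automatically writes the redundant codes to the redundant code memory 16 in parallel with the write-back.
[0054]
FIG. 7 shows the sequence of processing performed when any CPU writes to a shared cache line in its cache memory.
[0055]
When a write to a shared cache line takes place, an invalidate command (invalidate) is issued on the command line (command) of the bus 10 and the address (A) of the shared data is issued on the address bus (address bus), in order to notify the other cache memories that the shared data is being changed, and the invalidate protocol is executed. Under this protocol, the write to the shared cache line is held off until the other cache memories have invalidated their copies of the shared data.
[0056]
When the bus transaction issue control unit 103 of the BIB/CM controller 15 sees the invalidate command, it uses the address (A) at that time to start a memory read transaction that reads the pre-update data (D1 to D4) at address (A) from the main memory 14. The command issued on the command line (command) of the bus 10 for this purpose is a read non-snoop command, and the cache memories do not snoop that read cycle.
[0057]
In response to the memory read transaction, the main memory controller 13 controls the address lines (MM address), data lines (MM data), and read/write control lines (MM RAS#, CAS#, WE#), reads the data (D1 to D4) from address (A) of the main memory 14, and outputs it on the data bus (data bus) of the bus 10.
[0058]
In the BIB/CM controller 15, meanwhile, the address (A) is also passed to the buffer access controller 104 and the code memory controller 106. The code memory controller 106 outputs a row address (Ar) and column addresses (Ac1 to Ac4), generated from the address (A), on the address lines (CM address) and reads the redundant codes (C1 to C4) corresponding to the pre-update data (D1 to D4) from entry A of the redundant code memory 16.
[0059]
The buffer access controller 104 then assembles the address (A), the data (D1 to D4) output on the data bus (data bus) of the bus 10, and the redundant codes (C1 to C4) read by the code memory controller 106 into the storage format of update history information, and writes the result to the entry of the before-image buffer (BIB) 17 specified by the pointer value (P).
[0060]
In this way, update history information is written to the before-image buffer (BIB) 17 automatically by the bus transaction issue control unit 103, the buffer access controller 104, and the code memory controller 106 at the time data is written to a cache memory, that is, before the data is written back from the cache memory to the main memory 14.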
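[0061]
FIG. 8 shows the flow of recovery processing when a data error is detected in data read from the main memory 14.
The case described here is one in which execution can return correctly to the original instruction without restoring the contents of the main memory 14 to the checkpoint preceding the error.
[0062]
Suppose data (D) was written to the main memory 14 at some point and is later read back. If the data (D) has been replaced by an erroneous value (D') because of a memory error or the like, the main memory controller 13 detects the memory data error by checking the error detection code. The error is reported to a designated CPU by a hardware interrupt signal or the like, and that CPU executes an error interrupt routine.
[0063]
The CPU executing the error interrupt routine masks the error interrupt so that it does not occur again (step S10), then reads the data (D') stored at the address in the main memory 14 where the error occurred and reads the corresponding redundant code (C) from the redundant code memory 16 (steps S11, S12). The CPU then reconstructs the correct data (D) from the data (D') and the redundant code (C) (step S13) and stores the data (D) at the address in the main memory 14 where the error occurred (step S14).
[0064]
Because this recovery procedure does not use the before-image buffer (BIB) 17, it can also be applied to systems that do not adopt the checkpoint restart method based on the before-image buffer (BIB) 17.
[0065]
FIG. 9 shows a second example of recovery processing when a data error is detected in data read from the main memory 14.
The case assumed here is one in which data (D) was written to the main memory 14 before a checkpoint CP1 and, when it is first read from the main memory 14 after checkpoint CP1 has been taken, is found to have been replaced by an erroneous value (D') because of a memory error or the like.
[0066]
The memory error is detected when the main memory controller 13 checks the error detection code of the data value (D'), and is reported to a designated CPU by a hardware interrupt signal or the like. That CPU then executes a recovery routine.
[0067]
The CPU executing the recovery routine masks the error interrupt so that it does not occur again (step S20), then reads the data (D') stored at the address in the main memory 14 where the error occurred and reads the corresponding redundant code (C) from the redundant code memory 16 (steps S21, S22). The CPU then reconstructs the correct data (D) from the data (D') and the redundant code (C) (step S23) and stores the data (D) at the address in the main memory 14 where the error occurred (step S24).
[0068]
Next, the CPU controls the BIB/CM controller 15 to write the pre-update data in the before-image buffer (BIB) 17 back to the main memory 14 and the redundant codes back to the redundant code memory 16 (steps S25, S26). The process state captured at checkpoint CP1 is then restored to each CPU, and processing resumes from checkpoint CP1.
[0069]
Restoring the contents of the main memory 14 to the checkpoint preceding the failure only after the memory error has been corrected prevents the same failure from recurring through the erroneous memory data being read again. Processing can therefore continue even after a memory fault that the checkpoint restart method alone could not recover from.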
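[0070]
FIG. 10 shows the configuration of a computer system according to a second embodiment of the invention.
This computer system uses a vertical parity memory 21 in place of the redundant code memory 16 of the first embodiment. The vertical parity data used for error correction is managed not per word but per unit data block (cache block), the unit read or written by consecutive CPU accesses such as a burst transfer.
[0071]
The main memory 14 is again a memory with an error detection function, such as parity memory. A parity bit is added to the data string of each word, a word being the unit of data read or written in one memory access by a CPU.
[0072]
The vertical parity memory 21 is provided to add an error correction function to the main memory 14, which has only an error detection function, and has as many entries as the number of unit data blocks the main memory 14 can hold. Each entry stores vertical parity data computed, across the data strings belonging to the corresponding unit data block of the main memory 14, from the bits at the same bit position of each data string. For example, as shown in FIG. 11, if the unit data block of cache block N of the main memory 14 consists of data D0 to D3 of 4 bytes each, and 4 horizontal parity bits P0 to P3 are attached to D0 to D3 respectively, then entry N of the vertical parity memory 21 stores vertical parity data containing a 4-byte vertical parity Dp computed for each bit position of D0 to D3 and a 4-bit vertical parity Pp computed for each bit position of the horizontal parity bits P0 to P3.
[0073]
By dividing a unit data block into data units whose errors can be detected by the horizontal parity bits, and storing the vertical parity data computed over them in the vertical parity memory 21, the bit position in error within the data flagged by error detection can be determined from the vertical parity data, which makes error correction possible.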
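The correction principle of paragraphs [0072] and [0073] can be illustrated with a short sketch: horizontal parity identifies which word of the block is bad, and the vertical parity rebuilds it. The function names are hypothetical, and for brevity the sketch assumes one horizontal parity bit per 4-byte word instead of the 4 bits per word shown in FIG. 11.

```python
from functools import reduce
from operator import xor

def hparity(word: int) -> int:
    """Horizontal (per-word) parity bit: 1 when the word has an odd number of 1 bits."""
    return bin(word).count("1") & 1

def vertical_parity(words) -> int:
    """Bitwise XOR of all words, i.e. one parity bit per bit position (Dp)."""
    return reduce(xor, words, 0)

def repair_block(block, hpar, vpar):
    """Rebuild the single word whose horizontal parity check fails."""
    bad = [i for i, (w, p) in enumerate(zip(block, hpar)) if hparity(w) != p]
    if len(bad) != 1:
        raise ValueError("expected exactly one faulty word in the block")
    i = bad[0]
    fixed = list(block)
    # XOR of the intact words with the vertical parity yields the lost word.
    fixed[i] = vertical_parity(block[:i] + block[i + 1:]) ^ vpar
    return fixed

block = [0x11223344, 0x55667788, 0x99AABBCC, 0xDDEEFF00]  # D0..D3
hpar = [hparity(w) for w in block]                         # stored with the data
vpar = vertical_parity(block)                              # entry N of memory 21
damaged = list(block)
damaged[2] ^= 1 << 9                                       # one bit of D2 flips
repaired = repair_block(damaged, hpar, vpar)
```

XOR-ing the damaged word with the rebuilt word gives a mask with a 1 at exactly the failed bit position, which is the sense in which the vertical parity "locates" the error.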
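[0074]
Vertical parity data is written to the vertical parity memory 21 in response to a bus transaction issued on the bus 10 to write a cache line of a cache memory back to the main memory 14. The vertical parity data is generated from the unit data block of one cache line output consecutively on the bus 10, and the entry of the vertical parity memory 21 to which it is written is determined from the unit block address output on the bus 10.
[0075]
When a word-unit write updates only part of the data belonging to a unit data block of the main memory, the unit data block to be updated is read from the main memory 14, and new vertical parity data is obtained from the difference between that unit data block and the write data, together with the vertical parity data held in the vertical parity memory 21 for the unit data block that was read. The new vertical parity data is then written to the entry of the vertical parity memory 21 corresponding to the unit data block to which the write data belongs.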
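The partial update of paragraph [0075] amounts to folding the difference between the old and new word into the stored parity. A minimal sketch, assuming the "difference" is a bitwise exclusive OR (the source states this explicitly only for the third embodiment) and using a hypothetical function name:

```python
from functools import reduce
from operator import xor

def updated_vertical_parity(old_parity: int, old_word: int, new_word: int) -> int:
    """New parity = old parity XOR (old word XOR new word)."""
    return old_parity ^ (old_word ^ new_word)

block = [0x0000FFFF, 0x12345678, 0xCAFEBABE, 0x00000001]
parity = reduce(xor, block)          # vertical parity of the whole block

new_word = 0xFEEDFACE                # word-unit write to the second word
parity = updated_vertical_parity(parity, block[1], new_word)
block[1] = new_word
```

The update needs only the old word and the old parity, so the other words of the block do not have to be read again.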
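[0076]
As in the first embodiment, the before-image buffer (BIB) 17 is used as a log memory that holds the update history information of the main memory 14 for the period from one checkpoint to the next. Each time data is written to the main memory 14, and before that write takes place, the cache block address of the main memory 14 to which the write location belongs, the pre-update unit data block, and the vertical parity data corresponding to that pre-update unit data block are accumulated in the before-image buffer (BIB) 17 in stack form as update history information.
[0077]
When an error is detected in data read from the main memory 14, the bit position in error is identified from the error detection result given by the horizontal parity of the main memory 14 and from the vertical parity data, and the correct data is reconstructed. The reconstructed data is then written back to the main memory 14.
[0078]
In this second embodiment, generation of the vertical parity data and read/write control of update history information for the before-image buffer (BIB) 17 are implemented by the same hardware as in the first embodiment described with FIG. 5. With the redundant code memory 16 in the system of FIG. 5 replaced by the vertical parity memory 21, the operation is as follows. When data is written back from a cache memory to the main memory 14, the code memory controller 106 automatically writes the vertical parity data to the vertical parity memory 21 in parallel with the write-back. Update history information is likewise written to the before-image buffer (BIB) 17 automatically by the bus transaction issue control unit 103, the buffer access controller 104, and the code memory controller 106 at the time data is written to a cache memory, that is, before the data is written back from the cache memory to the main memory 14.
[0079]
Failure recovery in the second embodiment can also follow the same procedures as in the first embodiment described with FIGS. 8 and 9. When the contents of the main memory 14 are to be restored to the state of the checkpoint preceding the failure, the correct data is first reconstructed using the vertical parity data. Update history information is then read from the before-image buffer (BIB) 17 one item at a time, and the pre-update unit data block and vertical parity data are written back to the corresponding locations of the main memory 14 and the vertical parity memory 21, respectively.
[0080]
The description above assumed that the main memory data carries parity and supports 1-bit error detection, but the same configuration is possible when an SEC-DED code is used. In that case, when a 2-bit error is detected during a main memory read, the correct data can be reconstructed by the same method and the failure recovered.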
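[0081]
FIG. 12 shows the configuration of a computer system according to a third embodiment of the invention.
This computer system uses a block parity memory 22 in place of the redundant code memory 16 of the first embodiment. The block parity data used for error correction is managed not per word but per data block group, a group being a set of four unit data blocks (cache blocks), each of which is the unit read or written by consecutive CPU accesses such as a burst transfer.
[0082]
The main memory 14 is again a memory with an error detection function, such as parity memory. A parity bit is added to the data string of each word, a word being the unit of data read or written in one memory access by a CPU.
[0083]
The block parity memory 22 is provided to add an error correction function to the main memory 14, which has only an error detection function, and has as many entries as the number of data block groups the main memory 14 can hold. Each entry stores vertical parity data computed, across the data blocks belonging to the corresponding data block group of the main memory 14, from the bits at the same bit position of each block.
[0084]
By dividing a data block group into unit data blocks, each of which can be read or written in one cache line operation, and storing the vertical parity data computed over them in the block parity memory 22, the unit data within a unit data block in which an error has occurred can be detected by the horizontal parity of the main memory 14, and the bit position in error can be determined from the block parity data, which makes error correction possible.
[0085]
Block parity data is written to the block parity memory 22 when it is detected that a bus transaction has been issued on the bus 10 to write a cache line of a cache memory back to the main memory 14. The data block that will be updated by the unit data block of one cache line output consecutively on the bus 10 is read from the main memory 14. New block parity data is generated from the difference (exclusive OR) between that data block and the unit data block being written, together with the block parity data held in the block parity memory 22 for the data block group. The new block parity data is then written to the entry of the block parity memory 22 corresponding to the data block group to which the written unit data block belongs.
[0086]
As in the first embodiment, the before-image buffer (BIB) 17 is used as a log memory that holds the update history information of the main memory 14 for the period from one checkpoint to the next. Each time data is written to the main memory 14, and before that write takes place, the address of the data block group of the main memory 14 to which the write location belongs, the pre-update data block group, and the block parity data corresponding to that pre-update data block group are accumulated in the before-image buffer (BIB) 17 in stack form as update history information.
[0087]
In this case, the pre-update data block and block parity data needed to generate the new block parity data are read anyway in order to be stored in the before-image buffer (BIB) as update history information. The two reads can therefore be shared, and the control can be arranged so that each item is accessed only once.
[0088]
When an error is detected in data read from the main memory 14, the bit position in error is identified from the error detection result given by the horizontal parity of the main memory 14 and from the block parity data, and the correct data is reconstructed. The reconstructed data is then written back to the main memory 14. Specifically, all unit data blocks of the data block group to which the erroneous data belongs are read from the main memory 14, and the correct data block group is regenerated from them and the corresponding block parity data.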
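The regeneration step of paragraph [0088] can be sketched as follows for the wide-range case, where an entire unit data block of a group is lost. The function names and the sample values are hypothetical; the sketch assumes the lost block has already been identified (in the embodiment, by the horizontal parity of the main memory 14) and models each unit data block as a list of four words.

```python
from functools import reduce
from operator import xor

def block_parity(group):
    """Word-by-word XOR across the unit data blocks of one data block group."""
    return [reduce(xor, column) for column in zip(*group)]

def rebuild_block(group, lost, bparity):
    """Regenerate block `lost` from the surviving blocks and the block parity."""
    survivors = [blk for i, blk in enumerate(group) if i != lost]
    return [reduce(xor, column, p)
            for column, p in zip(zip(*survivors), bparity)]

group = [                      # four unit data blocks of four words each
    [0x01, 0x02, 0x03, 0x04],
    [0x10, 0x20, 0x30, 0x40],
    [0xAA, 0xBB, 0xCC, 0xDD],
    [0xFF, 0x00, 0xFF, 0x00],
]
bparity = block_parity(group)  # entry of block parity memory 22 for this group
original = list(group[1])

group[1] = [0, 0, 0, 0]        # e.g. the module holding block 1 fails
recovered = rebuild_block(group, 1, bparity)
```

Compared with per-block vertical parity, one parity entry here covers four blocks, which reduces the added storage but requires reading the whole group to repair one block.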
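[0089]
In this third embodiment, generation of the block parity data and read/write control of update history information for the before-image buffer (BIB) 17 are implemented by the same hardware as in the first embodiment described with FIG. 5. With the redundant code memory 16 in the system of FIG. 5 replaced by the block parity memory 22, the operation is as follows. When data is written back from a cache memory to the main memory 14, the code memory controller 106 automatically writes the block parity data to the block parity memory 22 in parallel with the write-back. Update history information is likewise written to the before-image buffer (BIB) 17 automatically by the bus transaction issue control unit 103, the buffer access controller 104, and the code memory controller 106 at the time data is written to a cache memory, that is, before the data is written back from the cache memory to the main memory 14.
[0090]
Failure recovery in the third embodiment can also follow the same procedures as in the first embodiment described with FIGS. 8 and 9. When the contents of the main memory 14 are to be restored to the state of the checkpoint preceding the failure, the correct data is first reconstructed using the block parity data. Update history information is then read from the before-image buffer (BIB) 17 one item at a time, and the pre-update data block group and block parity data are written back to the corresponding locations of the main memory 14 and the block parity memory 22, respectively.
[0091]
The description above assumed that the main memory data carries parity and supports 1-bit error detection, but the same configuration is possible when an SEC-DED code is used. In that case, when a 2-bit error is detected during a main memory read, the correct data can be reconstructed by the same method and the failure recovered.
[0092]
In every embodiment described above, the pre-update data and the corresponding error-correcting redundant code (ECC, vertical parity, or block parity) are written to the before-image buffer (BIB) 17 at the same time. The redundant code may instead be logged when the memory that stores it, namely the redundant code memory 16, the vertical parity memory 21, or the block parity memory 22, is updated. In that case, the redundant code about to be overwritten by the new redundant code is read from the redundant code memory 16, the vertical parity memory 21, or the block parity memory 22 and written to the before-image buffer (BIB) 17.
[0093]
[Effects of the invention]
As described above, the present invention allows a memory subsystem with an error correction function to be built with additional hardware while existing resources such as parity memory remain in use as is, and thus realizes a computer system that is highly reliable against memory faults. Processing can also continue after a memory fault that the checkpoint restart method using a log memory cannot recover from, so sufficient fault tolerance is achieved with little hardware and without duplicating the memory.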
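[Brief description of the drawings]
[FIG. 1] Block diagram showing the configuration of a computer system according to the first embodiment of the invention.
[FIG. 2] Diagram explaining the operation of writing an error-correcting code to the redundant code memory in the system of the first embodiment.
[FIG. 3] Diagram explaining the operation of writing update history information to the BIB memory in the system of the first embodiment.
[FIG. 4] Diagram explaining the operation of restoring the main memory and the redundant code memory in the system of the first embodiment.
[FIG. 5] Block diagram showing the specific hardware configuration adopted in the system of the first embodiment.
[FIG. 6] Timing chart explaining the sequence of operations performed in a write-back from a cache to the main memory in the system of FIG. 5.
[FIG. 7] Timing chart explaining the sequence of operations performed in a write to a shared line in a cache in the system of FIG. 5.
[FIG. 8] Flowchart explaining the first failure recovery procedure executed in the system of FIG. 5.
[FIG. 9] Flowchart explaining the second failure recovery procedure executed in the system of FIG. 5.
[FIG. 10] Block diagram showing the configuration of a computer system according to the second embodiment of the invention.
[FIG. 11] Diagram explaining the principle of generating vertical parity data in the system of the second embodiment.
[FIG. 12] Block diagram showing the configuration of a computer system according to the third embodiment of the invention.
[Explanation of reference signs]
10: processor bus; 11-1 to 11-n: CPU; 12-1 to 12-n: cache memory; 13: main memory controller; 14: main memory; 15: BIB/CM controller; 16: redundant code memory (CM); 17: before-image buffer (BIB); 21: vertical parity memory; 22: block parity memory; 101: bus interface control unit; 102: bus transaction response control unit; 103: bus transaction issue control unit; 104: buffer access controller; 105: state saving control unit; 106: code memory controller.
[0001]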
BACKGROUND OF THE INVENTION
The present invention relates to a computer system and a failure recovery method thereof, and more particularly to an improved computer system and a failure recovery method thereof that can recover from a failure caused by a fixed failure of a main memory.
[0002]
[Prior art]
In general, in a computer system, a memory with parity is employed in order to increase the reliability against a memory failure. In a memory subsystem having a memory with parity, a checksum of the data is calculated when the data is read, and a memory error is detected by comparing the checksum with a parity bit. Thereby, the use of incorrect memory data can be prevented beforehand.
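As a small illustration of the parity check described above, the sketch below stores one even-parity bit per word and checks it on read. The function names are hypothetical and the word is modelled as a plain integer; this is a sketch of the general technique, not of any particular memory controller.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 when the word has an odd number of 1 bits."""
    return bin(word).count("1") % 2

def write_word(word: int):
    """Return what a parity memory stores: the word and its parity bit."""
    return word, parity_bit(word)

def read_ok(stored) -> bool:
    """Recompute the parity on read and compare it with the stored bit."""
    word, p = stored
    return parity_bit(word) == p

stored = write_word(0b1011_0010)
one_flip = (stored[0] ^ 0b0000_1000, stored[1])   # a single bit flips in memory
two_flips = (stored[0] ^ 0b0000_0011, stored[1])  # two bits flip in memory
```

`read_ok(one_flip)` is false whichever bit was flipped, so the check detects the error without indicating its position; `read_ok(two_flips)` is true, so a double error passes unnoticed.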
[0003]
However, error correction cannot be performed because it is impossible to specify which bit of memory data has an error only by using parity.
On the other hand, when a memory subsystem using a redundant code such as an SEC-DED code is used, 1-bit error correction and 2-bit error detection are possible. That is, even if one bit of data is permanently in error, this can be corrected automatically and processing can be continued. Therefore, in a computer system that requires high reliability, it is desirable to employ a memory subsystem using a redundant code such as a SEC-DED code rather than a memory with parity.
[0004]
However, when a memory subsystem using a redundant code such as a SEC-DED code is introduced into an existing computer system that employs a memory with parity, the existing memory with parity cannot be used as it is. It is necessary to newly reconstruct the main memory to correspond to the SEC-DED code. Therefore, a lot of cost is required for the introduction.
[0005]
Fault-tolerant computer systems, on the other hand, employ a configuration in which all memory failures are masked by duplicating the memory. With this duplicated memory configuration, the same data is always held in two memories, so if a data error is detected, processing can continue using the data in the other memory.
[0006]
However, there are disadvantages such as a large amount of hardware because it is necessary to duplicate the memory, and a special structure is required for switching the access memory when an error is detected.
[0007]
For this reason, a checkpoint restart method using a log memory that stores update history information of the main memory has recently been proposed as a way to recover from general failures with additional hardware and without duplicating the memory. In this method, the information needed to re-execute a process is saved in the main memory at each checkpoint. During the period from one checkpoint to the next, each time the main memory is updated by process execution, the pre-update data and related information are collected in the log memory as the update history information. When a failure occurs in the computer system, the contents of the log memory are used to restore the main memory to the checkpoint preceding the failure. The checkpoint restart method using a log memory can therefore restore the memory contents with little hardware and without duplicated memory.
[0008]
With this method, however, a failure in which the memory contents have been permanently altered may be unrecoverable even if it can be detected. If the value of the memory data was altered before the most recent checkpoint, returning to that checkpoint and resuming processing when the failure is detected simply causes the erroneous memory data to be read again, so recovery is impossible.
[0009]
[Problems to be solved by the invention]
As described above, when a memory subsystem using a redundant code such as a SEC-DED code is introduced into an existing computer system employing a memory with parity, the existing memory with parity cannot be used as it is. However, it is necessary to newly reconstruct a large-capacity main memory so as to correspond to the SEC-DED code, and there is a disadvantage that a lot of cost is required for its introduction.
[0010]
The checkpoint restart method using a log memory that stores update history information of the main memory is available as a way to recover from general failures with additional hardware and without duplicating the memory, but it sometimes cannot cope with a memory failure in which data is permanently altered.
[0011]
The present invention has been made in view of these points. Its object is to allow a memory subsystem with an error correction function to be built with additional hardware while existing resources such as memory with parity remain in use as is, and thereby to provide a computer system that is highly reliable against memory failures.
[0012]
A further object of the present invention is to provide a computer system and a failure recovery method that can continue processing even when a memory failure occurs that the checkpoint restart method using a log memory cannot recover from, achieving sufficient fault tolerance with little hardware and without duplicating the memory.
[0013]
[Means for Solving the Problems]
The present invention is a computer system having one or more CPUs and a main memory with parity that is connected to the CPUs via a bus and stores data to which parity has been added, the computer system comprising: a cache memory; a memory controller that is provided between the bus and the main memory and controls the main memory; a redundant code memory that has a plurality of storage areas, one for each address serving as the unit of read/write access to the main memory, each storage area holding a redundant code capable of correcting an error arising in part of the data stored at the corresponding address of the main memory; a control device that is provided between the bus and the redundant code memory and controls the redundant code memory, the control device monitoring bus transactions issued on the bus and, when a write-back from the cache memory to the main memory is executed, generating the redundant code for the data from the value of the data on the bus and storing that redundant code in the storage area of the redundant code memory corresponding to the write address of the data; and error correction means that, when a data error in data read from the main memory is detected by the memory controller's parity check, reconstructs the correct data from the read data and the redundant code stored in the redundant code memory for that read data.
[0014]
This computer system is provided with a redundant code memory and a control device that controls it. When monitoring of the bus transactions issued on the bus detects that a write-back from the cache memory to the main memory is being executed, the control device automatically generates the redundant code for the data from the value of the data on the bus and stores it in the corresponding storage area of the redundant code memory. When a data error in data read from the main memory is detected by the memory controller's parity check, the correct data is reconstructed from the read data and its redundant code. Because the correct data can be reconstructed from the erroneous data and its redundant code whenever an error is detected in main memory data, a memory subsystem with an error correction function can be built simply by adding the redundant code memory and the control device as additional hardware, and a computer system that is highly reliable against memory failures can be realized.
[0015]
It is preferable that the system further comprise a log memory that is connected to the control means and stores update history information of the main memory, and that the control means be configured as follows. Before a data write to the main memory by the CPU is executed, the control means reads the pre-update data of the main memory at the address to be written and the redundant code corresponding to that pre-update data from the main memory and the redundant code memory, respectively, and stores them in the log memory as the update history information. When a failure occurs that requires the main memory contents to be restored to the state before the failure, the control means writes the pre-update data and redundant code of each item of update history information in the log memory back to the main memory and the redundant code memory, respectively, restoring the main memory to the state before the failure and returning the redundant code memory to the state that matches the restored main memory contents.
[0016]
With this configuration, the contents of the main memory can be restored to the state before the failure using the log memory, and the correct data can be reconstructed using the redundant code even when a memory failure occurs. By reconstructing the correct data with the redundant code and then resuming processing from the checkpoint preceding the failure, processing can continue even after a memory failure that the checkpoint restart method alone could not recover from.
[0017]
Because the redundant code corresponding to the pre-update data is stored in the log memory together with the pre-update data, when a failure occurs that requires the main memory contents to be restored to the state before the failure, writing the pre-update data and redundant code of each item of update history information back to the main memory and the redundant code memory restores the main memory to the state before the failure and also returns the redundant code memory to the state that matches the restored main memory contents.
[0018]
Update history information containing the pre-update data must be stored in the log memory before the data write to the main memory is executed. In a system having a cache memory, this is easily achieved: when the CPU writes data to the cache memory, the pre-update data of the main memory at the written address and the corresponding redundant code are read from the main memory and the redundant code memory, respectively, and stored in the log memory as update history information. The update history information is thus stored before the data is written to the main memory.
[0019]
Further, a vertical parity memory may be used instead of the redundant code memory described above. The vertical parity memory has a plurality of storage areas, one for each unit data block on the main memory, where a unit data block consists of a plurality of data strings that are accessed consecutively. Each storage area holds vertical parity data computed over the same bit position of the data strings belonging to the corresponding unit data block. Correct block data can then be reproduced by simple parity processing.
[0020]
Furthermore, a block parity memory may be used instead of the vertical parity memory. The block parity memory has a plurality of storage areas, one for each data block group on the main memory, where a data block group consists of a plurality of unit data blocks, each made up of a plurality of data strings. Each storage area holds block parity data, that is, vertical parity data computed over the same bit position of the unit data blocks belonging to the corresponding data block group. As a result, correct block data can be reproduced even when a memory failure affects a wide range, such as an entire memory module.
[0021]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below with reference to the drawings.
FIG. 1 conceptually shows the configuration of a computer system according to the first embodiment of the present invention. This computer system is a multiprocessor system that employs a failure recovery method in which information necessary for failure recovery is stored in the main memory at each checkpoint and, when a failure occurs, the update history information of the main memory held in the log memory is used to restore the contents of the main memory to the checkpoint preceding the failure. As shown, the system comprises a processor bus 10, CPUs 11-1 to 11-n, cache memories 12-1 to 12-n, a main memory 14, a redundant code memory (CM) 16, and a before image buffer (BIB) 17.
[0022]
The cache memories 12-1 to 12-n are used as primary or secondary caches of the CPUs 11-1 to 11-n, which share the main memory 14. When a checkpoint is taken, the data in each of the cache memories 12-1 to 12-n that has not yet been reflected in the main memory 14 is written to the main memory 14.
[0023]
The main memory 14 is a memory having an error detection function, such as a memory with parity. A parity bit is added to the data string of each word, the word being the unit of data read or written by the CPU in a single memory access.
[0024]
The redundant code memory (CM) 16 is provided to add an error correction function to the main memory 14 having an error detection function, and has entries for the number of words of the main memory 14. Each entry stores an error correction redundancy code of a corresponding word in the main memory 14. For example, the error correction redundant code for the word N of the main memory 14 is stored in the Nth entry of the redundant code memory 16.
[0025]
The error correction redundant code is written to the redundant code memory (CM) 16 in response to a bus transaction, issued on the bus 10, for writing data to the main memory 14. In this case, the error correction redundant code is generated from the data output on the bus 10, and the entry position of the redundant code memory (CM) 16 to which it is written is determined from the memory address output on the bus 10.
[0026]
The before image buffer (BIB) 17 is used as a log memory that holds update history information of the main memory 14 for the period from one checkpoint to the next. Each time data is written to the main memory 14, the address of the write, the pre-update data, and the error correction redundant code in the redundant code memory 16 corresponding to the pre-update data are stored, before the write takes place, in the before image buffer (BIB) 17 in stack form as update history information. The error correction redundant code is stored together with the address and the pre-update data so that, when the contents of the main memory 14 are restored from the update history information in the before image buffer (BIB) 17, the contents of the redundant code memory (CM) 16 can be restored along with them.
[0027]
When the main memory 14 has only a 1-bit error detection capability based on parity, a 1-bit error correction code can be used as an error correction redundancy code stored in the redundancy code memory 16.
[0028]
Next, the data flow in the system of FIG. 1 will be described in detail with reference to FIGS.
First, with reference to FIG. 2, the error correction redundant code writing operation to the redundant code memory (CM) 16 will be specifically described.
[0029]
As shown in FIG. 2, a main memory controller (MM controller) 13 is provided between the bus 10 and the main memory 14, and read/write control of the main memory 14 is executed by the main memory controller (MM controller) 13. A BIB / CM controller 15 is provided in common between the bus 10 and both the redundant code memory (CM) 16 and the before image buffer (BIB) 17, and read/write control of the redundant code memory (CM) 16 and of the before image buffer (BIB) 17 is executed by the BIB / CM controller 15.
[0030]
Hereinafter, an operation performed when data from the CPU 11-1 is written to an address N of the main memory 14 will be described.
Transactions on the bus 10, that is, various commands, addresses and data on the bus 10 are monitored by the BIB / CM controller 15, and a transaction for writing data to the main memory 14 is issued on the bus 10 from the CPU 11-1. Then, the memory address (N) and data (Dold1) at that time are acquired by the BIB / CM controller 15. This bus transaction is actually performed when data is written back from the cache memory 12-1 to the main memory 14.
[0031]
Meanwhile, in response to the transaction for writing data to the main memory 14, the MM controller 13 writes the data (Dold1) to the word N specified by the address (N). In this case, an error detection bit (P) corresponding to the value of the data (Dold1) is generated inside the MM controller 13, and the data (Dold1) is written to the main memory 14 with the error detection bit (P) added.
[0032]
In the BIB / CM controller 15, a redundant code (Cold1) capable of correcting an error occurring in part of the value of the data (Dold1) is generated by an ECC operation or the like, and the redundant code (Cold1) is written to the entry (N) of the redundant code memory (CM) 16 corresponding to the address (N).
[0033]
As described above, the redundant code memory (CM) 16 has the same number of entries as the number of words in the main memory 14, and the words of the main memory 14 correspond one-to-one with the entries of the redundant code memory (CM) 16. Therefore, when the MM controller 13 detects a data error while reading the data (Dold1) from the word N of the main memory 14, error processing software or the like can restore the data (Dold1) of the word N by reconstructing the correct data from the redundant code (Cold1) in the entry (N) of the redundant code memory (CM) 16 and the data in which the error was detected.
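The correction step described above can be sketched as follows. This is only an illustration under stated assumptions, not the circuit of the embodiment: the description leaves the code unspecified ("an ECC operation or the like"), so the sketch assumes a 32-bit word and a simple single-error-locating code formed by XORing the 1-based positions of all set bits. The names `position_code`, `word_parity`, and `correct_word` are hypothetical, and an error in the stored code itself is not handled.

```python
WIDTH = 32  # assumed word width; the description does not fix one


def position_code(word: int) -> int:
    """Redundant code (assumed form): XOR of the 1-based positions of set bits."""
    code = 0
    for bit in range(WIDTH):
        if (word >> bit) & 1:
            code ^= bit + 1
    return code


def word_parity(word: int) -> int:
    """Error detection bit (P) kept with the word in the main memory."""
    return bin(word).count("1") & 1


def correct_word(read_word: int, stored_code: int) -> int:
    """Rebuild the original word from the erroneous data and its CM entry."""
    syndrome = position_code(read_word) ^ stored_code
    if syndrome == 0:
        return read_word                      # nothing to correct
    if syndrome > WIDTH:
        raise ValueError("not a single-bit data error")
    return read_word ^ (1 << (syndrome - 1))  # flip the located bit back


d_old1 = 0xDEADBEEF
c_old1 = position_code(d_old1)   # written to entry N of the code memory
p = word_parity(d_old1)          # written with the word to the main memory

damaged = d_old1 ^ (1 << 13)     # one bit of word N becomes corrupted
parity_error = word_parity(damaged) != p      # detection by parity
restored = correct_word(damaged, c_old1)      # correction by the code
```

Parity alone only reports that word N is wrong; the separately stored code supplies the bit position, which is why the two memories together give single-bit correction.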
[0034]
Next, with reference to FIG. 3, an operation of writing update history information to the before image buffer (BIB) 17 will be described.
Here, a case where the CPU 11-1 updates the data written in the address of the word N designated by the address (N) of the main memory 14 from Dold1 to Dnew1 will be described as an example.
[0035]
In this case, when Dnew1 is written into the cache memory 12-1, the BIB / CM controller 15 reads the data (Dold1) and the corresponding redundant code (Cold1) from the main memory 14 and the redundant code memory 16, respectively. Then, update history information including an address (N), pre-update data (Dold1), and a redundant code (Cold1) is stored in the before image buffer (BIB) 17.
[0036]
Next, an operation for restoring the contents of the main memory 14 using the update history information stored in the before image buffer (BIB) 17 will be described with reference to FIG.
[0037]
When a failure that requires restoring the contents of the main memory 14 to the state before the failure occurs, update history information is sequentially read from the before image buffer (BIB) 17 under the control of error processing software or the like. Then, the process of writing the pre-update data and the redundant code back to the corresponding storage locations in the main memory 14 and the redundant code memory 16 is performed.
[0038]
For example, when four pieces of update history information are stored in the before image buffer (BIB) 17 as shown in the figure, the fourth piece (address N, pre-update data Dd, redundant code Cd) is written back first: the pre-update data Dd is written to address N of the main memory 14, and the redundant code Cd is written to entry N of the redundant code memory 16. Next, the third piece (address 2, pre-update data Dc, redundant code Cc) is written back: the pre-update data Dc is written to address 2 of the main memory 14, and the redundant code Cc is written to entry 2 of the redundant code memory 16. Thereafter, the second and first pieces of update history information are written back in turn in the same way.
[0039]
In this way, the main memory 14 can be restored to the state before the failure, and the contents of the redundant code memory 16 are also restored to the state corresponding to the restored contents of the main memory 14.
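The logging and write-back order described above can be sketched as a small software model. It is a simplification under stated assumptions: the class name `BeforeImageBuffer` and its methods are hypothetical, the cache is omitted (the before-image is logged immediately before each memory write), clearing the log at a checkpoint is assumed, and `make_code` is a placeholder rather than a real error correction code.

```python
class BeforeImageBuffer:
    """Toy model of main memory 14, code memory 16 and BIB 17."""

    def __init__(self, words: int) -> None:
        self.main = [0] * words   # main memory, one value per word
        self.cm = [0] * words     # redundant code memory, one entry per word
        self.bib = []             # stack of (address, old data, old code)

    @staticmethod
    def make_code(data: int) -> int:
        # Placeholder redundant code (assumption); any per-word code fits here.
        return data % 251

    def write(self, addr: int, data: int) -> None:
        # Log the before-image first, then update the word and its code.
        self.bib.append((addr, self.main[addr], self.cm[addr]))
        self.main[addr] = data
        self.cm[addr] = self.make_code(data)

    def checkpoint(self) -> None:
        # Assumed behaviour: history older than the checkpoint is discarded.
        self.bib.clear()

    def rollback(self) -> None:
        # Pop newest-first so the oldest before-image of an address wins.
        while self.bib:
            addr, old_data, old_code = self.bib.pop()
            self.main[addr] = old_data
            self.cm[addr] = old_code


mem = BeforeImageBuffer(8)
mem.write(2, 0xAA)
mem.checkpoint()
saved = (list(mem.main), list(mem.cm))   # state at the checkpoint

mem.write(2, 0xBB)
mem.write(5, 0x11)
mem.write(2, 0xCC)                       # address 2 is updated twice
mem.rollback()
```

Address 2 is written twice after the checkpoint, so the log holds two before-images for it. Writing back in last-in, first-out order applies the older one last, which is what returns both memories to the checkpoint state.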
[0040]
The above description covers memory access in units of words. However, when access to the main memory 14 is performed in units of cache blocks, the error correction redundant code and the update history information can be written in the same manner. That is, if a cache block consists of n words, the process described above is repeated n times for one access.
[0041]
Next, a specific hardware configuration of the BIB / CM controller 15 will be described with reference to FIG.
As shown in the figure, the BIB / CM controller 15 comprises a bus interface control unit 101, a bus transaction response control unit 102, a bus transaction issue control unit 103, a buffer access controller 104, a state storage control unit 105, and a code memory controller 106.
[0042]
The bus interface control unit 101 is connected to the various signal lines defined on the bus 10, and exchanges addresses, data, and various statuses with the bus 10. As shown in the figure, the bus 10 defines an address/data bus (address / data) used for data transfer, a command line, status lines for cache control (shared, modified), and the like. The shared line indicates a status (shared clean) in which a copy of the memory data requested by a memory read transaction is shared in a clean state. The modified line indicates a status (modified) in which a copy of the memory data requested by a memory read transaction is held in a modified state.
[0043]
By monitoring the state of these various signal lines on the bus 10 through the bus interface control unit 101, the BIB / CM controller 15 snoops the cache status and the bus transaction.
[0044]
The bus transaction response control unit 102 operates in response to predetermined bus transactions received via the bus interface control unit 101. For example, when a failure occurs, it responds to a word write transaction issued on the bus 10 by a CPU by aborting the transaction.
[0045]
The bus transaction issue control unit 103 issues transactions such as memory reads and writes on the bus 10. For example, when a write to a cache memory is detected from the state of the signal lines on the bus 10 received via the bus interface control unit 101, it starts a transaction for reading the pre-update data from the main memory 14.
[0046]
The state storage control unit 105 controls the pointer value that designates the position at which update history information is stored in the before image buffer (BIB) 17, incrementing the pointer value by 1 each time update history information is stored in the before image buffer (BIB) 17. When the main memory 14 is restored using the update history information in the before image buffer (BIB) 17, the state storage control unit 105 decrements the pointer value by 1 from its current value each time update history information is read.
[0047]
The buffer access controller 104 controls the address line (BIB address), the data line (BIB data), and the read/write control lines (BIB RAS #, CAS #, WE #) provided between it and the before image buffer (BIB) 17, and thereby controls the writing of data to and the reading of data from the before image buffer (BIB) 17.
[0048]
The code memory controller 106 controls the address line (CM address), the data line (CM data), and the read/write control lines (CM RAS #, CAS #, WE #) provided between it and the redundant code memory (CM) 16, and thereby controls the writing of data to and the reading of data from the redundant code memory (CM) 16. In the writing process, the code memory controller 106 generates a redundant code by calculation from the data on the bus 10 received via the bus interface control unit 101, and writes it to the redundant code memory (CM) 16.
[0049]
Next, specific operations of the system of FIG. 5 will be described with reference to FIGS.
The timing chart of FIG. 6 shows a series of operations executed when data is written back from an arbitrary cache memory to the main memory 14.
[0050]
When data is written back from the cache memory to the main memory 14, a command (write-line) indicating write-back of a cache line is issued on the command line (COMMAND) by the cache memory or its corresponding CPU, the memory address (A) is output on the address bus (address bus), and the write data (Dnew) is output on the data bus (data bus). When a cache block consists of 4 words, burst transfer is performed, and the data Dnew1 to Dnew4 are output consecutively on the data bus (data bus).
[0051]
In response to this bus transaction, the main memory controller 13 and the BIB / CM controller 15 operate.
The main memory controller 13 controls the address line (MM address), the data line (MM data), and the read/write control lines (MM RAS #, CAS #, WE #) provided between it and the main memory 14, and writes the data (Dnew1 to Dnew4) to four consecutive addresses starting from the address (A) in the main memory 14.
[0052]
Meanwhile, in the BIB / CM controller 15, the code memory controller 106 operates. It first generates, by calculation, the redundant codes (Cnew1 to Cnew4) corresponding to the data (Dnew1 to Dnew4) on the bus 10. The redundant codes (Cnew1 to Cnew4) are output on the data line (CM data), the row address (Ar) and column addresses (Ac1 to Ac4) generated from the address (A) received from the bus 10 are output on the address line (CM address), and the redundant codes (Cnew1 to Cnew4) are written to the entries of the redundant code memory 16 corresponding to the address (A).
[0053]
As described above, the redundant code is written to the redundant code memory 16 automatically by the code memory controller 106, in parallel with the write-back processing, when data is written back from the cache memory to the main memory 14.
[0054]
FIG. 7 shows a series of processing procedures executed when an arbitrary CPU writes to a shared cache line in the corresponding cache memory.
[0055]
When a write to a shared cache line is performed, an invalidate command (invalidate) is issued on the command line (command) of the bus 10, and the address (A) of the shared data is issued on the address bus (address bus), in order to notify the other cache memories that the shared data is being changed. The invalidation protocol is then executed. In this invalidation protocol, the write to the shared cache line waits until the other cache memories have invalidated their copies of the shared data.
[0056]
When the bus transaction issuance control unit 103 of the BIB / CM controller 15 confirms the invalidate command, the pre-update data (D1 to D4) of the address (A) is obtained from the main memory 14 using the address (A) at that time. Start a memory read transaction for reading. At this time, the command issued on the command line (command) on the bus 10 is read non-snoop, and each cache memory does not perform the snoop operation for the read cycle.
[0057]
In response to the memory read transaction, the main memory controller 13 controls the address line (MM address), the data line (MM data), and the read/write control lines (MM RAS #, CAS #, WE #), reads the data (D1 to D4) from the address (A) of the main memory 14, and outputs them on the data bus (data bus) of the bus 10.
[0058]
On the other hand, in the BIB / CM controller 15, the address (A) is also passed to the buffer access controller 104 and the code memory controller 106. The code memory controller 106 outputs the row address (Ar) and the column addresses (Ac1 to Ac4) generated from the address (A) on the address bus (CM address), and before updating from the entry A of the redundant code memory 16 The redundant codes (C1 to C4) corresponding to the data (D1 to D4) are read out.
[0059]
Thereafter, the buffer access controller 104 assembles the address (A), the data (D1 to D4) output on the data bus (data bus) of the bus 10, and the redundant codes (C1 to C4) read by the code memory controller 106 into the storage format of the update history information, and writes the result to the entry of the before image buffer (BIB) 17 designated by the pointer value (P).
[0060]
As described above, the update history information is written to the before image buffer (BIB) 17 automatically by the bus transaction issue control unit 103, the buffer access controller 104, and the code memory controller 106 when data is written to the cache memory, that is, before the data is written back from the cache memory to the main memory 14.
[0061]
FIG. 8 shows the flow of recovery processing when a data error in the read data is detected when data is read from the main memory 14.
Here, a case will be described in which execution can correctly resume from the original instruction without restoring the contents of the main memory 14 to the checkpoint preceding the error.
[0062]
That is, when data (D) written to the main memory 14 at some point is later read from the main memory 14, and the data (D) has been changed to an incorrect data value (D′) by a memory error or the like, the memory data error is detected by the main memory controller 13 checking the error detection code. The occurrence of this memory data error is notified to a predetermined CPU by a hardware interrupt signal or the like, and an error interrupt routine is executed by that CPU.
[0063]
The CPU executing the error interrupt routine masks the error interrupt so that it does not occur again (step S10), reads the data (D′) stored at the address in the main memory 14 where the error occurred, and then reads the corresponding redundant code (C) from the redundant code memory 16 (steps S11 and S12). Thereafter, the CPU reconstructs the correct data (D) from the data (D′) and the redundant code (C) (step S13), and stores the data (D) at the address in the main memory 14 where the error occurred (step S14).
[0064]
This procedure of the recovery process can be applied to a system that does not employ the checkpoint restart method using the before image buffer (BIB) 17 because the before image buffer (BIB) 17 is not used.
[0065]
FIG. 9 shows a second example of the recovery process in the case where a data error of the read data is detected when reading data from the main memory 14.
Here, it is assumed that data (D) written to the main memory 14 before a certain checkpoint CP1 is read from the main memory 14 for the first time after the checkpoint CP1 has been taken, and that the data (D) is then found to have been changed to an incorrect data value (D′) by a memory error or the like.
[0066]
The occurrence of this memory error is detected by checking the error detection code of the data value (D ′) by the main memory controller 13 and notified to a predetermined CPU by a hardware interrupt signal or the like. Then, the recovery routine is executed by the CPU.
[0067]
The CPU executing the recovery routine masks the error interrupt so that it does not occur again (step S20), reads the data (D′) stored at the address in the main memory 14 where the error occurred, and then reads the corresponding redundant code (C) from the redundant code memory 16 (steps S21 and S22). Thereafter, the CPU reconstructs the correct data (D) from the data (D′) and the redundant code (C) (step S23), and stores the data (D) at the address in the main memory 14 where the error occurred (step S24).
[0068]
Next, the CPU controls the BIB / CM controller 15 to write the pre-update data of the before image buffer (BIB) 17 back to the main memory 14 and write back the redundant code to the redundant code memory 16 (step S25, S26). Thereafter, the process state collected at the checkpoint CP1 is restored to each CPU, and the processing is resumed from the checkpoint CP1.
[0069]
Thus, by correcting the memory error and then restoring the contents of the main memory 14 to the time of the checkpoint before the failure occurs, it becomes possible to prevent the same failure from recurring due to erroneous reading of the memory data. Therefore, even when a memory failure occurs that cannot be recovered by the checkpoint restart method alone, the processing can be continued.
[0070]
FIG. 10 shows the configuration of a computer system according to the second embodiment of the present invention.
This computer system employs a vertical parity memory 21 in place of the redundant code memory 16 of the first embodiment, and manages the vertical parity data used for error correction not in units of words but in units of unit data blocks (cache blocks) that are read or written by consecutive CPU accesses such as burst transfers.
[0071]
That is, the main memory 14 is a memory having an error detection function, such as a memory with parity, and a parity bit is added to the data string of each word, the word being the unit of data read or written by the CPU in a single memory access.
[0072]
The vertical parity memory 21 is provided to add an error correction function to the main memory 14, which has an error detection function, and has as many entries as the number of unit data blocks that can be stored in the main memory 14. Each entry stores vertical parity data calculated over the same bit position of the data strings belonging to the corresponding unit data block of the main memory 14. For example, as shown in FIG. 11, when the unit data block of the cache block N of the main memory 14 consists of the 4-byte data D0 to D3 and 4-bit horizontal parities P0 to P3 are added to the data D0 to D3, respectively, the entry N of the vertical parity memory 21 stores vertical parity data consisting of a 4-byte vertical parity Dp, calculated for each bit position across the data D0 to D3, and a 4-bit vertical parity Pp, calculated for each bit position across the horizontal parities P0 to P3.
[0073]
In this way, the unit data block is divided into data units whose errors can be detected by the horizontal parity bits, and the vertical parity data calculated over them is stored in the vertical parity memory 21. For data in which an error has been detected, the erroneous bit position can then be determined from the vertical parity data, so that error correction becomes possible.
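How horizontal and vertical parity combine can be sketched in software. The sketch assumes the layout of FIG. 11 as read here: four 4-byte data strings per unit data block, each carrying a 4-bit horizontal parity, taken to be one even-parity bit per byte (an assumption). The function names are hypothetical, and the vertical parity Pp over the horizontal parities is omitted for brevity.

```python
from functools import reduce
from operator import xor


def byte_parities(word: int) -> int:
    """4-bit horizontal parity: one even-parity bit per byte (assumed layout)."""
    bits = 0
    for i in range(4):
        byte = (word >> (8 * i)) & 0xFF
        bits |= (bin(byte).count("1") & 1) << i
    return bits


def vertical_parity(words) -> int:
    """Vertical parity Dp: bitwise XOR over the same bit position of all words."""
    return reduce(xor, words, 0)


def repair_block(words, horizontal, dp):
    """Locate the bad data string by horizontal parity, rebuild it from Dp."""
    bad = [i for i, (w, p) in enumerate(zip(words, horizontal))
           if byte_parities(w) != p]
    if not bad:
        return list(words)
    if len(bad) > 1:
        raise ValueError("more than one data string is in error")
    i = bad[0]
    others = vertical_parity(w for j, w in enumerate(words) if j != i)
    fixed = list(words)
    fixed[i] = dp ^ others   # XOR of Dp and the intact strings gives the lost one
    return fixed


block = [0x11223344, 0xDEADBEEF, 0x00000000, 0xFFFFFFFF]  # D0 to D3
horizontal = [byte_parities(w) for w in block]            # P0 to P3
dp = vertical_parity(block)                               # entry N of memory 21

damaged = list(block)
damaged[1] ^= 1 << 20                                     # one bit of D1 flips
repaired = repair_block(damaged, horizontal, dp)
```

The horizontal parity says which data string is wrong and the vertical parity says which bit: XORing Dp with the three intact strings yields the original contents of the damaged one.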
[0074]
Vertical parity data is written to the vertical parity memory 21 in response to a bus transaction, issued on the bus 10, for writing a cache line held in a cache memory back to the main memory 14. In this case, vertical parity data is generated from the unit data block for one cache line output consecutively on the bus 10, and the entry position of the vertical parity memory 21 to which the vertical parity data is to be written is determined from the unit block address output on the bus 10.
[0075]
When only part of the data belonging to a unit data block in the main memory is updated by a word write, the unit data block to be updated is read from the main memory 14, and new vertical parity data is obtained from the difference between that unit data block and the write data and from the vertical parity data in the vertical parity memory 21 corresponding to the read unit data block. The new vertical parity data is then written to the entry of the vertical parity memory 21 corresponding to the unit data block to which the write data belongs.
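The difference-based update described above relies on the fact that XOR parity can be patched without recomputing it over the whole block. A minimal sketch, assuming 4-byte data strings and bitwise XOR as the "difference" (function names are hypothetical):

```python
from functools import reduce
from operator import xor


def vertical_parity(words) -> int:
    """Bitwise XOR over the same bit position of all words in the block."""
    return reduce(xor, words, 0)


def updated_parity(old_parity: int, old_word: int, new_word: int) -> int:
    """New parity = old parity XOR (old word XOR new word)."""
    return old_parity ^ old_word ^ new_word


block = [0x11223344, 0xDEADBEEF, 0x00000000, 0xFFFFFFFF]
parity = vertical_parity(block)          # current entry of the parity memory

new_word = 0x0BADF00D                    # word write to the third data string
parity = updated_parity(parity, block[2], new_word)
block[2] = new_word
```

Only the old value of the word being replaced and the stored parity enter the calculation, so the patched parity equals the parity recomputed from the updated block.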
[0076]
The before image buffer (BIB) 17 is used as a log memory for holding update history information of the main memory 14 during a period from a certain checkpoint to the next checkpoint, as in the first embodiment. Each time data is written to the main memory 14, before the data is written, the cache block address of the main memory 14 to which the address where the data is written belongs, the unit data block before update, and the unit data block before update are stored. Corresponding vertical parity data is stored in the before image buffer (BIB) 17 in a stack format as update history information.
[0077]
When an error is detected in the read data of the main memory 14, the bit position where the error has occurred is specified from the error detection result by the horizontal parity of the main memory 14 and the vertical parity data, and correct data is reconstructed. Then, it is written back to the main memory 14.
[0078]
In the second embodiment, generation of vertical parity data and read/write control of update history information for the before image buffer (BIB) 17 are realized by hardware similar to that of the first embodiment. That is, the operation is as described for the system of FIG. 5 with the redundant code memory 16 replaced by the vertical parity memory 21. Vertical parity data is written to the vertical parity memory 21 automatically by the code memory controller 106, in parallel with the write-back processing, when data is written back from the cache memory to the main memory 14. The update history information is written to the before image buffer (BIB) 17 automatically by the bus transaction issue control unit 103, the buffer access controller 104, and the code memory controller 106 when data is written to the cache memory, that is, before the data is written back from the cache memory to the main memory 14.
[0079]
Also, the failure recovery processing in the second embodiment can be performed in the same procedure as in the first embodiment described with reference to FIGS. That is, when restoring the contents of the main memory 14 to the checkpoint state before the failure, the correct data is first reconstructed using the vertical parity data. The update history information is then read sequentially from the before image buffer (BIB) 17, and the pre-update unit data block and the vertical parity data are written back to the corresponding storage positions in the main memory 14 and the vertical parity memory 21, respectively.
[0080]
The case where the data in the main memory has parity and can detect a 1-bit error has been described above, but the same configuration is possible when the SEC-DED code is used. In this case, when a 2-bit error is detected when reading data from the main memory, the correct data is reconstructed by the same method as described above, and the failure can be recovered.
[0081]
FIG. 12 shows the configuration of a computer system according to the third embodiment of the present invention.
This computer system employs a block parity memory 22 in place of the redundant code memory 16 of the first embodiment, and manages the block parity data used for error correction not in units of words but in units of data block groups, each consisting of four unit data blocks (cache blocks) that are read or written by consecutive CPU accesses such as burst transfers.
[0082]
That is, the main memory 14 is a memory having an error detection function, such as a memory with parity, and a parity bit is added to the data string of each word, the word being the unit of data read or written by the CPU in a single memory access.
[0083]
The block parity memory 22 is provided to add an error correction function to the main memory 14, which has an error detection function, and has as many entries as the number of data block groups that can be stored in the main memory 14. Each entry stores vertical parity data calculated over the same bit position of the data blocks belonging to the corresponding data block group in the main memory 14.
[0084]
In this way, the data block group is divided into unit data blocks, each of which can be read or written by a single cache line operation, and the vertical parity data calculated over them is stored in the block parity memory 22. The unit data within a unit data block in which an error has occurred can then be identified by the horizontal parity of the main memory 14, and the erroneous bit position can be determined from the block parity data, so that error correction becomes possible.
[0085]
Block parity data is written to the block parity memory 22 when it is detected that a bus transaction for writing a cache line in a cache memory back to the main memory 14 has been issued on the bus 10. In this case, the data block that is to be updated by the unit data block for one cache line output consecutively on the bus 10 is read from the main memory 14, and new block parity data is generated from the difference (exclusive OR) between that data block and the unit data block to be written and from the block parity data in the block parity memory 22 corresponding to the data block group. The new block parity data is then written to the entry of the block parity memory 22 corresponding to the data block group to which the unit data block to be written belongs.
[0086]
As in the first embodiment, the before image buffer (BIB) 17 is used as a log memory that holds update history information of the main memory 14 for the period from one checkpoint to the next. Each time data is written to the main memory 14, the address of the data block group of the main memory 14 to which the write address belongs, the pre-update data block group, and the block parity data corresponding to the pre-update data block group are accumulated in the before image buffer (BIB) 17 in stack form as update history information.
[0087]
In this case, the pre-update data block and the block parity data that are read out in order to generate the new block parity data can also serve as the update history information to be stored in the before image buffer (BIB) 17, so control can be arranged such that each needs to be accessed only once.
[0088]
When an error is detected in data read from the main memory 14, the bit position where the error occurred is identified from the error detection result given by the horizontal parity of the main memory 14 together with the block parity data, and the correct data is reconstructed and written back to the main memory 14. Specifically, all unit data blocks of the data block group to which the erroneous data belongs are read from the main memory 14, and the correct data block group is reproduced from them and the corresponding block parity data.
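The reconstruction can be sketched as XORing the block parity with every unit data block of the group except the faulty one. The example assumes the faulty block has already been identified by the horizontal parity check; `reconstruct_block` is a hypothetical helper name.

```python
def reconstruct_block(group, parity, bad_index):
    """Rebuild one unit data block from the group's parity and the other blocks."""
    rebuilt = list(parity)
    for i, block in enumerate(group):
        if i != bad_index:
            rebuilt = [r ^ w for r, w in zip(rebuilt, block)]
    return rebuilt


# Hypothetical fault-free contents and the parity stored for them.
good = [[0x01, 0x02], [0x0A, 0x0B], [0xF0, 0x0F]]
parity = [0x01 ^ 0x0A ^ 0xF0, 0x02 ^ 0x0B ^ 0x0F]

# Main memory copy in which one bit of block 2 has been corrupted.
memory_group = [list(b) for b in good]
memory_group[2][0] ^= 0x10

# Read all blocks of the group, rebuild the faulty one, and write it back.
memory_group[2] = reconstruct_block(memory_group, parity, bad_index=2)
```

Because the faulty block itself is not used in the XOR, the method does not depend on how many bits of that block are wrong, as long as the other blocks of the group and the parity entry are intact.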
[0089]
In the third embodiment, generation of block parity data and read/write control of update history information for the before image buffer (BIB) 17 are realized by hardware similar to that of the first embodiment. That is, the operation is as described for the system of FIG. 5 with the redundant code memory 16 replaced by the block parity memory 22. The writing of block parity data to the block parity memory 22 is executed automatically by the code memory controller 106 in parallel with the write-back of data from the cache memory to the main memory 14. The writing of update history information to the before image buffer (BIB) 17 takes place when data is written to the cache memory, that is, before the data is written back from the cache memory to the main memory 14, and is executed automatically by the bus transaction issue control unit 103, the buffer access controller 104, and the code memory controller 106.
[0090]
The failure recovery processing in the third embodiment can also be performed by the same procedure as in the first embodiment. That is, when the contents of the main memory 14 are restored to the state of the checkpoint preceding the failure, the correct data is first reconstructed using the block parity data, and then the update history information is read sequentially from the before image buffer (BIB) 17 and the pre-update data block group and block parity data are written back to the corresponding storage positions in the main memory 14 and the block parity memory 22, respectively.
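The two-step recovery can be sketched as follows: first repair the faulty block from the block parity, then pop the logged before-images in reverse order. The dictionary model of memory, the plain list used as the BIB, and the names `rebuild` and `recover` are assumptions made for this illustration.

```python
def rebuild(group, parity, bad_index):
    """XOR the parity with every block of the group except the faulty one."""
    out = list(parity)
    for i, block in enumerate(group):
        if i != bad_index:
            out = [o ^ w for o, w in zip(out, block)]
    return out


def recover(memory, parity_mem, bib, bad_addr, bad_index):
    # Step 1: correct the faulty unit data block using the block parity data.
    memory[bad_addr][bad_index] = rebuild(
        memory[bad_addr], parity_mem[bad_addr], bad_index)
    # Step 2: undo every write since the checkpoint, newest entry first.
    while bib:
        addr, old_group, old_parity = bib.pop()
        memory[addr] = [list(b) for b in old_group]
        parity_mem[addr] = list(old_parity)


# State at the moment of failure. Since the checkpoint, group 0 was written
# twice (block 0 became [9, 9], then block 1 became [4, 4]); group 1 was never
# written, but one bit of its block 0 has been corrupted by a fault.
memory = {0: [[9, 9], [4, 4]], 1: [[5, 2], [7, 8]]}
parity_mem = {0: [13, 13], 1: [2, 14]}
bib = [
    (0, [[1, 2], [3, 4]], [2, 6]),     # before-image of the first write
    (0, [[9, 9], [3, 4]], [10, 13]),   # before-image of the second write
]

recover(memory, parity_mem, bib, bad_addr=1, bad_index=0)
```

In this example the corrupted group 1 has no entry in the log, so rolling back alone would leave the bad word in place; the parity-based repair is what makes the restored checkpoint state usable.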
[0091]
The case where the data in the main memory has parity and can detect a 1-bit error has been described above, but the same configuration is possible when the SEC-DED code is used. In this case, when a 2-bit error is detected when reading data from the main memory, the correct data is reconstructed by the same method as described above, and the failure can be recovered.
[0092]
In the above description, in every embodiment the pre-update data and the corresponding error-correcting redundant code (ECC, vertical parity, or block parity) are written to the before image buffer (BIB) 17 at the same time. However, the redundant code may instead be written to the before image buffer (BIB) 17 when the redundant code memory 16, the vertical parity memory 21, or the block parity memory 22 that stores the redundant code is updated. In this case, the redundant code that is about to be overwritten by a new redundant code is read from the redundant code memory 16, the vertical parity memory 21, or the block parity memory 22 and is written to the before image buffer (BIB) 17.
[0093]
[Effect of the invention]
As described above, according to the present invention, a memory subsystem having an error correction function can be constructed by adding hardware while resources such as existing memory with parity are used as they are, so that a highly reliable computer system can be realized. In addition, processing can be continued even when a memory failure occurs that cannot be recovered by the checkpoint restart method using a log memory, so that sufficient fault tolerance can be achieved with less hardware and without duplicating the memory.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a computer system according to a first embodiment of the present invention.
FIG. 2 is a view for explaining an error correction code writing operation to a redundant code memory in the system according to the first embodiment.
FIG. 3 is a view for explaining an update history information writing operation to a BIB memory in the system according to the first embodiment.
FIG. 4 is a view for explaining the restoration operation of the main memory and the redundant code memory in the system according to the first embodiment.
FIG. 5 is a block diagram showing a specific hardware configuration employed in the system of the first embodiment.
FIG. 6 is a timing chart illustrating a series of operations executed in a write-back process from a cache to a main memory in the system of FIG.
FIG. 7 is a timing chart for explaining a series of operations executed in a writing process for a shared line in a cache in the system of FIG.
FIG. 8 is a flowchart for explaining a first procedure of failure recovery processing executed in the system of FIG. 5.
FIG. 9 is a flowchart for explaining a second procedure of failure recovery processing executed in the system of FIG. 5.
FIG. 10 is a block diagram showing a configuration of a computer system according to a second embodiment of the present invention.
FIG. 11 is a view for explaining the generation principle of vertical parity data in the system of the second embodiment.
FIG. 12 is a block diagram showing a configuration of a computer system according to a third embodiment of the present invention.
[Explanation of symbols]
10 ... processor bus, 11-1 to 11-n ... CPU, 12-1 to 12-n ... cache memory, 13 ... main memory controller, 14 ... main memory, 15 ... BIB/CM controller, 16 ... redundant code memory (CM), 17 ... before image buffer (BIB), 21 ... vertical parity memory, 22 ... block parity memory, 101 ... bus interface control unit, 102 ... bus transaction response control unit, 103 ... bus transaction issue control unit, 104 ... buffer access controller, 105 ... state storage control unit, 106 ... code memory controller.

Claims

A computer system having one or more CPUs and a main memory with parity that is connected to the CPUs via a bus and stores data with added parity, the computer system comprising:
a cache memory;
a memory controller that is provided between the bus and the main memory and controls the main memory;
a redundant code memory having a plurality of storage areas provided corresponding to each address serving as a unit of read/write access to the main memory, each storage area holding a redundant code capable of correcting an error occurring in a part of the data stored at the address of the main memory corresponding to that storage area;
a control device that is provided between the bus and the redundant code memory and controls the redundant code memory, the control device monitoring bus transactions issued on the bus and, when write-back of data from the cache memory to the main memory is executed, generating a redundant code corresponding to the data from the value of the data on the bus and storing the redundant code in the storage area of the redundant code memory corresponding to the write address of the unit data; and
error correction means for, when a data error in read data is detected by the memory controller at the time of reading data from the main memory, reconstructing correct data using the read data and the redundant code stored in the redundant code memory corresponding to the read data.

The computer system according to claim 1, further comprising a log memory that is connected to the control means and stores update history information of the main memory, wherein:
the control means includes means for reading, before data write to the main memory by the CPU is executed, the pre-update data of the main memory corresponding to the address where the data write is executed and the redundant code corresponding to the pre-update data from the main memory and the redundant code memory, respectively, and storing the pre-update data and the redundant code in the log memory as the update history information; and
when a failure occurs that requires the contents of the main memory to be restored to the state before the failure, the pre-update data and redundant code constituting each item of update history information stored in the log memory are written back to the main memory and the redundant code memory, respectively, so that the main memory is restored to the state before the failure and the contents of the redundant code memory are returned to a state corresponding to the restored contents of the main memory.

The computer system according to claim 2, wherein the control means includes means for reading, when data is written to the cache memory by the CPU, the pre-update data of the main memory corresponding to the address where the data write is executed and the redundant code corresponding to that data from the main memory and the redundant code memory, respectively, and storing the pre-update data and redundant code in the log memory as the update history information.

A computer system having one or more CPUs and a main memory with parity that is connected to the CPUs via a bus and stores parity-added data, the computer system storing information necessary for failure recovery in the main memory at each checkpoint and, when a failure occurs, restoring the contents of the main memory to the checkpoint before the failure using update history information stored in a log memory, the computer system comprising:
a cache memory;
a memory controller that is provided between the bus and the main memory and controls the main memory;
a redundant code memory having a plurality of storage areas provided corresponding to each address serving as a unit of read/write access to the main memory, each storage area holding a redundant code capable of correcting an error occurring in a part of the data stored at the address of the main memory corresponding to that storage area;
a log memory for storing update history information of the main memory;
a control device that is connected to the bus, the redundant code memory, and the log memory and controls the redundant code memory and the log memory, the control device comprising: means for monitoring bus transactions issued on the bus and, when write-back of data from the cache memory to the main memory is executed, generating a redundant code corresponding to the data from the value of the data on the bus and storing the redundant code in the storage area of the redundant code memory corresponding to the write address of the unit data; and means for reading, when data write to the cache memory by the CPU is executed, the pre-update data of the main memory corresponding to the address where the data write is executed and the redundant code corresponding thereto from the main memory and the redundant code memory, respectively, and storing the pre-update data and the redundant code in the log memory as the update history information;
error correction means for, when a data error in read data is detected by the memory controller at the time of reading data from the main memory, reconstructing correct data using the read data and the redundant code stored in the redundant code memory corresponding to the read data, and writing the reconstructed data into the main memory; and
means for, when a failure occurs that requires the contents of the main memory to be restored to the state before the failure, writing the pre-update data and redundant code constituting each item of update history information stored in the log memory back to the main memory and the redundant code memory, respectively, thereby restoring the main memory to the state before the failure and returning the contents of the redundant code memory to a state corresponding to the restored contents of the main memory.

A computer system having one or more CPUs and a main memory with parity that is connected to the CPUs via a bus and has an error detection function based on parity for each data string serving as a read/write access unit, the computer system comprising:
a cache memory;
a memory controller that is provided between the bus and the main memory and controls the main memory;
a vertical parity memory having a plurality of storage areas provided corresponding to each unit data block composed of a plurality of continuously accessed data strings in the main memory, each storage area holding vertical parity data for the same bit position of each of the plurality of data strings belonging to the unit data block corresponding to that storage area;
a control device that is provided between the bus and the vertical parity memory and controls the vertical parity memory, the control device monitoring bus transactions issued on the bus and, when write-back of a unit data block from the cache memory to the main memory is executed, generating the vertical parity data corresponding to the unit data block from the value of the unit data block on the bus and storing the vertical parity data in the storage area of the vertical parity memory corresponding to the unit data block; and
means for, when a data error in read data is detected by the parity check of the memory controller at the time of reading data from the main memory, reconstructing a correct unit data block from the unit data block to which the data in which the data error was detected belongs and the vertical parity data in the vertical parity memory corresponding to that unit data block.

A failure recovery method for a computer system comprising: one or more CPUs; a main memory with parity that is connected to the CPUs via a bus and stores parity-added data; a cache memory; a memory controller that is provided between the bus and the main memory and controls the main memory; a redundant code memory having a plurality of storage areas provided corresponding to each address serving as a unit of read/write access to the main memory, each storage area holding a redundant code capable of correcting an error occurring in a part of the data stored at the address of the main memory corresponding to that storage area; and a controller that is provided between the bus and the redundant code memory, monitors bus transactions issued on the bus and, when write-back from the cache memory to the main memory is executed, generates a redundant code corresponding to the data from the value of the data on the bus and stores the redundant code in the storage area of the redundant code memory corresponding to the write address of the unit data, the method comprising:
when a data error in read data is detected by the memory controller at the time of reading data from the main memory, reconstructing correct data using the read data and the redundant code stored in the redundant code memory corresponding to the read data; and
writing the reconstructed data into the main memory.

The failure recovery method according to claim 6, wherein a log memory that is connected to the control means and stores update history information of the main memory is further provided, and the control means includes means for reading, before data write to the main memory by the CPU is executed, the pre-update data of the main memory corresponding to the address where the data write is executed and the redundant code corresponding to the pre-update data from the main memory and the redundant code memory, respectively, and storing the pre-update data and redundant code in the log memory as the update history information, the method further comprising:
when a failure occurs that requires the contents of the main memory to be restored to the state before the failure, writing the pre-update data and redundant code constituting each item of update history information stored in the log memory back to the main memory and the redundant code memory, respectively, thereby restoring the main memory to the state before the failure and returning the contents of the redundant code memory to a state corresponding to the restored contents of the main memory.