
WO1999008190A1 - Process control monitoring system and method - Google Patents

Process control monitoring system and method

Info

Publication number
WO1999008190A1
WO1999008190A1 PCT/US1998/015857 US9815857W WO9908190A1 WO 1999008190 A1 WO1999008190 A1 WO 1999008190A1 US 9815857 W US9815857 W US 9815857W WO 9908190 A1 WO9908190 A1 WO 9908190A1
Authority
WO
WIPO (PCT)
Prior art keywords
processing environment
data
time
database
primary processing
Prior art date
Application number
PCT/US1998/015857
Other languages
English (en)
Inventor
Kuo-Chu Lee
Min Tae Yu
Original Assignee
Bell Communications Research, Inc.
Priority date
Filing date
Publication date
Application filed by Bell Communications Research, Inc.
Publication of WO1999008190A1

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B19/00 Programme-control systems
    • G05B19/02 Programme-control systems electric
    • G05B19/04 Programme control other than numerical control, i.e. in sequence controllers or logic controllers
    • G05B19/042 Programme control other than numerical control, i.e. in sequence controllers or logic controllers using digital processors
    • G05B19/0421 Multiprocessor system
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00 Program-control systems
    • G05B2219/20 Pc systems
    • G05B2219/24 Pc safety
    • G05B2219/24181 Fail silent nodes, replicated nodes grouped into fault tolerant units

Definitions

  • The present invention relates to the field of process management, and more particularly to systems and methods for monitoring processes to improve fault tolerance in distributed systems.
  • In various processing environments, data must be replicated to alternative storage databases at various times to ensure that the data is not lost. Processors in many stringent environments, such as telephone networks, have very specific data integrity requirements, which significantly increase the replication burden.
  • In asynchronous data replication, data is replicated periodically from one database to another. If one database fails, the data is available from the other database.
  • A replication interval is defined to control how frequently replication is performed.
  • The invention includes a method of replicating data, executed by a processor, including the steps of marking data with a time-stamp corresponding to a time the data is generated or modified, periodically identifying data having a time-stamp later than a predetermined time, writing the identified data to a first replication database, and periodically writing the identified data from the first database to a second database.
  • The invention further includes a method of monitoring processes in a first processing environment, including the steps of marking data with a time-stamp corresponding to a time the data is generated or modified, periodically identifying data having a time-stamp later than a predetermined time, writing the identified data to a first database associated with the first processing environment, periodically writing the identified data from the first database to a second database associated with a second processing environment, determining at the secondary processing environment whether the time-stamps of any of the identified data correspond to a predetermined time interval, and, if the time-stamps of any of the identified data do not correspond to the predetermined time interval, initiating action to correct any problems in the primary processing environment.
  • The invention further includes a method of monitoring processes in a first processing environment, including the steps of marking data with a time-stamp corresponding to a time the data is generated or modified, periodically identifying data having a time-stamp later than a predetermined time, writing the identified data to a first database associated with the first processing environment, periodically writing the identified data from the first database to a second database associated with a second processing environment, determining at the secondary processing environment whether the time-stamps of any of the identified data correspond to a predetermined time interval, and, if the time-stamps of any of the identified data do not correspond to the predetermined interval, determining whether the primary processing environment is on-line. If the primary processing environment is not on-line, the method includes initializing processes in the secondary processing environment and taking over processing functions of the primary processing environment at the secondary environment.
  • The invention further includes systems and devices for performing similar processing functions.
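The time-stamp-based replication steps summarized above can be illustrated with a short Python sketch. It is not the claimed implementation: the `Record` class, the use of plain dictionaries in place of shared memory and the two databases, and all sample values are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Record:
    key: str
    value: str
    timestamp: float  # time the record was generated or last modified


def select_changed(records, since):
    """Return the records whose time-stamp is later than `since`."""
    return [r for r in records if r.timestamp > since]


def replicate(source, target, since):
    """Copy records changed after `since` from `source` into `target`.

    Both stores are plain dicts keyed by record key (an assumption).
    Returns the number of records written.
    """
    changed = select_changed(source.values(), since)
    for record in changed:
        target[record.key] = record
    return len(changed)


# Hypothetical data: three records, one of them older than the cut-off.
shared_memory = {
    "proc_a": Record("proc_a", "running", timestamp=105.0),
    "proc_b": Record("proc_b", "running", timestamp=90.0),
    "proc_c": Record("proc_c", "restarted", timestamp=110.0),
}
first_db = {}
second_db = {}

written_first = replicate(shared_memory, first_db, since=100.0)
written_second = replicate(first_db, second_db, since=100.0)
```

Only the two records stamped after the cut-off of 100.0 reach the first and second stores; the unchanged record is skipped, which is the point of replicating by time-stamp rather than copying everything.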
  • Fig. 1 is a block diagram of a process control monitoring system in accordance with one embodiment of the present invention.
  • Fig. 2 is a process flow diagram of the operation of a primary processing environment to provide process control monitoring in accordance with one embodiment of the present invention.
  • Fig. 3 is a process flow diagram of the operation of a secondary processing environment to provide process control monitoring in accordance with one embodiment of the present invention.
  • FIG. 1 is a block diagram of a process control monitoring system in accordance with one embodiment of the present invention.
  • A process control monitoring system in accordance with the present invention is distributed across a primary processing environment 102 and a secondary processing environment 104.
  • Processing environments 102 and 104 may each correspond to a computer, processor, or network component that processes information and replicates that information.
  • Primary processing environment 102 includes a plurality of processes or applications 106a-106n, a primary process control monitor (“PCM”) 108 connected to each of the processes 106, and a shared memory 110.
  • Primary PCM 108 monitors processes 106 and determines whether each should be running, brought up, and/or shut down. It further detects which processes are "hung up." For example, at predetermined time intervals, each process 106 writes status information to primary PCM 108, which then stores the status information in shared memory 110.
  • The information in shared memory 110 can be used to monitor processes 106 and other applications (not shown).
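The status reporting described for primary PCM 108 can be sketched as a small table of last-report times. The text does not say how a hung process is recognized, so the rule used here (a process is suspect once it has been silent longer than a chosen limit) is an assumption, as are the `StatusTable` class and the sample times.

```python
class StatusTable:
    """Stand-in for shared memory 110: latest status per process."""

    def __init__(self):
        self._entries = {}  # process name -> (status, time of last report)

    def report(self, name, status, now):
        """Record a status report received from a process."""
        self._entries[name] = (status, now)

    def hung(self, now, max_silence):
        """Names of processes silent for longer than `max_silence` seconds."""
        return sorted(
            name
            for name, (_, last) in self._entries.items()
            if now - last > max_silence
        )


table = StatusTable()
table.report("106a", "ok", now=0.0)
table.report("106b", "ok", now=0.0)
table.report("106a", "ok", now=10.0)  # 106b misses its second report

suspects = table.hung(now=15.0, max_silence=12.0)
```

Here process 106b has been silent for 15 seconds against a limit of 12, so it is the only one flagged.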
  • The content of shared memory 110 is periodically replicated to primary database 112. This replication process may be performed as described, for example, in co-pending U.S. Patent Application Serial No. 08/907,705, filed concurrently, which is incorporated by reference.
  • Secondary processing environment 104 includes secondary PCM 114 for monitoring individual processes 116a-116n. Secondary processing environment 104 also includes a shared memory 118 for storing any process data from processes 116a-116n. In addition, secondary processing environment 104 is connected to a secondary database 120 to replicate data for efficiency and fault tolerance.
  • In accordance with one embodiment of the invention, however, secondary PCM 114 differs from primary PCM 108 in that secondary PCM 114 monitors primary processing environment 102 at the process or application level. As described in more detail below, process data stored in primary database 112 is periodically replicated to secondary database 120. Secondary PCM 114 uses this replicated data to monitor the process performance of primary processing environment 102 and take over processing where necessary.
  • Fig. 2 is a process flow diagram of the operation of primary processing environment
  • Primary PCM 108 monitors the operation of processes 106 at a predetermined interval. Thus, primary PCM 108 initially determines whether monitoring interval T1 has expired (step 200). If not, primary PCM 108 continues to monitor. If time T1 has expired, primary PCM 108 checks each process (step 202) and determines whether it is malfunctioning (step 204). If a malfunction exists, primary PCM 108 preferably restarts or corrects the process (step 206). If no malfunctions exist, primary PCM 108 determines whether it is time to replicate the shared memory 110 data to primary database 112.
  • Primary PCM 108 determines whether a preselected replication time interval T2 has expired (step 208). If not, primary PCM 108 continues to wait. Once T2 has expired, primary PCM 108 writes the process data from shared memory 110 to primary database 112 (step 210). In one embodiment of the present invention, this replication process writes all data to primary database 112. However, in accordance with another embodiment of the invention, primary PCM 108 only writes predetermined portions of the data. For example, the replication process may be based on time-stamped data as described in the incorporated co-pending U.S. Patent Application Serial No. 08/907,705. To enable additional process monitoring, in accordance with the present invention, data from primary database 112 is further replicated to secondary database 120.
  • Primary PCM 108 determines whether it is time to replicate data from primary database 112 to secondary database 120. Specifically, primary PCM 108 determines whether a third preselected time interval T3 has expired.
  • T3 is configurable and is preferably selected to ensure accuracy in a fault-tolerant design depending on the system or network configuration and corresponding application. For example, in a telecommunication environment such as the telephone network, where certain standards require very strict fault tolerance, this time period T3 would be relatively short, for example twenty seconds.
  • If primary PCM 108 determines that T3 has not expired, it continues other steps of its normal processing. However, when primary PCM 108 determines that time T3 has expired, it replicates the primary database information to the secondary database (step 214). In a preferred embodiment, this data replication process is also performed based on the time-stamped data, much like the data replication between shared memory 110 and primary database 112. In other words, only data whose status has been changed or updated since the beginning of the replication interval T3 is replicated from primary database 112 to secondary database 120. In accordance with the present invention, secondary PCM 114 uses the replicated data stored in secondary database 120 to monitor the processes of the primary processing environment 102. For example, secondary PCM 114 may monitor the time-stamp information corresponding to each operation.
  • If the replicated data includes time-stamps corresponding to the most recent replication interval T3, then processes 106 of primary processing environment 102 are functioning properly and the replication process is functioning properly. However, if the time-stamps are old, then a problem exists in either the processes 106 or the replication process.
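The three nested intervals used by the primary PCM (process checks every T1, shared memory to primary database every T2, primary database to secondary database every T3) can be sketched as a simple simulated loop. This is a simplified illustration under stated assumptions: the `IntervalTimer` helper, the flat (rather than strictly sequential) ordering of the checks, and the values chosen for T1 and T2 are hypothetical; only the twenty-second example for T3 comes from the text.

```python
class IntervalTimer:
    """Fires once each time `period` seconds have elapsed."""

    def __init__(self, period, start=0.0):
        self.period = period
        self.next_due = start + period

    def expired(self, now):
        if now >= self.next_due:
            self.next_due = now + self.period
            return True
        return False


def run_primary(ticks, t1, t2, t3):
    """Simulate the primary PCM loop; return a list of (time, action)."""
    check = IntervalTimer(t1)
    to_primary_db = IntervalTimer(t2)
    to_secondary_db = IntervalTimer(t3)
    log = []
    for now in ticks:
        if check.expired(now):
            log.append((now, "check processes"))  # steps 202-206
        if to_primary_db.expired(now):
            log.append((now, "shared memory -> primary db"))  # step 210
        if to_secondary_db.expired(now):
            log.append((now, "primary db -> secondary db"))  # step 214
    return log


# Simulated clock: one tick every 5 seconds from 0 to 20.
log = run_primary(ticks=range(0, 21, 5), t1=5, t2=10, t3=20)
```

With these values the process check runs four times, the shared-memory replication twice, and the replication to the secondary database once, at the 20-second mark.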
  • Fig. 3 is a process flow diagram of a process executed by a secondary PCM 114 to provide process control monitoring in accordance with one embodiment of the invention.
  • Secondary PCM 114 periodically monitors the time-stamps of data replicated to secondary database 120.
  • The time interval T4 for this monitoring step is also configurable, and again, in networks or systems requiring strict fault tolerance, this interval would be shortened. For example, in telephone networks, the interval might be 20 seconds.
  • Secondary PCM 114 initially determines whether the time interval T4 has expired (step 300). If not, it continues to monitor. If time interval T4 has expired, secondary PCM 114 checks the time-stamps on the replicated data in secondary database 120 (step 302). Secondary PCM 114 then determines whether any time-stamps correspond to the most recent replication interval T3 (step 304). If they do, the processes 106 and the replication process are working properly and secondary PCM 114 continues normal monitoring until the next interval T4. However, if secondary PCM 114 determines that none of the time-stamps correspond to the most recent replication interval T3, then a problem exists in either the processes 106 or the replication process.
  • Secondary PCM 114 then determines whether the primary processing environment 102 is still "alive" (functioning properly and/or on-line) (step 306). If it is, secondary PCM 114 preferably requests help (step 308). Because primary processing environment 102 is still alive but the replicated information is not up to date, certain problems can be presumed. For example, certain processes 106 may be down. In one embodiment, secondary PCM 114 may automatically reinitialize various processes 106, based on these presumptions or on data from additional diagnostic tools, or may start certain automatic diagnostic procedures. Alternatively, secondary PCM 114 may provide an alarm indication and request manual intervention to further diagnose problems at the primary processing environment 102.
  • If primary processing environment 102 is not alive, secondary processing environment 104 takes over the processes of primary processing environment 102 (step 310).
  • Secondary processing environment 104 is preferably configured to include the same processes and functionality as primary processing environment 102 for fault tolerance and backup purposes.
  • If primary processing environment 102 goes down, secondary processing environment 104 initializes the necessary processes or applications and substitutes itself and the corresponding processes for those of the downed primary processing environment 102.
  • Secondary processing environment 104 continues performing the processes of primary processing environment 102 until primary processing environment 102 comes back on-line. In this manner, fault tolerance is improved because both processes and machine state are monitored.
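The decision made by the secondary PCM in steps 304 through 310 can be condensed into one function. The following Python sketch is only an illustration: the function name, the string return values, and the reading of "corresponds to the most recent replication interval" as "falls within the last T3 seconds" are assumptions, not details taken from the description.

```python
def secondary_action(timestamps, now, t3, primary_alive):
    """Decide what the secondary PCM does when interval T4 expires.

    Returns "continue", "request_help", or "take_over".
    """
    window_start = now - t3
    # Step 304: is any replicated time-stamp from the latest T3 window?
    fresh = any(ts >= window_start for ts in timestamps)
    if fresh:
        return "continue"  # processes and replication look healthy
    if primary_alive:  # step 306
        return "request_help"  # step 308
    return "take_over"  # step 310


# Hypothetical scenarios with T3 = 20 seconds and the clock at 100.
healthy = secondary_action([95.0, 40.0], now=100.0, t3=20.0, primary_alive=True)
stale_alive = secondary_action([40.0], now=100.0, t3=20.0, primary_alive=True)
stale_down = secondary_action([40.0], now=100.0, t3=20.0, primary_alive=False)
```

The three scenarios yield "continue", "request_help", and "take_over" respectively; how the primary's liveness is actually probed is left outside the sketch.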

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Debugging And Monitoring (AREA)

Abstract

In a primary computer (102), a process control monitor (108) processes and replicates data, which it sends to a primary database (112). Periodically, the data from the primary database is replicated again and sent to a secondary database (120) associated with a secondary computer (104). The secondary computer (104) includes information in the replicated data sent to the secondary database (120). Based on this information, a secondary process control monitor (114) can request help to repair certain processes in the first computer (102), or determine whether the primary processing environment (102) is on-line. If it is not, the secondary processing environment (104) can take over all processes until the primary processing environment (102) is connected again.
PCT/US1998/015857 1997-08-07 1998-07-30 Process control monitoring system and method WO1999008190A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US90864197A 1997-08-07 1997-08-07
US08/908,641 1997-08-07

Publications (1)

Publication Number Publication Date
WO1999008190A1 (fr) 1999-02-18

Family

ID=25426060

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1998/015857 WO1999008190A1 (fr) Process control monitoring system and method 1997-08-07 1998-07-30

Country Status (1)

Country Link
WO (1) WO1999008190A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6732144B1 (en) * 1999-11-19 2004-05-04 Kabushiki Kaisha Toshiba Communication method for data synchronization processing and electronic device therefor

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4751702A (en) * 1986-02-10 1988-06-14 International Business Machines Corporation Improving availability of a restartable staged storage data base system that uses logging facilities
US5404508A (en) * 1992-12-03 1995-04-04 Unisys Corporation Data base backup and recovery system and method
US5537533A (en) * 1994-08-11 1996-07-16 Miralink Corporation System and method for remote mirroring of digital data from a primary network server to a remote network server
US5592618A (en) * 1994-10-03 1997-01-07 International Business Machines Corporation Remote copy secondary data copy validation-audit function

Similar Documents

Publication Publication Date Title
US7814050B2 (en) Disaster recovery
US6134673A (en) Method for clustering software applications
US6202067B1 (en) Method and apparatus for correct and complete transactions in a fault tolerant distributed database system
Kim Highly available systems for database applications
US20010056554A1 (en) System for clustering software applications
EP0481231A2 (fr) Method and system for increasing the operational availability of a system of computer programs operating in a distributed computer system
US7752500B2 (en) Method and apparatus for providing updated processor polling information
KR20000011835A (ko) Method and apparatus for failure detection and recovery with a predetermined replication style for distributed applications in a network
US7730029B2 (en) System and method of fault tolerant reconciliation for control card redundancy
KR20010052972A (ko) Synchronization of processors in a fault-tolerant multi-processor system
JPS6375963A (ja) System recovery method
JP3447347B2 (ja) Failure detection method
WO1999008190A1 (fr) Process control monitoring system and method
Hunter et al. Availability modeling and analysis of a two node cluster
Cisco Operational Traps

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CA JP

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA