JP2005056185A

JP2005056185A - Hierarchical type agent learning method and system

Info

Publication number: JP2005056185A
Application number: JP2003286884A
Authority: JP
Inventors: Yukinori Kakazu; 侑昇嘉数; Hiroko Ishiwaka; 裕子石若
Original assignee: TECHNO FACE KK
Current assignee: TECHNO FACE KK
Priority date: 2003-08-05
Filing date: 2003-08-05
Publication date: 2005-03-03

Abstract

PROBLEM TO BE SOLVED: To provide a system and a control method capable of absorbing ambiguity when the evaluation signal required for learning contains both spatial ambiguity and temporal ambiguity.

SOLUTION: The system comprises at least an educational agent, a learning agent, and an intermediate agent, each implemented on a computer. At a given time, the positions recognized by the agents, the sizes of the agents themselves, and the road width that the learning agent can pass through deviate from the objective agent positions, agent sizes, and passable road width. The agents learn autonomously so that the learning agent is controlled from a start position to a goal position.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to a hierarchical agent learning method.

Problems involving ambiguity, and methods for handling them, have been proposed in the past. For problems with ambiguous boundary conditions, fuzzy theory is well known. Fuzzy theory is a theory for making decisions from ambiguous information on the basis of predicate logic. It is well suited to systematizing the long experience and intuition of skilled operators as control rules in the form of membership functions and if-then rules (Non-Patent Documents 1 and 2).

For problems with ambiguous input data, artificial neural networks (ANNs) are well known. An ANN models the structure and function of a biological nervous system. Its basic element is called a unit. Each input is multiplied by a weight coefficient that depends on the connection state of the network, the weighted inputs are summed, and the unit emits an output pulse when the sum exceeds a threshold. The structure of the neural network is fixed once the number of neurons, their connections, the weight coefficients, and the thresholds are determined. The best-known learning method for ANNs is error backpropagation (BP) (Non-Patent Document 3), which corrects the network weights on the basis of teacher data. Such an ANN can perform linear separation even on ambiguous input data.

Reinforcement learning is a method for learning in problems whose state space contains ambiguity. It is often described as trial-and-error learning: learning proceeds on the basis of rewards, and ambiguity in the state space can be handled.

However, the ambiguity dealt with so far has been ambiguity in boundary conditions, input data, or the state space. These learning settings assume that the evaluation signal, such as a teacher signal or a reward, contains no ambiguity (for example, uncertainty about which location or which point in time the evaluated action belongs to).

Meanwhile, agent architectures aimed at imitating human knowledge and adapting to the environment have so far developed along four broad lines: modular, autonomous, blackboard, and human-interaction architectures.

Representative modular examples are Non-Patent Document 4 on knowledge-based symbolic reasoning systems, Non-Patent Document 5 on beliefs, desires, and intentions, and Non-Patent Document 6 on cooperative mechanisms. Because these methods make decisions on the basis of knowledge, they have difficulty adapting to new problems to which the preset or acquired knowledge is hard to apply, for example a real environment with spatial and temporal ambiguity.

An autonomous example is Non-Patent Document 7 on behavior-oriented control. It makes decisions solely through interaction with the environment rather than from a modular knowledge base. It was applied to real environments, but conversely it has the problem of possessing no knowledge. Because it adapts to the real environment, it can cope with spatial ambiguity to some extent. However, it can be used only for ad hoc control, and it is difficult to reuse what was learned in solving past problems when solving similar problems later.

For this reason, a hybrid architecture combining the modular and autonomous types also exists (Non-Patent Document 8). However, these methods have difficulty performing local learning (learning at a specific location) and global learning (learning over the whole wide-area space to be controlled) at the same time. As a result, with either the autonomous or the hybrid type it is difficult to achieve control by self-learning in a system with spatial ambiguity, where objectively accurate positional relationships are hard to grasp.

Blackboard examples include Non-Patent Documents 9 to 13. In this problem-solving framework, various agents share information on a single blackboard and update it as needed to solve cooperative problems, and the framework can be applied to real environments. However, it has difficulty adapting to evaluation signals that are ambiguous in time and space.

Human-interaction examples include Non-Patent Documents 14 and 15, but the temporal and spatial ambiguity inherent in humans makes interaction with machines and agents difficult.

Non-Patent Document 1: Zadeh, Lotfi, "Fuzzy Sets as the Basis for a Theory of Possibility", Fuzzy Sets and Systems 1:3-28, 1978.
Non-Patent Document 2: Dubois, Didier, and Prade, Henri, "Possibility Theory", Plenum Press, New York, 1988.
Non-Patent Document 3: D. Rumelhart, G. Hinton and R. Williams, "Learning Internal Representations by Error Propagation", in: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, MIT Press, Cambridge (1986).
Non-Patent Document 4: A. Newell, H. A. Simon: "Computer Science as Empirical Inquiry: Symbols and Search", Communications of ACM, Vol. 19, No. 3, pp. 113-126, 1976.
Non-Patent Document 5: Bratman, M. E., Israel, D. J.: "Plans and Resource Bounded Practical Reasoning", Computational Intelligence, Vol. 4, No. 4, pp. 349-355, 1998.
Non-Patent Document 6: Jennings, N.: "Cooperation in Industrial Multi-Agent Systems", World Scientific, 1993.
Non-Patent Document 7: Brooks, R. A.: "A Robust Layered Control System for a Mobile Robot", IEEE Journal of Robotics and Automation, Vol. RA-2, No. 1, March 1986.
Non-Patent Document 8: Ferguson, I. A.: "Toward an Architecture for Adaptive, Rational, Mobile Agents", in E. Werner, Y. Demazeau (eds.), Decentralized A.I. 3, Proceedings of the Third European Workshop on Modelling Autonomous Agents in a Multi-Agent World (MAAMAW'91), pp. 249-262, North-Holland, 1992.
Non-Patent Document 9: Erman, L., Hayes-Roth, F., Lesser, V., Raj Reddy, D.: "The Hearsay-II Speech Understanding System: Integrating Knowledge to Resolve Uncertainty", Computing Surveys, Vol. 12, No. 2, pp. 213-253, 1980.
Non-Patent Document 10: Corkill, D., Lesser, V., Hudlicka, E.: "Unifying Data-Directed and Goal-Directed Control: An Example and Experiments", in Proceedings of AAAI-82, pp. 143-147, 1982.
Non-Patent Document 11: Lesser, V., Corkill, D.: "The Distributed Vehicle Monitoring Testbed: A Tool for Investigating Distributed Problem Solving Network", AI Magazine, Vol. 4, No. 3, pp. 15-33, 1983.
Non-Patent Document 12: Hayes-Roth, B.: "A Blackboard Architecture for Control", Artificial Intelligence, Vol. 26, pp. 251-321, 1985.
Non-Patent Document 13: Decker, K. S., Garvey, A. J., Humphrey, M. A., Lesser, V.: "A Real-Time Control Architecture for an Approximate Processing Blackboard System", International Journal of Pattern Recognition and Artificial Intelligence, Vol. 7, No. 2, pp. 265-284, 1993.
Non-Patent Document 14: Bates, J., Bryan Loyall, A., Scott Reilly, W.: "An Architecture for Action, Emotion, and Social Behavior", in Proceedings of the Fourth European Workshop on Modelling Autonomous Agents in a Multi-Agent World (MAAMAW'92), 1992.
Non-Patent Document 15: J. Bates: "The Role of Emotion in Believable Agents", Communications of the ACM, Vol. 37, No. 7, pp. 122-125, 1994.

As described above, methods such as fuzzy theory and ANNs assume as a learning condition that evaluation signals such as teacher signals and rewards contain no ambiguity. They therefore cannot cope at all when these evaluation signals do contain ambiguity.

Likewise, none of the agent architectures proposed so far addresses the ambiguity of the evaluation signal itself, and none has been able to cope with spatial and temporal ambiguity in the evaluation signal.

An object of the present invention is to provide a system and a control method capable of absorbing ambiguity when the evaluation signal required for learning contains both spatial ambiguity and temporal ambiguity.

To achieve the above and other objects, the present invention provides a hierarchical agent learning method for a system that includes at least an educational agent, a learning agent, and an intermediate agent, each implemented on a computer, and in which, at a given time, the positions recognized by the agents, the sizes of the agents themselves, and the road width that the learning agent can pass through each deviate from the objective agent positions, agent sizes, and passable road width. In this method the agents learn autonomously so as to control the learning agent from a start position to a goal position. The method includes:

a learning agent decision-making step (1120), in which the learning agent decides the action it should take;
an educational agent decision-making step (1120), in which the educational agent decides the action it should take;
an action sending step (130), in which the learning agent and the educational agent each send their decision to the intermediate agent;
a decision step (1140), in which the intermediate agent judges whether the learning agent's decision and the educational agent's decision are the same;
an action instruction step (1150), in which the intermediate agent sends to the learning agent the action to take: the action decided by the learning agent when the two decisions are the same, or a new action determined by the intermediate agent according to a rule when they differ;
an action execution step (1160), in which the learning agent executes that action;
a goal arrival judgment step (1190), in which the learning agent, the intermediate agent, and the educational agent each judge whether the learning agent has reached the goal; and
a learning agent reinforcement learning step (1180) and an educational agent reinforcement learning step (1180), in which the learning agent and the educational agent respectively perform reinforcement learning when the goal has not been reached within a specified number of steps.

This makes it possible to absorb the ambiguity in a system whose evaluation signal contains spatial ambiguity. Furthermore, the reinforcement learning steps (1180) of the learning agent and the educational agent are not limited to the Q-learning and V-learning described later. Any currently available learning method, such as search trees, artificial neural networks, or genetic algorithms, can be used.
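The decide, send, arbitrate, and execute sequence described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `arbitrate` and `run_episode`, the representation of agents as plain callables, and the conflict rule passed in as `rule` are all hypothetical, and the reinforcement learning step (1180) is omitted for brevity.

```python
from typing import Callable, Hashable, Tuple

Action = Hashable


def arbitrate(learner_action: Action, teacher_action: Action,
              rule: Callable[[Action, Action], Action]) -> Action:
    """Intermediate agent (steps 1140-1150): pass the learner's action through
    when both agents agree, otherwise derive a new action from a rule."""
    if learner_action == teacher_action:
        return learner_action
    return rule(learner_action, teacher_action)


def run_episode(state, goal, learner_decide, teacher_decide,
                rule, step, max_steps: int) -> Tuple[object, bool]:
    """Repeat decide -> arbitrate -> execute until the goal or the step limit.

    `step(state, action)` is an assumed environment transition function.
    Returns (final_state, reached_goal).
    """
    for _ in range(max_steps):
        if state == goal:                      # goal arrival judgment (1190)
            return state, True
        action = arbitrate(learner_decide(state),   # decisions (1120)
                           teacher_decide(state), rule)
        state = step(state, action)            # action execution (1160)
    return state, state == goal
```

For example, on a one-dimensional corridor where states are integers and actions are `+1` or `-1`, a rule that prefers the educational agent's proposal still brings the learning agent to the goal when the two agents disagree.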

In a preferred aspect, there is provided a hierarchical agent learning method in which the learning agent decision-making step (1120) includes: a learning agent Q-value table observation step, in which each learning agent observes the values of its own Q-value table within a set observation range; an action selection step, in which the learning agent selects the action direction with the highest Q value with probability (1 − ε) and a random action direction with probability ε; and an action synthesis step, in which the intermediate agent synthesizes the action directions selected by the learning agents.
In this way, Q-learning, which has a proven record as a learning mechanism, can be used to easily search for the route most likely to reach the goal.
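The action selection rule above is the standard ε-greedy policy. A minimal sketch follows, assuming the observed Q values are available as a dictionary keyed by direction name; the function name `epsilon_greedy` and the direction labels are illustrative, not taken from the patent.

```python
import random


def epsilon_greedy(q_values: dict, epsilon: float, rng: random.Random) -> str:
    """Return the highest-Q direction with probability 1 - epsilon,
    and a uniformly random direction with probability epsilon."""
    directions = list(q_values)
    if rng.random() < epsilon:          # explore
        return rng.choice(directions)
    return max(directions, key=q_values.get)   # exploit
```

With `epsilon = 0` the choice is always the direction with the largest Q value, and with `epsilon = 1` it is always random, so ε controls the balance between exploiting what has been learned and exploring.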

In another preferred aspect, there is provided a hierarchical agent learning method in which the educational agent decision-making step (1120) includes: an educational agent V-value table observation step, in which the educational agent observes the values of its own V-value table within a set observation range; and an action selection step, in which the educational agent selects the action direction with the highest V value with probability (1 − ε) and a random action direction with probability ε.
In this way, V-learning, which has a proven record as a learning mechanism, can be used to easily search for the route most likely to reach the goal.

In yet another preferred aspect, there is provided a hierarchical agent learning method in which the action instruction step (1150) includes: a first action generation step, in which the intermediate agent selects, from among the actions indicated by the educational agent, the action of the learning agent with the highest Q value; and a second action generation step, in which, if the goal has not been reached after a predetermined number of trials, the direction and Q value indicated by the learning agent are taken as a first vector, the direction and V value indicated by the educational agent are taken as a second vector, and the intermediate agent selects the direction obtained by combining the first and second vectors.
In this way, the learning results of both the educational agent and the learning agent are used effectively, enabling efficient route search.
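The second action generation step can be read as a vector sum: each agent's proposal is a vector whose direction is the proposed heading and whose length is its Q or V value. The sketch below follows that reading. Representing headings as angles in radians and the name `combine_directions` are assumptions made for illustration, and the source does not say how negative values or a zero-length sum should be handled.

```python
import math


def combine_directions(learner_angle: float, q_value: float,
                       teacher_angle: float, v_value: float) -> float:
    """Return the heading (radians) of the sum of two vectors:
    (learner_angle, length q_value) and (teacher_angle, length v_value)."""
    x = q_value * math.cos(learner_angle) + v_value * math.cos(teacher_angle)
    y = q_value * math.sin(learner_angle) + v_value * math.sin(teacher_angle)
    # atan2 returns 0.0 for a zero-length sum; a real system would need a rule here.
    return math.atan2(y, x)
```

When both agents are equally confident, the combined heading bisects the two proposals. When one value is much larger, the result leans toward that agent's direction.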

In still another preferred aspect, there is provided a hierarchical agent learning method in which the learning agent reinforcement learning step (1180) includes: a first reward granting step, in which, when a learning agent reaches the goal, the agents other than that learning agent and the external environment give it a first predetermined reward; a second reward granting step, in which, when a learning agent has neither reached the goal nor collided with an obstacle, the agents other than that learning agent and the external environment give it a second predetermined reward; a third reward granting step, in which, when a learning agent has not reached the goal and has collided with an obstacle, the agents other than that learning agent and the external environment give it a third predetermined reward; and a Q-value update step, in which, after each reward granting step, each learning agent updates all the Q values it observed according to the respective reward value.

In another preferred aspect, there is provided a hierarchical agent learning method in which the educational agent reinforcement learning step (1180) includes: a first reward granting step, in which, when a learning agent reaches the goal, the external environment gives the educational agent a first predetermined reward; a second reward granting step, in which, when a learning agent has neither reached the goal nor collided with an obstacle, the external environment gives the educational agent a second predetermined reward; a third reward granting step, in which, when a learning agent has not reached the goal and has collided with an obstacle, the external environment gives the educational agent a third predetermined reward; and a V-value update step, in which, after each reward granting step, the educational agent updates all the V values it observed according to the respective reward value.
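The three reward granting steps amount to a three-way case split on the outcome of an action. A minimal sketch follows. The function name `reward_for_outcome` and the default reward values are hypothetical, since the source only says that each reward is "predetermined".

```python
def reward_for_outcome(reached_goal: bool, hit_obstacle: bool,
                       goal_reward: float = 1.0,
                       step_reward: float = 0.0,
                       collision_reward: float = -1.0) -> float:
    """Map an outcome to the first, second, or third predetermined reward.

    The default values are placeholders, not figures from the patent.
    """
    if reached_goal:
        # First reward. The source does not define "goal and collision"
        # together; this sketch gives the goal priority.
        return goal_reward
    if hit_obstacle:
        return collision_reward   # third reward: no goal, collision
    return step_reward            # second reward: no goal, no collision
```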

In yet another preferred aspect, there is provided a hierarchical agent learning method in which the Q-value update step includes a first Q-value update step, in which the learning agent updates the Q value according to

$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s_{t+1}, a_{t+1})\right]$$

Here $s$ is the state, $a$ is the action, and $Q(s,a)$ is the Q value when the state is $s$ and the action is $a$. $\alpha$ is the step-size parameter and $\gamma$ is the discount rate. Both lie in the range $0 \le \alpha, \gamma \le 1$, and $\alpha$ need not equal $\gamma$. $r$ is the reward given by the external environment, normally negative for a penalty and positive for a reward. $Q(s_{t+1}, a_{t+1})$ is the Q value for state $s_{t+1}$ and action $a_{t+1}$, that is, for a state-action pair that can be taken next. $\max_{a'}$ means selecting, among all state-action pairs that can be taken next, the action that maximizes the Q value. $\max_{a'} Q(s_{t+1}, a_{t+1})$ is therefore the Q value obtained when the action that maximizes the Q value is taken among all state-action pairs possible at the next step.
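The update rule above can be written as a short function. This is a sketch under assumptions: the Q table is a `defaultdict` keyed by `(state, action)` with unseen entries equal to 0, and the name `q_update` and its parameter names are illustrative.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, next_actions,
             alpha: float, gamma: float) -> float:
    """Apply Q(s,a) <- (1 - alpha) Q(s,a) + alpha [r + gamma * max_a' Q(s', a')].

    `q` maps (state, action) pairs to values; `next_actions` lists the
    actions available in `next_state`. Returns the updated Q(s, a).
    """
    # Best value reachable from the next state (0.0 if no action is available).
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    q[(state, action)] = ((1 - alpha) * q[(state, action)]
                          + alpha * (reward + gamma * best_next))
    return q[(state, action)]
```

As a worked instance, with $Q(s,a) = 0$, $r = 1$, $\alpha = 0.5$, $\gamma = 0.9$, and a best next value of 2, the new value is $0.5 \cdot 0 + 0.5 \cdot (1 + 0.9 \cdot 2) = 1.4$.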

In another preferred aspect, there is provided a hierarchical agent learning method in which the V-value update step includes a first V-value update step, in which the educational agent updates the V value according to

$$V(s_t) \leftarrow V(s_t) + \alpha\left[r_t + \gamma V(s_{t+1}) - V(s_t)\right]$$

Here $s_t$ is the state at time $t$, $s_{t+1}$ is the state at time $t+1$, $V(s_t)$ is the V value in state $s_t$, and $V(s_{t+1})$ is the V value in state $s_{t+1}$. $\alpha$ is the step-size parameter and $\gamma$ is the discount rate. Both lie in the range $0 \le \alpha, \gamma \le 1$, and $\alpha$ need not equal $\gamma$. $r_t$ is the reward given by the external environment, normally negative for a penalty and positive for a reward. In this embodiment, $V(s)$ is a two-dimensional array $V(x, y)$, where $(x, y)$ are the center coordinates of the educational agent.
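The V-value rule is a one-step temporal-difference update on a grid indexed by the educational agent's center coordinates. A minimal sketch, assuming the table is a list of lists indexed as `v[x][y]`; the name `v_update` is illustrative.

```python
def v_update(v, pos, reward, next_pos, alpha: float, gamma: float) -> float:
    """Apply V(s_t) <- V(s_t) + alpha [r_t + gamma V(s_{t+1}) - V(s_t)].

    `v` is a 2-D table indexed as v[x][y]; `pos` and `next_pos` are the
    (x, y) cells occupied at times t and t+1. Returns the updated V(s_t).
    """
    x, y = pos
    nx, ny = next_pos
    td_error = reward + gamma * v[nx][ny] - v[x][y]
    v[x][y] += alpha * td_error
    return v[x][y]
```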

According to another feature of the present invention, there is provided a hierarchical agent learning method for a system that includes an educational agent, which is a human or a robot, and at least a learning agent and an intermediate agent, each implemented on a computer, and in which the time recognized by each agent located at a given spatial position deviates from the objective time. In this method the agents learn autonomously so as to control the learning agent from a start position to a goal position. The method includes:

an environment observation step (2715), in which the learning agent observes the environment;
an obstacle check step (2717), in which the learning agent judges whether an obstacle exists ahead;
a first reward update step (2719), in which, when the obstacle check step finds an obstacle, the external environment or the educational agent gives the learning agent a negative reward;
a first reinforcement learning step (2721), in which the learning agent performs reinforcement learning, after the first reward update step when an obstacle was found and otherwise after the obstacle check step;
a decision-making step (2723), in which the learning agent decides the direction in which to move;
an action execution step (2725), in which the learning agent acts with the movement amount decided in the decision-making step;
a penalty judgment step (2727), in which the learning agent judges whether, as a result of the action, it has received a negative reward from the educational agent;
a second reward update step (2729), in which, when the learning agent has received the negative reward, the human or robot educational agent, which is recognized as part of the environment, is prompted to enter the negative reward value into the reward value already set in the learning agent;
a second reinforcement learning step (2731), in which the learning agent performs reinforcement learning, after the second reward update step when the penalty judgment step found a negative reward and otherwise after the penalty judgment step;
a goal arrival judgment step (2737, 2735), in which the learning agent judges whether it has reached the goal;
a reward update rule change step (2741), in which, when the learning agent has not reached the goal, the above steps are repeated a predetermined number of times and, if the goal has still not been reached, the intermediate agent changes the timing at which rewards are given to the learning agent in the first and second reward update steps; and
a history learning step (2745), in which the learning agent learns from its history.

Furthermore, there is provided a hierarchical agent learning method in which the decision-making step (2723) includes: a Q-table observation step, in which the learning agent observes the values of the Q table; and an action selection step, in which the learning agent follows a greedy policy and selects only the action with the highest Q value.

According to another preferred aspect, there is provided a hierarchical agent learning method in which the action execution step (2725) includes: a rotation step, in which the learning agent is rotated by a predetermined angle; and a movement step, in which, after the rotation is completed, the learning agent is advanced by a predetermined movement amount.
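The rotate-then-advance action can be sketched as a pose update in the plane. The pose representation `(x, y, heading)` with the heading in radians, and the name `rotate_then_advance`, are assumptions made for this illustration.

```python
import math


def rotate_then_advance(x: float, y: float, heading: float,
                        turn: float, distance: float):
    """Rotate by `turn` radians, then move `distance` along the new heading.

    Returns the new pose (x, y, heading), with heading wrapped to [0, 2*pi).
    """
    new_heading = (heading + turn) % (2 * math.pi)
    return (x + distance * math.cos(new_heading),
            y + distance * math.sin(new_heading),
            new_heading)
```

Completing the rotation before moving means the displacement depends only on the final heading, which keeps each action a single well-defined state transition.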

According to yet another preferred aspect, there is provided a hierarchical agent learning method in which the first reinforcement learning step (2721) and/or the second reinforcement learning step (2731) includes a Q-value update step, in which the Q value is updated in the range $t_p < t < t_n$ according to

$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s_{t+1}, a_{t+1})\right]$$

Here $s$ is the state, $a$ is the action, and $Q(s,a)$ is the Q value when the state is $s$ and the action is $a$. $\alpha$ is the step-size parameter and $\gamma$ is the discount rate. Both lie in the range $0 \le \alpha, \gamma \le 1$, and $\alpha$ need not equal $\gamma$. $r$ is the reward given by the external environment, normally negative for a penalty and positive for a reward. $Q(s_{t+1}, a_{t+1})$ is the Q value for state $s_{t+1}$ and action $a_{t+1}$, that is, for a state-action pair that can be taken next. $\max_{a'}$ means selecting, among all state-action pairs that can be taken next, the action that maximizes the Q value. $\max_{a'} Q(s_{t+1}, a_{t+1})$ is therefore the Q value obtained when the action that maximizes the Q value is taken among all state-action pairs possible at the next step.
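One way to read the range $t_p < t < t_n$ is that a reward whose timing is uncertain is credited to every state-action pair visited inside that time window. The sketch below follows this interpretation, which is an assumption rather than something the text spells out. The history format, the name `update_window`, and the use of a plain dictionary for the Q table are likewise hypothetical.

```python
def update_window(q: dict, history, reward: float, t_p: int, t_n: int,
                  alpha: float, gamma: float, actions) -> list:
    """Apply the Q update to every step t with t_p < t < t_n.

    `history[t]` is the (state, action, next_state) triple recorded at time t.
    Returns the list of time steps that were updated.
    """
    updated = []
    for t, (s, a, s_next) in enumerate(history):
        if t_p < t < t_n:
            best_next = max(q.get((s_next, b), 0.0) for b in actions)
            q[(s, a)] = ((1 - alpha) * q.get((s, a), 0.0)
                         + alpha * (reward + gamma * best_next))
            updated.append(t)
    return updated
```

Under this reading, a wider window spreads a delayed penalty over more of the recent trajectory, which is how temporal ambiguity in the evaluation signal would be absorbed.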

更に別の好ましい態様では、報酬更新ルール変更工程（2741）に、Ｑ値の時間的更新範囲ｔ_p、ｔ_nを照合する、時間的更新範囲照合工程と、ｔ_p←ｔ_p−ｉ，ｔ_n←ｔ_n＋ｊ
（ここで、ｉ,jは正の整数とし、t_p<t_nを満たす。）なる式に基づいて前記パラメータｔ_p、ｔ_nを加減することによって、更新範囲を変更する、更新範囲変更工程と、が含まれる、階層型エージェント学習方法が提供される。 In yet another preferred embodiment, there is provided a hierarchical agent learning method in which the reward update rule changing step (2741) includes a temporal update range collating step of collating the temporal update range $t_p$, $t_n$ of the Q value, and an update range changing step of changing the update range by adjusting the parameters $t_p$ and $t_n$ according to
$$t_p \leftarrow t_p - i, \qquad t_n \leftarrow t_n + j$$
(where $i$ and $j$ are positive integers and $t_p < t_n$ holds).
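The widening of the update range can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `UpdateWindow` and its methods are hypothetical, and the window is assumed to be a pair of integer time steps with exclusive bounds, matching the condition t_p < t < t_n.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpdateWindow:
    """Temporal update range (t_p, t_n); hypothetical helper for illustration."""
    t_p: int  # lower bound (exclusive)
    t_n: int  # upper bound (exclusive)

    def widen(self, i: int, j: int) -> "UpdateWindow":
        """Apply t_p <- t_p - i and t_n <- t_n + j for positive integers i, j."""
        if i <= 0 or j <= 0:
            raise ValueError("i and j must be positive integers")
        return UpdateWindow(self.t_p - i, self.t_n + j)

    def contains(self, t: int) -> bool:
        """True when t_p < t < t_n, i.e. time step t falls inside the range."""
        return self.t_p < t < self.t_n


window = UpdateWindow(t_p=9, t_n=11)  # only t = 10 lies inside the range
wider = window.widen(i=2, j=1)        # range becomes 7 < t < 12
```

Because i and j are positive, each change can only enlarge the range, so t_p < t_n continues to hold.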

更に別の好ましい態様では、前記履歴学習工程（2745）に、前記学習エージェントに対する前記マイナスの報酬が入った状態、行動、の対、の時刻ｔを、前記中間エージェントが照合する、時刻照合工程と、前記学習エージェントが一度もゴールに到達しなかったかを、前記中間エージェントが判断する、ゴール到達判断工程と、前記ゴール到達判断工程で、一度もゴールに到達しなかったと判断された場合に、前記中間エージェントが、現在の試行で更新された前記学習エージェントが持つＱ値を、一つ前の試行開始時の状態に戻す、第１のＱ値復帰工程と、前記ゴール到達判断工程で、過去にゴールに到達したと判断された場合に、前記中間エージェントが、現在の試行で更新された前記学習エージェントが持つＱ値を、直近にゴールに到達した試行終了後のＱ値の状態に戻してから、前記報酬更新ルール変更工程でで変更された新たな更新ルールで、次の試行のQ値を更新する第２のＱ値復帰工程と、前記第１または第２のＱ値復帰工程の後に、
Ｑ（ｓ，ａ）←（１−α）Ｑ（ｓ，ａ）＋α[ｒ＋γ・ｍａｘ_a’・Ｑ（ｓ_t+1，ａ_t+1）]
（ここで、ｓは状態、ａは行動、Ｑ（ｓ，ａ）は、状態がｓ、行動がａのときのＱ値、αはステップサイズパラメータ、γは割引率、α、γの値の範囲は0≦α、γ≦1となり、α=γである必要は無い、rは外部環境が与える報酬のことで、ペナルティの場合は通常マイナス、報酬の場合はプラスの値を与える、Ｑ（s_t+1,a_t+1）は、状態がｓ_t+1、行動がａ_t+1のときのＱ値を意味する。「状態がｓ_t+1、行動がａ_t+1のとき」とは、次に取りうる状態、行動、の対を意味する。max_a'は、次に取ることが可能な全ての状態、行動、の対において最もＱ値が最大になるような行動を選択することを意味する。max_a'Q(s_t+1,a_t+1)は、次のステップで取りうる全ての状態、行動、の対においてＱ値が最大となるような行動ａを取ったときのＱ値である。）
なる式に従って、前記学習エージェントが、Ｑ値を、更新されたｔｐ＜ｔ＜ｔｎの範囲で更新する、Ｑ値更新工程と、が含まれる、階層型エージェント学習方法が提供される。 In still another preferred embodiment, there is provided a hierarchical agent learning method in which the history learning step (2745) includes: a time collating step in which the intermediate agent collates the time t of the state-action pair at which the negative reward was given to the learning agent; a goal arrival determination step in which the intermediate agent determines whether the learning agent has never reached the goal; a first Q value return step in which, when it is determined in the goal arrival determination step that the goal has never been reached, the intermediate agent returns the Q values of the learning agent updated in the current trial to their state at the start of the previous trial; a second Q value return step in which, when it is determined in the goal arrival determination step that the goal was reached in the past, the intermediate agent returns the Q values of the learning agent updated in the current trial to their state at the end of the most recent trial that reached the goal, after which the Q values of the next trial are updated with the new update rule changed in the reward update rule changing step; and a Q value update step in which, after the first or second Q value return step, the learning agent updates the Q values within the updated range $t_p < t < t_n$ according to
$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s_{t+1}, a_{t+1})\right]$$
(where $s$ is the state; $a$ is the action; $Q(s,a)$ is the Q value when the state is $s$ and the action is $a$; $\alpha$ is the step size parameter and $\gamma$ is the discount rate, with $0 \le \alpha, \gamma \le 1$, and $\alpha = \gamma$ is not required; $r$ is the reward given by the external environment, normally negative for a penalty and positive for a reward; $Q(s_{t+1}, a_{t+1})$ is the Q value when the state is $s_{t+1}$ and the action is $a_{t+1}$, that is, for a state-action pair that can be taken next; $\max_{a'}$ means selecting, among all state-action pairs that can be taken next, the action with the largest Q value; and $\max_{a'} Q(s_{t+1}, a_{t+1})$ is the Q value obtained when that action is taken in the next step.)
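The two Q value return steps amount to keeping snapshots of the Q table and restoring one of them. The sketch below is one possible reading under stated assumptions: the class `QHistory` and its method names are hypothetical, the Q table is assumed to be a plain dictionary, and "the start of the previous trial" is interpreted as the start of the trial that has just ended.

```python
import copy


class QHistory:
    """Snapshots the intermediate agent could keep for rollback (hypothetical)."""

    def __init__(self, q_table):
        self.trial_start = copy.deepcopy(q_table)  # Q values when the trial began
        self.last_goal = None                      # Q values after the latest successful trial

    def begin_trial(self, q_table):
        self.trial_start = copy.deepcopy(q_table)

    def end_trial(self, q_table, reached_goal):
        if reached_goal:
            self.last_goal = copy.deepcopy(q_table)

    def restore(self):
        """Return the Q table from which the next trial should start."""
        # First return step: the goal was never reached, so undo the trial's updates.
        # Second return step: fall back to the state after the latest successful trial.
        source = self.trial_start if self.last_goal is None else self.last_goal
        return copy.deepcopy(source)


q = {("s0", "left"): 0.0}
history = QHistory(q)
history.begin_trial(q)
q[("s0", "left")] = -1.0                  # update made during a failed trial
history.end_trial(q, reached_goal=False)
q = history.restore()                     # the failed trial's update is discarded
```

After the restore, the learning agent would apply the Q value update again, this time over the widened range t_p < t < t_n.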

本発明の別の特徴によれば、コンピュータによって機能する、少なくとも、教育エージェント、学習エージェント、および中間エージェントを含み、所定の時刻における、各エージェントの認識する位置、各エージェント自身の大きさ、および、学習エージェントが通過可能な道幅と、客観的な各エージェントの位置、各エージェント自身の大きさ、および、学習エージェントが通過可能な道幅、との間にそれぞれ、ずれが存在し、各エージェントが自律的に学習することによって、前記学習エージェントをスタート位置からゴール位置まで制御するシステムであって、前記学習エージェントが、自己が採るべき行動を意思決定する学習エージェント意思決定手段と、前記学習エージェントの意思決定を前記学習エージェントが、前記中間エージェントに送る学習エージェント行動送信手段とを有し、前記教育エージェントが、自己が採るべき行動を意思決定する教育エージェント意思決定手段と、前記教育エージェントの意思決定を、前記教育エージェントが、前記中間エージェントに送る行動送信手段と、を有し、前記中間エージェントが、前記学習エージェントの意思決定と前記教育エージェントの意思決定を同じかどうかを判断する意思決定手段と、前記学習エージェントの意思決定と前記教育エージェントの意思決定が同じ場合には、前記学習エージェントが意思決定した採るべき行動を前記学習エージェントに送り、そして、前記学習エージェントの意思決定と前記教育エージェントの意思決定が異なる場合には、前記中間エージェントが規則に従って、採るべき新たな行動を、前記学習エージェントに送る行動指示手段と、を有し、前記学習エージェントが更に、前記採るべき行動を実行する、行動実行手段を有し、前記学習エージェント、前記中間エージェント、前記教育エージェントが更に、それぞれ、前記学習エージェントがゴールに到達したか否かを判断する、ゴール到達判断手段を有し、前記学習エージェントおよび前記教育エージェントが更に、前記学習エージェントが指定されたステップ数内にゴールに到達していない場合、それぞれ、前記学習エージェント及び前記教育エージェントが強化学習を行う、学習エージェント強化学習実行手段および教育エージェント強化学習実行手段を有する、階層型エージェント学習システムが提供される。
このようにすることで、評価信号に空間的あいまいさが含まれる系に対して、あいまいさを吸収することが可能となる。更に、学習エージェント強化学習実行工程（1180）および教育エージェント強化学習実行工程（1180）は、後述のＱ-learningやV-learningのみならず、探索木、人工ニューラルネット、遺伝的アルゴリズム等の現在利用可能な全ての学習方法が利用可能である。 According to another feature of the present invention, there is provided a system that includes at least an educational agent, a learning agent, and an intermediate agent, each functioning by a computer, in which, at a predetermined time, a deviation exists between the position recognized by each agent, the size of each agent itself, and the road width passable by the learning agent on the one hand, and the objective position of each agent, the objective size of each agent, and the objective road width passable by the learning agent on the other, and which controls the learning agent from a start position to a goal position through autonomous learning by each agent. The learning agent has learning agent decision making means for deciding the action it should take, and learning agent action transmitting means by which the learning agent sends its decision to the intermediate agent. The educational agent has educational agent decision making means for deciding the action it should take, and action transmitting means by which the educational agent sends its decision to the intermediate agent. The intermediate agent has decision making means for determining whether the decision of the learning agent and the decision of the educational agent are the same, and action instruction means that, when the two decisions are the same, sends the action decided by the learning agent to the learning agent and, when the two decisions differ, sends to the learning agent a new action to be taken, determined by the intermediate agent according to a rule. The learning agent further has action execution means for executing the action to be taken. The learning agent, the intermediate agent, and the educational agent each further have goal arrival judging means for judging whether the learning agent has reached the goal. The learning agent and the educational agent further have, respectively, learning agent reinforcement learning execution means and educational agent reinforcement learning execution means by which the learning agent and the educational agent perform reinforcement learning when the learning agent has not reached the goal within a designated number of steps. A hierarchical agent learning system having these means is provided.
In this way, ambiguity can be absorbed in a system whose evaluation signal contains spatial ambiguity. Furthermore, the learning agent reinforcement learning execution step (1180) and the educational agent reinforcement learning execution step (1180) can use not only Q-learning and V-learning, described later, but any currently available learning method, such as search trees, artificial neural networks, and genetic algorithms.
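The role of the intermediate agent in this arrangement can be sketched as a small arbitration function. This is a minimal illustration only: the function names are hypothetical, actions are assumed to be comparable values such as direction labels, and `prefer_teacher` is a placeholder rule invented for the example, not a rule stated in the text.

```python
def arbitrate(learner_action, teacher_action, rule):
    """Pass the learning agent's action through when both agents agree;
    otherwise let the supplied rule produce a new action."""
    if learner_action == teacher_action:
        return learner_action
    return rule(learner_action, teacher_action)


def prefer_teacher(learner_action, teacher_action):
    # Placeholder rule for illustration only.
    return teacher_action


same = arbitrate("north", "north", prefer_teacher)      # agreement: passed through
resolved = arbitrate("north", "east", prefer_teacher)   # disagreement: rule decides
```

Keeping the rule as a parameter reflects that the text leaves the concrete rule to later embodiments.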

更に、前記学習エージェント意思決定手段が、前記各学習エージェントが、設定された観測範囲で自分自身が持っているＱ値テーブルの値を観測する、学習エージェントＱ値テーブル値観測手段と、前記学習エージェントが、確率（１−ε）でＱ値の最も高い行動方向を選択し、確率εでランダムな行動方向を選択する、行動選択手段と、前記中間エージェントによって合成された、前記各学習エージェントの選択した行動の方向を、学習エージェント全体の選択した行動の方向とする、行動合成手段、を有する、階層型エージェント学習システムが提供される。
このようにすることで、学習機構として実績のあるＱ−learningを用いて、最もゴール到達の可能性の高い経路を容易に探索できる。 Furthermore, there is provided a hierarchical agent learning system in which the learning agent decision making means has: learning agent Q value table observing means by which each learning agent observes the values of its own Q value table within a set observation range; action selecting means by which the learning agent selects the action direction with the highest Q value with probability (1−ε) and a random action direction with probability ε; and action synthesizing means for taking the direction synthesized by the intermediate agent from the action directions selected by the individual learning agents as the action direction selected by the learning agents as a whole.
In this way, the route most likely to reach the goal can be found easily using Q-learning, which has a proven record as a learning mechanism.
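The action selection described here is the usual ε-greedy rule. A minimal sketch follows, assuming the observed Q values are available as a dictionary from action direction to Q value; the function name and the `rng` parameter are illustrative choices, not part of the original text.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the action with the highest Q value.

    q_values: dict mapping action -> Q value for the observed state.
    rng: object providing random() and choice(), e.g. random.Random(seed).
    """
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=q_values.get)


q_values = {"N": 0.1, "E": 0.9, "S": -0.5, "W": 0.0}
greedy_choice = epsilon_greedy(q_values, epsilon=0.0)  # always the best action
```

Passing a seeded `random.Random` instance as `rng` makes runs reproducible, which helps when comparing trials.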

また別の好ましい態様によれば、前記教育エージェント意思決定手段が、前記教育エージェントが、設定された観測範囲で自分自身が持っているＶ値テーブルの値を観測する、教育エージェントＶ値テーブル値観測手段と、前記教育エージェントが、確率（１−ε）でＶ値の最も高い行動方向を選択し、確率εでランダムな行動方向を選択する、行動選択手段と、を有する、階層型エージェント学習システムが提供される。
このようにすることで、学習機構として実績のあるＶ−learningを用いて、最もゴール到達の可能性の高い経路を容易に探索できる。 According to another preferred aspect, there is provided a hierarchical agent learning system in which the educational agent decision making means has: educational agent V value table observing means by which the educational agent observes the values of its own V value table within a set observation range; and action selecting means by which the educational agent selects the action direction with the highest V value with probability (1−ε) and a random action direction with probability ε.
In this way, the route most likely to reach the goal can be found easily using V-learning, which has a proven record as a learning mechanism.

別の好ましい態様によれば、前記行動指示手段が、教育エージェントが示した行動の中で、最も高いＱ値の学習エージェントの行動を、前記中間エージェントが選択する、第１の行動生成手段と、所定の試行回数後にゴールに到達していない場合には、前記学習エージェントの示す方向とＱ値を第１のベクトルとし、教育エージェントが示す方向とＶ値を第２のベクトルとし、前記中間エージェントが第１および第２のベクトルを合成した方向を選択する、第２の行動生成手段と、を有する、階層型エージェント学習システムが提供される。
このようにすることで、教育エージェントと学習エージェントの学習成果の双方を有効に利用して効率的な経路探索が可能である。 According to another preferred aspect, there is provided a hierarchical agent learning system in which the action instruction means has: first action generation means by which the intermediate agent selects, from among the actions indicated by the educational agent, the learning agent's action with the highest Q value; and second action generation means by which, when the goal has not been reached after a predetermined number of trials, the intermediate agent takes the direction indicated by the learning agent and its Q value as a first vector and the direction indicated by the educational agent and its V value as a second vector, and selects the direction obtained by combining the first and second vectors.
In this way, an efficient route search is possible by making effective use of the learning results of both the educational agent and the learning agent.
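The second action generation means combines two vectors, each given by a direction and a value. The sketch below assumes that directions are angles in radians and that the Q and V values are used directly as non-negative vector lengths; the text does not say how negative values are treated, and the function name is hypothetical.

```python
import math


def combine_directions(learner_angle, q_value, teacher_angle, v_value):
    """Add the vectors (learner_angle, q_value) and (teacher_angle, v_value)
    and return the direction of their sum in radians."""
    x = q_value * math.cos(learner_angle) + v_value * math.cos(teacher_angle)
    y = q_value * math.sin(learner_angle) + v_value * math.sin(teacher_angle)
    return math.atan2(y, x)


# Learning agent points along +x, educational agent along +y, equal weights:
# the combined direction lies halfway between the two.
combined = combine_directions(0.0, 1.0, math.pi / 2, 1.0)
```

With unequal values the result leans toward the agent with the larger value, so the more confident estimate dominates.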

また、別の態様によれば、前記学習エージェント強化学習実行手段が、各学習エージェントがゴールに到達した場合に、各学習エージェント以外のエージェントおよび外部環境からの各学習エージェントへの第１の所定の報酬を受け取る、第１の報酬受領手段と、各学習エージェントがゴールに到達せず、かつ、障害物に衝突しなかった場合に、各学習エージェント以外のエージェントおよび外部環境からの各学習エージェントへの第２の所定の報酬を受け取る、第２の報酬受領手段と、各学習エージェントがゴールに到達せず、かつ、障害物に衝突した場合に、各学習エージェント以外のエージェントおよび外部環境からの各学習エージェントへの第３の所定の報酬を受け取る、第３の報酬受領手段と、前記各報酬付与手段によって報酬を受け取った後に、それぞれの報酬値に応じて、前記各学習エージェントが、観測した全てのＱ値を更新する、Ｑ値更新手段、を有する、階層型エージェント学習システムが提供される。 According to another aspect, there is provided a hierarchical agent learning system in which the learning agent reinforcement learning execution means has: first reward receiving means by which, when a learning agent has reached the goal, that learning agent receives a first predetermined reward from the agents other than itself and from the external environment; second reward receiving means by which, when a learning agent has neither reached the goal nor collided with an obstacle, that learning agent receives a second predetermined reward from the agents other than itself and from the external environment; third reward receiving means by which, when a learning agent has not reached the goal and has collided with an obstacle, that learning agent receives a third predetermined reward from the agents other than itself and from the external environment; and Q value updating means by which, after a reward has been received through these means, each learning agent updates all the Q values it has observed according to the respective reward value.
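The three reward cases can be written as a single selection function. In this sketch the numeric values are placeholders chosen only for illustration, since the text calls the rewards "predetermined" without giving numbers; the function and parameter names are likewise hypothetical.

```python
def reward_for(reached_goal, collided, r_goal=1.0, r_step=0.0, r_collision=-1.0):
    """Select the reward for one step.

    r_goal:      first reward  (goal reached)
    r_step:      second reward (no goal, no collision)
    r_collision: third reward  (no goal, collision with an obstacle)
    The default values are illustrative assumptions.
    """
    if reached_goal:
        return r_goal
    if collided:
        return r_collision
    return r_step


step_reward = reward_for(reached_goal=False, collided=True)  # third case
```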

更に別の態様によれば、前記教育エージェント強化学習実行手段が、各学習エージェントがゴールに到達した場合に、外部環境からの教育エージェントに対する第１の所定の報酬を受け取る、第１の報酬受領手段と、各学習エージェントがゴールに到達せず、かつ、障害物に衝突しなかった場合に、外部環境からの、教育エージェントに対する第２の所定の報酬を受領する、第２の報酬受領手段と、各学習エージェントがゴールに到達せず、かつ、障害物に衝突した場合に、外部環境からの、教育エージェントに対する第３の所定の報酬を受領する、第３の報酬受領手段と、前記各報酬付与手段によって報酬を受け取った後に、前記教育エージェントがそれぞれの報酬値に応じて、観測した全てのＶ値を更新する、Ｖ値更新手段、を有する、階層型エージェント学習システムが提供される。 According to yet another aspect, there is provided a hierarchical agent learning system in which the educational agent reinforcement learning execution means has: first reward receiving means for receiving a first predetermined reward for the educational agent from the external environment when each learning agent has reached the goal; second reward receiving means for receiving a second predetermined reward for the educational agent from the external environment when each learning agent has neither reached the goal nor collided with an obstacle; third reward receiving means for receiving a third predetermined reward for the educational agent from the external environment when each learning agent has not reached the goal and has collided with an obstacle; and V value updating means by which, after a reward has been received through these means, the educational agent updates all the V values it has observed according to the respective reward value.

また、別の態様によれば、前記Ｑ値更新手段が、
Q(s,a)←(1-α)Q(s,a)+α[r+γmax_a'Q(s_t+1,a_t+1)]
（ここで、ｓは状態、ａは行動、Ｑ（ｓ，ａ）は、状態がｓ、行動がａのときのＱ値、αはステップサイズパラメータ、γは割引率、α、γの値の範囲は0≦α、γ≦1であり、α=γである必要は無い、rは外部環境が与える報酬のことで、ペナルティの場合は通常マイナス、報酬の場合はプラスの値を与る、Ｑ（s_t+1,a_t+1）は、状態がｓ_t+1、行動がａ_t+1のときのＱ値を意味する。「状態がｓ_t+1、行動がａ_t+1のとき」とは、次に取りうる状態、行動、の対を意味する。max_a'は、次に取ることが可能な全ての状態、行動、の対において、最もＱ値が最大になるような行動を選択することを意味する。ｍａｘ_a'Ｑ(s_t+1,a_t+1)は、次のステップで取りうる全ての状態、行動、の対において、Ｑ値が最大となるような行動ａを取ったときのＱ値である。）
なる式にしたがって、前記学習エージェントが、Ｑ値を更新する、第１のＱ値更新手段を有する、階層型エージェント学習システムが提供される。 According to another aspect, the Q value updating means includes
$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s_{t+1}, a_{t+1})\right]$$
(where $s$ is the state; $a$ is the action; $Q(s,a)$ is the Q value when the state is $s$ and the action is $a$; $\alpha$ is the step size parameter and $\gamma$ is the discount rate, with $0 \le \alpha, \gamma \le 1$, and $\alpha = \gamma$ is not required; $r$ is the reward given by the external environment, normally negative for a penalty and positive for a reward; $Q(s_{t+1}, a_{t+1})$ is the Q value when the state is $s_{t+1}$ and the action is $a_{t+1}$, that is, for a state-action pair that can be taken next; $\max_{a'}$ means selecting, among all state-action pairs that can be taken next, the action with the largest Q value; and $\max_{a'} Q(s_{t+1}, a_{t+1})$ is the Q value obtained when that action is taken in the next step)
first Q value updating means by which the learning agent updates the Q value according to the above formula; a hierarchical agent learning system having this means is provided.
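A single application of the Q value update can be sketched as below. The dictionary-based Q table, the function name `q_update`, and the example states and values are assumptions made for illustration; the update line itself follows the formula in the text, with the maximum taken over the actions available in the next state.

```python
from collections import defaultdict


def q_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- (1 - alpha) Q(s,a) + alpha [r + gamma * max_a' Q(s_next, a')]."""
    best_next = max(q[(s_next, a_next)] for a_next in actions)
    q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * (r + gamma * best_next)
    return q[(s, a)]


q = defaultdict(float)            # unseen state-action pairs start at 0.0
actions = ["N", "E", "S", "W"]
q[("s1", "E")] = 2.0              # best known value in the next state
new_q = q_update(q, "s0", "N", r=1.0, s_next="s1",
                 actions=actions, alpha=0.5, gamma=0.5)
# (1 - 0.5) * 0.0 + 0.5 * (1.0 + 0.5 * 2.0) = 1.0
```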

更に別の好ましい態様によれば、前記Ｖ値更新手段が、
V(s_t)←V(s_t)+α[r_t+γV(s_t+1)-V(s_t)]
（ここで、ｓ_tは時刻ｔにおける状態ｓ、s_t+1は、時刻ｔ＋１における状態ｓ、Ｖ（ｓ_t）は、状態ｓ_tのときのＶ値、Ｖ（s_t+1）は、状態s_t+1のときのＶ値、αはステップサイズパラメータ、γは割引率、α、γの値の範囲は0≦α、γ≦1となり、α=γである必要は無い、rは外部環境が与える報酬のことで、ペナルティの場合は通常マイナス、報酬の場合はプラスの値を与える、を意味する。本実施例では、Ｖ（ｓ）は、Ｖ（x，ｙ）の２次元配列である（（ｘ、ｙ）は、教育エージェントの位置である）。）
なる式にしたがって、前記教育エージェントがＶ値を更新する、第１のＶ値更新手段を有する、階層型エージェント学習システムが提供される。 According to still another preferred aspect, the V value updating means includes
$$V(s_t) \leftarrow V(s_t) + \alpha\left[r_t + \gamma V(s_{t+1}) - V(s_t)\right]$$
(where $s_t$ is the state at time $t$; $s_{t+1}$ is the state at time $t+1$; $V(s_t)$ is the V value in state $s_t$; $V(s_{t+1})$ is the V value in state $s_{t+1}$; $\alpha$ is the step size parameter and $\gamma$ is the discount rate, with $0 \le \alpha, \gamma \le 1$, and $\alpha = \gamma$ is not required; and $r_t$ is the reward given by the external environment, normally negative for a penalty and positive for a reward. In this embodiment, $V(s)$ is a two-dimensional array $V(x, y)$, where $(x, y)$ is the position of the educational agent.)
first V value updating means by which the educational agent updates the V value according to the above formula; a hierarchical agent learning system having this means is provided.
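The V value update on a two-dimensional table can be sketched as follows. The nested-list representation indexed as `v[x][y]`, the function name, and the example numbers are assumptions for illustration; the update follows the formula in the text.

```python
def v_update(v, pos, r, next_pos, alpha=0.1, gamma=0.9):
    """V(s_t) <- V(s_t) + alpha [r_t + gamma V(s_{t+1}) - V(s_t)],
    with states given as (x, y) positions in a 2-D table v[x][y]."""
    x, y = pos
    nx, ny = next_pos
    v[x][y] += alpha * (r + gamma * v[nx][ny] - v[x][y])
    return v[x][y]


v = [[0.0] * 3 for _ in range(3)]   # 3 x 3 grid of V values, all zero
v[1][0] = 2.0                        # value already learned for the next position
new_v = v_update(v, pos=(0, 0), r=1.0, next_pos=(1, 0), alpha=0.5, gamma=0.5)
# 0.0 + 0.5 * (1.0 + 0.5 * 2.0 - 0.0) = 1.0
```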

本発明の別の特徴によれば、人間またはロボットたる教育エージェントと、コンピュータによって機能する、少なくとも、学習エージェント、および中間エージェントを含み、所定の空間位置に存在する各エージェントの認識する時刻と、客観的な時刻との間にずれが存在する、システムにおいて、各エージェントが自律的に学習することによって、前記学習エージェントをスタート位置からゴール位置まで制御するシステムであって、前記学習エージェントが、環境を観測する環境観測手段と、前方に障害物が存在するか否かを判断する障害物確認手段と、前記障害物確認手段で、障害物が存在すると判断された場合には、外部環境または教育エージェントからのがマイナスの報酬を前記学習エージェントが受け取る、第１の報酬更新手段と、前記障害物確認手段で障害物が存在すると判断された場合には前記第１の報酬の受け取りの後に、さもなければ前記障害物が存在すると確認された後に、前記学習エージェントが強化学習を行う第１の強化学習実行手段と、移動する方向を決定する意思決定手段と、前記意思決定手段で決定された移動量で行動を採る、行動実行手段と、前記行動の結果、教育エージェントから、マイナスの報酬を受けたか否かを判断する、ペナルティー判断手段と、前記学習エージェントが前記マイナスの報酬を受けた場合に、環境として認識される人またはロボットたる前記教育エージェントからの、既に設定されていた前記学習エージェントの有する報酬値に前記マイナスの報酬値の加算を促す、第２の報酬更新手段と、前記ペナルティー判断手段で、マイナスの報酬を受けたと判断した場合には前記第２の報酬更新手段によるマイナスの報酬値の加算の後に、さもなければ前記ペナルティー判断手段による判断の後に、前記学習エージェントが強化学習を行う第２の強化学習実行手段と、前記学習エージェントが、ゴールに到達したか否かを判断する、ゴール到達判断手段と、を有し、前記教育エージェントが、前記学習エージェントがゴールに到達していない場合、以上の手段による制御を所定の回数反復し、その後にゴールに到達していない場合に、前記中間エージェントにより、前記第１の報酬更新手段、および、前記第２の報酬更新手段で、前記学習エージェントに報酬を与えるタイミングを変更する、報酬更新ルール変更手段を有し、更に、前記中間エージェントが、前記学習エージェントが過去にゴールに到達したか否かの判断にしたがって、次の試行で用いるＱ値を準備する、履歴更新準備手段を有し、前記学習エージェントが、前記履歴更新準備手段によって得られたＱ値を用いて、Ｑ値を更新する履歴更新手段を有する、ことを特徴とする、階層型エージェント学習システムが提供される。 According to another feature of the present invention, an educational agent that is a human or a robot, and at least a learning agent and an intermediate agent that function by a computer, the time recognized by each agent existing in a predetermined spatial position, and an objective In a system in which there is a deviation from a specific time, each agent autonomously learns to control the learning agent from a start position to a goal position, and the learning agent An environmental observation means for observing, an obstacle confirmation means for judging whether or not there is an obstacle ahead, and if the obstacle confirmation means judges that an obstacle is present, an external environment or an educational agent First reward updating means for receiving a negative reward from the learning agent The learning agent performs 
reinforcement learning through first reinforcement learning execution means, after receiving the first reward when the obstacle confirmation means has determined that an obstacle is present, and otherwise after the obstacle check. The learning agent further has: decision making means for determining the direction of movement; action execution means for taking an action with the movement amount determined by the decision making means; penalty judging means for judging whether, as a result of the action, a negative reward has been received from the educational agent; second reward update means by which, when the learning agent has received the negative reward, the educational agent, a person or robot recognized as part of the environment, prompts the addition of the negative reward value to the reward value already set in the learning agent; second reinforcement learning execution means by which the learning agent performs reinforcement learning after the addition of the negative reward value by the second reward update means when the penalty judging means has judged that a negative reward was received, and otherwise after the judgment by the penalty judging means; and goal arrival judging means for judging whether the learning agent has reached the goal. The educational agent has reward update rule changing means by which, when the learning agent has not reached the goal, control by the above means is repeated a predetermined number of times and, if the goal has still not been reached, the intermediate agent changes the timing at which the first reward update means and the second reward update means give rewards to the learning agent. The intermediate agent further has history update preparation means for preparing the Q values to be used in the next trial according to whether the learning agent has reached the goal in the past, and the learning agent has history update means for updating the Q values using the Q values obtained by the history update preparation means. A hierarchical agent learning system characterized by these means is provided.

この場合、前記意思決定手段が、学習エージェントがＱテーブルの値を観測する、Ｑテーブル観測手段と、greedy方策に従って、学習エージェントがＱ値の最も高い行動のみを選択する、行動選択手段と、を有する、ようにしても良い。 In this case, the decision making means comprises: a Q table observation means for the learning agent to observe the value of the Q table; and an action selection means for the learning agent to select only the action having the highest Q value according to the greedy policy. You may make it have.

別の好ましい態様によれば、前記行動実行手段が、学習エージェントが所定の角度だけ回転する、回転手段と、前記回転終了後、前記学習エージェントを所定の移動量だけ進行させる、移動手段と、を有する、階層型エージェント学習システムが提供される。 According to another preferred aspect, the action executing means includes: a rotating means for rotating the learning agent by a predetermined angle; and a moving means for causing the learning agent to advance by a predetermined movement amount after the rotation ends. A hierarchical agent learning system is provided.
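The rotate-then-move action can be sketched with a simple planar pose. The `Pose` class, the use of radians measured counter-clockwise from the +x axis, and the function name are assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose:
    x: float
    y: float
    heading: float  # radians, counter-clockwise from the +x axis


def rotate_then_move(pose, angle, distance):
    """Rotate by `angle` first, then advance `distance` along the new heading."""
    heading = (pose.heading + angle) % (2 * math.pi)
    return Pose(pose.x + distance * math.cos(heading),
                pose.y + distance * math.sin(heading),
                heading)


# Start at the origin facing +x, turn a quarter turn left, advance 2 units.
moved = rotate_then_move(Pose(0.0, 0.0, 0.0), math.pi / 2, 2.0)
```

Completing the rotation before the translation means the displacement depends only on the new heading, which matches the order of the rotating means and moving means described above.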

また別の態様によれば、前記第１の強化学習実行手段、および／または、第２の強化学習実行手段が、Ｑ値をｔ_p＜ｔ＜ｔ_nの範囲で、
Ｑ（ｓ，ａ）←（１−α）Ｑ（ｓ，ａ）＋α[ｒ＋γ・ｍａｘ_a’・Ｑ（ｓ_t+1，ａ_t+1）]
（ここで、ｓは状態、ａは行動、Ｑ（ｓ，ａ）は、状態がｓ、行動がａのときのＱ値、αはステップサイズパラメータ、γは割引率、α、γの値の範囲は0≦α、γ≦1となり、α=γである必要は無い、rは外部環境が与える報酬のことで、ペナルティの場合は通常マイナス、報酬の場合はプラスの値を与える、Ｑ（s_t+1,a_t+1）は、状態がｓ_t+1、行動がａ_t+1のときのＱ値を意味する。「状態がｓ_t+1、行動がａ_t+1のとき」とは、次に取りうる状態、行動、の対を意味する。max_a'は、次に取ることが可能な全ての状態、行動、の対において最もＱ値が最大になるような行動を選択することを意味する。max_a'Q(s_t+1,a_t+1)は、次のステップで取りうる全ての状態、行動、の対においてＱ値が最大となるような行動ａを取ったときのＱ値である。）
なる式に従って更新する、Ｑ値更新手段、を有する、階層型エージェント学習システムが提供される。 According to another aspect, the first reinforcement learning execution means and / or the second reinforcement learning execution means has a Q value in a range of t _p <t <t _n ,
$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s_{t+1}, a_{t+1})\right]$$
(where $s$ is the state; $a$ is the action; $Q(s,a)$ is the Q value when the state is $s$ and the action is $a$; $\alpha$ is the step size parameter and $\gamma$ is the discount rate, with $0 \le \alpha, \gamma \le 1$, and $\alpha = \gamma$ is not required; $r$ is the reward given by the external environment, normally negative for a penalty and positive for a reward; $Q(s_{t+1}, a_{t+1})$ is the Q value when the state is $s_{t+1}$ and the action is $a_{t+1}$, that is, for a state-action pair that can be taken next; $\max_{a'}$ means selecting, among all state-action pairs that can be taken next, the action with the largest Q value; and $\max_{a'} Q(s_{t+1}, a_{t+1})$ is the Q value obtained when that action is taken in the next step.)
The Q value is updated according to the above formula by Q value updating means, and a hierarchical agent learning system having that means is provided.

更に別の態様によれば、報酬更新ルール変更手段が、Ｑ値の時間的更新範囲ｔ_p、ｔ_nを照合する、時間的更新範囲照合手段と、
ｔ_p←ｔ_p−ｉ，ｔ_n←ｔ_n＋ｊ
（ここで、ｉ,jは正の整数とし、t_p<t_nを満たす。）なる式に基づいて前記パラメータｔ_p、ｔ_nを加減することによって、更新範囲を変更する、更新範囲変更手段と、を有する、階層型エージェント学習システムが提供される。 According to still another aspect, the remuneration update rule changing unit collates the temporal update ranges t _p and t _n of the Q value,
$$t_p \leftarrow t_p - i, \qquad t_n \leftarrow t_n + j$$
(where $i$ and $j$ are positive integers and $t_p < t_n$ holds), and has update range changing means for changing the update range by adjusting the parameters $t_p$ and $t_n$ according to this formula. A hierarchical agent learning system having these means is provided.

また別の態様によれば、前記履歴更新準備手段が、前記学習エージェントに対する前記マイナスの報酬が入った状態、行動、の対、の時刻ｔを、前記中間エージェントが照合する、時刻照合手段と、前記学習エージェントが一度もゴールに到達しなかったかを、前記中間エージェントが判断する、ゴール到達判断手段と、前記ゴール到達判断手段で、一度もゴールに到達しなかったと判断された場合に、前記中間エージェントが、現在の試行で更新された前記学習エージェントが持つＱ値を、一つ前の試行開始時の状態に戻す、第１のＱ値復帰手段と、前記ゴール到達判断手段で、過去にゴールに到達したと判断された場合に、前記中間エージェントが、現在の試行で更新された前記学習エージェントが持つＱ値を、直近にゴールに到達した試行終了後のＱ値の状態に戻してから、前記報酬更新ルール変更手段で変更された新たな更新ルールで、次の試行のQ値を更新する第２のＱ値復帰手段と、を有し、前記履歴更新手段が、前記第１または第２のＱ値復帰手段による処理の後に、
Ｑ（ｓ，ａ）←（１−α）Ｑ（ｓ，ａ）＋α[ｒ＋γ・ｍａｘ_a’・Ｑ（ｓ_t+1，ａ_t+1）]
（ここで、ｓは状態、ａは行動、Ｑ（ｓ，ａ）は、状態がｓ、行動がａのときのＱ値、αはステップサイズパラメータ、γは割引率、α、γの値の範囲は0≦α、γ≦1となり、α=γである必要は無い、rは外部環境が与える報酬のことで、ペナルティの場合は通常マイナス、報酬の場合はプラスの値を与える、Ｑ（s_t+1,a_t+1）は、状態がｓ_t+1、行動がａ_t+1のときのＱ値を意味する。「状態がｓ_t+1、行動がａ_t+1のとき」とは、次に取りうる状態、行動、の対を意味する。max_a'は、次に取ることが可能な全ての状態、行動、の対において最もＱ値が最大になるような行動を選択することを意味する。max_a'Q(s_t+1,a_t+1)は、次のステップで取りうる全ての状態、行動、の対においてＱ値が最大となるような行動ａを取ったときのＱ値である。）なる式に従って、前記学習エージェントが、Ｑ値を、更新されたｔｐ＜ｔ＜ｔｎの範囲で更新する、Ｑ値更新手段、を有する、階層型エージェント学習システムが提供される。 According to another aspect, the history update preparation means includes a time collating means in which the intermediate agent collates a time t of a pair of a state, an action, and a negative reward for the learning agent. When the intermediate agent determines whether the learning agent has never reached the goal, and the goal arrival determination means and the goal arrival determination means determine that the goal has never been reached, the intermediate agent In the past, the agent uses the first Q value return means for returning the Q value of the learning agent updated in the current trial to the state at the start of the previous trial, and the goal arrival judgment means. When it is determined that the intermediate agent has reached the goal, the intermediate agent has recently reached the Q value of the learning agent updated in the current trial. A second Q value return means for updating the Q value of the next trial with the new update rule changed by the reward update rule changing means after returning to the state of the Q value after the end of the line; , The history update means, after the processing by the first or second Q value return means,
$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s_{t+1}, a_{t+1})\right]$$
(where $s$ is the state; $a$ is the action; $Q(s,a)$ is the Q value when the state is $s$ and the action is $a$; $\alpha$ is the step size parameter and $\gamma$ is the discount rate, with $0 \le \alpha, \gamma \le 1$, and $\alpha = \gamma$ is not required; $r$ is the reward given by the external environment, normally negative for a penalty and positive for a reward; $Q(s_{t+1}, a_{t+1})$ is the Q value when the state is $s_{t+1}$ and the action is $a_{t+1}$, that is, for a state-action pair that can be taken next; $\max_{a'}$ means selecting, among all state-action pairs that can be taken next, the action with the largest Q value; and $\max_{a'} Q(s_{t+1}, a_{t+1})$ is the Q value obtained when that action is taken in the next step.) The history update means has Q value updating means by which the learning agent updates the Q value within the updated range $t_p < t < t_n$ according to the above formula. A hierarchical agent learning system having these means is provided.

なお、本出願において使用する用語の定義を以下に示す。
（表１）用語の定義

The terms used in this application are defined below.
(Table 1) Definition of terms

（表１の「用語の定義」における参考文献）
[Sutton98] Sutton、 R． S． & Barto、 A．:Reinforcement Learning: An Introduction、 A Bradford Book、 The MIT Press (1998)．
[Watkins92] Watkins、 C． J． C． H． and Dayan、 P．:Technical Note: Q-Learning、 Machine Learning 8、 pp． 279--292 (1992)． (References in "Definition of terms" in Table 1)
[Sutton98] Sutton, R. S. & Barto, A.: Reinforcement Learning: An Introduction, A Bradford Book, The MIT Press (1998).
[Watkins92] Watkins, C. J. C. H. and Dayan, P.: Technical Note: Q-Learning, Machine Learning 8, pp. 279-292 (1992).

本発明により、学習型オンライン制御システムにおいて、評価信号に時間的、空間的あいまいさが含まれる系に対して、あいまいさを吸収することが可能なシステムおよび制御方法を提供することができる。 According to the present invention, it is possible to provide a system and a control method capable of absorbing ambiguity for a system in which temporal and spatial ambiguities are included in an evaluation signal in a learning-type online control system.

以下、図面を参照しつつ、本発明の実施の形態を説明する。
システムを局所的な学習をする学習エージェント、大局的な学習をする教育エージェント、その間をとりもつ中間層の3層に分け、各層にエージェントが存在する（図１）。ここでいうエージェントとは、自分以外のすべてを環境とし、環境を認識することができるセンサを持ち、環境から与えられる信号と評価から、自己評価が可能なものとする。
図１のように、教育エージェントは、学習エージェントの行動を観測し、学習エージェントに対して評価値という形式で、指示を出す。教育エージェントは、逐次的・大局的な学習を行う点に特徴がある。
これに対して、学習エージェントは、教育エージェントから与えられた評価値を元に行動し、局所的な学習を行う点に特徴がある。
中間層のエージェントは、教育エージェントが出す評価値と学習エージェントがその指示に従って行動した時の自己評価値を観測し、評価値のずれが大きいため目的が達成できないと判断した場合に、教育エージェントの評価値と学習エージェントの評価値の整合性をとることにより、目的達成を図る。ここでは、評価値のずれを評価系におけるあいまいさと定義し、時間的なあいまいさと空間的なあいまいさに分類する。 Hereinafter, embodiments of the present invention will be described with reference to the drawings.
The system is divided into three layers: a learning agent that performs local learning, an educational agent that performs global learning, and an intermediate layer that mediates between them. An agent exists in each layer (FIG. 1). An agent here treats everything other than itself as the environment, has sensors that can perceive the environment, and can evaluate itself from the signals and evaluations the environment provides.
As shown in FIG. 1, the educational agent observes the behavior of the learning agent and issues instructions to it in the form of evaluation values. The educational agent is characterized by sequential, global learning.
In contrast, the learning agent acts on the evaluation values given by the educational agent and is characterized by local learning.
The intermediate-layer agent observes the evaluation value issued by the educational agent and the self-evaluation value of the learning agent when it acts on that instruction. When it judges that the objective cannot be achieved because the two evaluation values diverge too much, it reconciles the evaluation value of the educational agent with that of the learning agent so that the objective can be achieved. Here, the divergence between evaluation values is defined as ambiguity in the evaluation system and is classified into temporal ambiguity and spatial ambiguity.

ここで、空間的なあいまいさとは、図２のように、学習器にとっての、ある空間位置ｓにおける、ある時点ｔでの関数値ｆ^*（ｔ，ｓ）が、学習器の認識する位置と、客観的な正確な位置および形状とのズレの存在のために、理論値とは異なる、ということである。
同様に、時間的あいまいさとは、図３のように、学習器（エージェント内部に構築されている学習機構のこと。エージェントにはアクチュエータやセンサを含むが、学習器はそれらの情報を入出力として、意思決定を行うもの）にとっての、ある空間位置ｓにおける、ある時点ｔでの関数値ｆ^*（ｔ，ｓ）が、学習器の認識する時刻と、外部の時刻とのズレの存在のために、理論値とは異なる、ということである。ここでいう「理論値」というものは、学習器にとって最も効率良く学習ができる更新式と捉えられる。この「更新式」をそのまま使えないことが「あいまいさ」と言い替えることができる。
これらの”あいまいさ”の存在により、外部環境との相互作用から自己の位置および行動を評価して学習する学習器の学習が、所望の過程によって改善されない場合が存在する。本発明はこのような”あいまいさ”を克服して、学習器に対して所望の学習を行わせることに関する。
空間的なあいまいさを含む問題として、経路探索を含むピアノ問題をとりあげる。また、時間的なあいまいさを含む問題として、マンマシンインタフェースをとりあげ、それぞれについて詳しく説明する。 Here, as shown in FIG. 2, spatial ambiguity means that, for the learning device, the function value $f^*(t, s)$ at a time $t$ and spatial position $s$ differs from the theoretical value because of a deviation between the position recognized by the learning device and the objectively accurate position and shape.
Similarly, as shown in FIG. 3, temporal ambiguity means that, for the learning device, the function value $f^*(t, s)$ at a time $t$ and spatial position $s$ differs from the theoretical value because of a deviation between the time recognized by the learning device and the external time. (The learning device is the learning mechanism built inside the agent; the agent includes actuators and sensors, whereas the learning device takes their information as input and output and makes decisions.) The "theoretical value" here can be regarded as the update formula with which the learning device learns most efficiently. Being unable to use this update formula as it is can be rephrased as "ambiguity".
Because of these ambiguities, there are cases where a learning device that learns by evaluating its own position and actions through interaction with the external environment does not improve along the desired course. The present invention relates to overcoming such ambiguity and making the learning device perform the desired learning.
As a problem involving spatial ambiguity, the piano problem including route search is taken up. As a problem involving temporal ambiguity, a man-machine interface is taken up. Each is described in detail below.

空間的なあいまいさを含む問題例：経路探索を含む姿勢制御問題（ピアノ問題）
空間的なあいまいさを含む問題として、経路探索を含むピアノ問題をシミュレートした例をとりあげる。
ピアノ問題とは、物体（例えばピアノ）を部屋の中のスタート地点から部屋の外の目的地に出すために、物体の回転および平行移動などの物体の姿勢、および経路探索を扱う問題である。本実施例では、この経路探索を含むピアノ問題で、あらかじめ設定されたスタートからゴールまで、エージェント自身が障害物回避を行いながら到達する。各エージェントにとってゴールの位置および環境は未知である。
実環境への応用を目的としているので、各エージェントは、自分自身の形状や大きさを知らない。但し、学習エージェントは自身の大きさ、および、自己のセンサが検知できる範囲内の環境は把握している。これは、ロボットの正確な大きさや形状を測定することが困難であるということを念頭に置いている。また、実環境において、環境を正確に知ることはほぼ不可能なためである。
環境には、通行可能な領域と通行不可能な障害物が存在する。図４で黒い部分が障害物で、白い部分が通行可能領域となる。口の中のＳ点41がスタート地点であり、エージェントの中心座標がそこに直角に置かれる。Ｇ点43がゴールとなる。
教育エージェント45の形は20×40ピクセルの長方形とする。教育エージェントを駆動するキャスターに相当するのが、障害物回避を学習した学習エージェント（47ａ〜47ｄ）である。
学習エージェントは4体おり、長方形（教育エージェント）（図４の45に相当）の四つ角に配置された、それぞれ、直径4ピクセル（シミュレーション画像の１画素＝１ピクセルである）の円形とする（図５の53ａ〜53ｄ）。学習エージェントの移動によって、教育エージェント51（図４の45に相当）および学習エージェント（53ａ〜53ｄ）（図４の47ａ〜47ｄに相当）の全体が移動する。学習エージェント（53ａ〜53ｄ）は衝突回避と姿勢制御を行う。 Example of problems including spatial ambiguity: Posture control problem including path search (piano problem)
As an example of a problem involving spatial ambiguity, we will take an example of simulating a piano problem involving route search.
The piano problem deals with the posture of an object, such as its rotation and translation, and with route search, in order to move the object (for example, a piano) from a start point inside a room to a destination outside the room. In this embodiment, in the piano problem including route search, the agents themselves travel from a preset start to the goal while avoiding obstacles. The goal location and the environment are unknown to each agent.
Because application to real environments is intended, each agent does not know its own shape and size. However, the learning agent knows its own size and the environment within the range that its own sensors can detect. This reflects the difficulty of measuring the exact size and shape of a robot, and the fact that knowing a real environment accurately is almost impossible.
The environment contains passable areas and impassable obstacles. In FIG. 4, the black parts are obstacles and the white parts are passable areas. Point S 41, inside the box, is the start point, and the agent is placed there at a right angle with its center coordinates on that point. Point G 43 is the goal.
The shape of the education agent 45 is a rectangle of 20 × 40 pixels. The learning agents (47a to 47d), which have learned obstacle avoidance, correspond to the casters that drive the education agent.
There are four learning agents, arranged at the four corners of the rectangle (the education agent, corresponding to 45 in FIG. 4). Each is a circle 4 pixels in diameter, where one picture element of the simulation image is one pixel (53a to 53d in FIG. 5). When the learning agents move, the education agent 51 (corresponding to 45 in FIG. 4) and the learning agents (53a to 53d) (corresponding to 47a to 47d in FIG. 4) move as a whole. The learning agents (53a to 53d) perform collision avoidance and posture control.

ピアノ問題における学習エージェント（53ａ〜53ｄ）は、キャスター付きのピアノを教育エージェント51とすると、キャスターに相当する部分である。キャスターにモータと意思決定機構をつけて、各自が勝手な方向に行こうとするが、各々が長方形（教育エージェント51）の四つ角に配置されているため、自分が行きたい方向に自由には行けない。しかし、学習を繰り返すことによって、段々と各キャスターが協調行動を取るようになる。この機構はＱ−learningによって行われる。移動方向の物理条件もプログラミング上で設定される。
学習エージェント（53ａ〜53ｄ）は、その「状態ｓ」および「行動ａ」、を持つが、状態sは自分のセンサ値(5方向、6段階)（図5）、各エージェントのセンサ（合計４つ）が反応しているかどうか(反応している＝１，していない＝０、の2段階)の8状態、行動aは4方向とする。本実施例における上記Ｑ−learningでは、各学習エージェントの通しナンバー、「状態ｓ」および「行動ａ」を配列化した、Q[agent_num]([s0][s1][s2][s3][s4][other],[action_num]) =Ｑ(s,a)を用いて学習を行う。ここで、ｓ（状態）は各学習エージェントのセンサ値（[s0][s1][s2][s3][s4]）および、圧縮した自分以外の学習エージェントのセンサ値[other]であり、行動は自分の行動であり、[action_num]に相当する。つまり、Q[agent_num]([s0][s1][s2][s3][s4][other],[action_num])の８次元配列である。なお、各学習エージェントは自分のセンサレンジは５つのレンジを持っているが、自分以外のエージェントから受け取るセンサ情報は、そのエージェントのセンサに反応があったかどうか、すなわち、０か１かだけの情報となる。これが”圧縮”の意味である。agent_numは４体のアクチュエータ付ローラが存在するため４つの値を取り、action_numは行動方向が４方向なので、４つの値を取る。
[s0][s1][s2][s3][s4]のそれぞれは、５段階のセンサ感度を持っているので、５つの値を取る。
[other]については０か１の２値である。 The learning agents (53a to 53d) in the piano problem correspond to the casters when a piano with casters is taken as the education agent 51. Each caster is fitted with a motor and a decision-making mechanism and tries to go in whatever direction it wants. However, because the casters are fixed at the four corners of the rectangle (education agent 51), none of them can move freely in the direction it wants. By repeating learning, the casters gradually come to act cooperatively. This mechanism is implemented by Q-learning. The physical conditions on the movement direction are also set in the program.
The learning agents (53a to 53d) each have a “state s” and an “action a”. The state s consists of the agent's own sensor values (5 directions, 6 levels) (FIG. 5) and whether the sensor of each agent (four in total) is responding (two levels: responding = 1, not responding = 0), giving 8 states; the action a is one of four directions. In the Q-learning of this embodiment, learning uses Q[agent_num]([s0][s1][s2][s3][s4][other],[action_num]) = Q(s, a), an array indexed by the serial number of each learning agent, the “state s”, and the “action a”. Here, s (the state) consists of the agent's own sensor values ([s0][s1][s2][s3][s4]) and the compressed sensor value [other] of the other learning agents, and the action is the agent's own action, corresponding to [action_num]. In other words, Q[agent_num]([s0][s1][s2][s3][s4][other],[action_num]) is an 8-dimensional array. Each learning agent has five ranges for its own sensors, but the sensor information it receives from the other agents is only whether that agent's sensor responded, that is, 0 or 1. This is what “compression” means. agent_num takes four values because there are four rollers with actuators, and action_num takes four values because there are four action directions.
Since each of [s0] [s1] [s2] [s3] [s4] has five levels of sensor sensitivity, it takes five values.
[other] is a binary value of 0 or 1.
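The array layout described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the dictionary storage, the names `make_q_table` and `compress_others`, and the rule that [other] is 1 when any sensor of any other agent responds are all assumptions made for the example. The sizes (4 agents, five sensors with five levels each, a binary [other], 4 actions) follow the text.

```python
from itertools import product

N_AGENTS = 4    # four learning agents (casters)
N_SENSORS = 5   # sensor values s0..s4
N_LEVELS = 5    # each sensor value takes five levels
N_OTHER = 2     # compressed flag from the other agents: 0 or 1
N_ACTIONS = 4   # four movement directions


def make_q_table():
    """Return a dict keyed by (agent, s0, s1, s2, s3, s4, other, action), all 0.0."""
    ranges = ([range(N_AGENTS)]
              + [range(N_LEVELS)] * N_SENSORS
              + [range(N_OTHER), range(N_ACTIONS)])
    return {key: 0.0 for key in product(*ranges)}


def compress_others(other_sensor_readings):
    """Collapse the other agents' readings into one 0/1 flag.

    Assumption: the flag is 1 if any sensor of any other agent responds.
    """
    responded = any(v > 0 for readings in other_sensor_readings for v in readings)
    return 1 if responded else 0


q = make_q_table()
```

With these sizes the table holds 4 × 5^5 × 2 × 4 = 100,000 entries, and each key has eight components, matching the 8-dimensional array in the text.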

教育エージェント51は、経路探索を行い、現在いるところからゴールまでの方向を示す（部屋の外までの経路を学習する）部分である。教育エージェント51は自身の大きさを把握していないが、探索した環境の履歴を学習という形で保存することは可能である。教育エージェント51が示す方向は、8方向である。各学習エージェント（53ａ〜53ｄ）が実際に移動する方向は4方向であるが、４個の学習エージェント（53ａ〜53ｄ）の力の合成により移動可能な方向は256方向となる。教育エージェント51内部には、強化学習の一手法であるTD-learningを構築する。本実施例におけるTD-learningでは、各教育エージェント51の「状態ｓ」を配列化した、Ｖ（ｓ）を用いて学習を行う。ここで、教育エージェント51の状態sはシミュレーション環境の絶対座標(問題空間に依存する)である。本実施例では、状態ｓは、移動平面上の座標（x，y）の２次元量である。教育エージェント51の行動aは、移動すべき、8方向のいずれかを学習エージェント（53ａ〜53ｄ）に示すことである。
教育エージェント51は、自分自身の形状を知らないために、一見通過可能な経路を示す。しかし、経路の幅が狭すぎるために、経路は存在しているが、通過できないという現象がおきる。自身の位置とゴールの間に障害物がある場合、教育エージェント51が示す方向と、学習エージェント（53ａ〜53ｄ）が示す方向が異なる。 The education agent 51 is a part that conducts a route search and indicates the direction from the current position to the goal (learns the route to the outside of the room). Although the educational agent 51 does not know its size, it can store the history of the searched environment in the form of learning. The direction indicated by the education agent 51 is eight directions. The directions in which each learning agent (53a to 53d) actually moves are four directions, but the directions in which the learning agents (53a to 53d) can move by combining the forces of the four learning agents (53a to 53d) are 256 directions. TD-learning, which is one method of reinforcement learning, is built inside the education agent 51. In TD-learning in the present embodiment, learning is performed using V (s) in which the “state s” of each educational agent 51 is arranged. Here, the state s of the education agent 51 is an absolute coordinate (depending on the problem space) of the simulation environment. In this embodiment, the state s is a two-dimensional quantity of coordinates (x, y) on the moving plane. The action a of the education agent 51 is to indicate to the learning agents (53a to 53d) any one of the eight directions to be moved.
Because the education agent 51 does not know its own shape, it may indicate a path that only appears passable. The path exists, but it is too narrow to pass through. When there is an obstacle between the agent's position and the goal, the direction indicated by the education agent 51 differs from the direction indicated by the learning agents (53a to 53d).

このように、物体の大きさが実際の値と教育エージェントの値と異なる場合、この値の差を空間的なあいまいさ、と定義する。空間的なあいまいさが引き起こす問題を解決するために、中間層のエージェントが、簡単なルールをもとに、解決をはかる。経路探索にも学習を採用した。
つまり、中間エージェント（図６における61）の役割は、学習エージェント63と教育エージェント65の示す方向が異なる場合に、中間層のエージェント61が仲立ちをし、目的を達成することである。
このような場合に、中間エージェント61はあらかじめ設定されたルールに従って、学習エージェント63に移動方向を示す。中間エージェント61内部には、学習エージェント63が示す方向と教育エージェント65が示す方向の整合性をとるためのルールが構築されている。ルールは複数設定され、どのルールを適用するか、ルールに必要なパラメータを自律的に適応する。ここでは、（１）教育エージェント優先した方向の指示、（２）学習エージェントと教育エージェントの示す方向のベクトル合成とする、の２つのルールを切り替えるタイミングをパラメータとしてもつ。
なお、本実施例（シミュレーション）では、教育エージェント、学習エージェント、中間エージェントの全てを、コンピュータおよびソフトウェアで実現しており、人間は介在しない。この点で人間が教育エージェントの役割を担う、実施例２とは異なる。
空間的なあいまいさを含む問題に対して、階層型エージェントシステムは有効である。階層型に機能分化させることにより、各階層での機構が単純化され、変更が容易となる。また、一部の層だけ変更しても、問題解決が可能であることが実験により示されている。 Thus, when the size of the object is different from the actual value and the value of the educational agent, the difference between the values is defined as spatial ambiguity. In order to solve the problem caused by spatial ambiguity, agents in the middle layer try to solve based on simple rules. Learning was also adopted for route search.
That is, the role of the intermediate agent (61 in FIG. 6) is to mediate when the directions indicated by the learning agent 63 and the educational agent 65 differ, so that the objective is achieved.
In such a case, the intermediate agent 61 indicates the moving direction to the learning agent 63 according to preset rules. Inside the intermediate agent 61, rules are built for reconciling the direction indicated by the learning agent 63 with the direction indicated by the education agent 65. A plurality of rules are set, and the intermediate agent autonomously adapts which rule to apply and the parameters the rules need. Here, the parameter is the timing for switching between two rules: (1) indicating a direction that gives priority to the educational agent, and (2) taking the vector sum of the directions indicated by the learning agent and the educational agent.
In this embodiment (simulation), all of the educational agent, the learning agent, and the intermediate agent are realized by a computer and software, and no human is involved. This is different from the second embodiment in which a human plays the role of an educational agent.
Hierarchical agent systems are effective for problems involving spatial ambiguity. By differentiating functions into layers, the mechanism at each layer is simplified and changes are facilitated. Experiments have shown that the problem can be solved even if only some layers are changed.

以上のような各エージェントに関する設定条件を前提として、以下に説明するフローチャートのアルゴリズムに従ってシミュレーション実験を行った結果を図７および図８に示す。シミューレート環境は、スタートからゴールまでの経路がいくつかある中で、より距離が短い経路は、道幅がせまいために通過できず、遠回りしなくてはゴールにたどり着かないような環境を用意した。
図７Ａ〜Ｃでは、（１）教育エージェントの判断を優先した、方向の指示、図７Ｄ〜Ｆでは（２）学習エージェントと教育エージェントの示す方向のベクトル合成とする、のルールを採用した場合を示す。ここで、episodeは、試行の回数である。
ここでは、（１）のルールの方が、容易にゴールに到達している。
参考として、図８は、教育エージェントのエージェント構造を、本実施例のＴＤ学習とは異なる方法に変えた場合との比較を示したものである。図８では教育エージェントがＴＤ学習を行った場合（左）とｔｒｅｅによる経路探索（右）を行った場合とを比較した経路を表示している。
図９−Ａ、９Ｂは、経路探索が改善される様子を示している。図９−Ａでは、教育エージェントが、スタートからゴールに向かって、近い経路である図９−Ａの左側にある細い道（経路913）を見つけるのであるが、実際に学習エージェント901が行動をとって通過しようとしたときに、道幅が狭くて通過できない。
そこで、教育エージェントが経路探索を再び行う。図９−Ｂでは、図９−Ａで通行不可であった部分を「通行不可領域」909として学習し、次回からはそこは探索対象からはずし、ゴール907に到達している。
次に同様に参考として、図１０−Ａを参照する。これは、図８（Ｂ）（Ｃ）におけるように、教育エージェントの内部機構に探索木を用いた場合の説明である（本実施例の、教育エージェントがＴＤ学習を行う場合とは異なる）。ここでは、障害物1005に対して探索木を伸ばし、障害物1005に当った場合は、次から探索対象から外して、ゴール1003に到達している。この「探索対象から外す」操作を詳しく説明する図１０−Ｂを参照すると、枝1010が障害物に当った場合、「通行不可」1012の場所は経路としないこととしている。そして、ゴールできる経路を発見するまでは、「通行不可」1012も障害物と考え、重ねて「通行不可」1012を増やしている。これによって、不用な場所の探索を行わなくなる。
再度図７に戻る。図７の（Ａ）から（Ｃ）までに示されるように、中間層のエージェントは、教育エージェントが示す方向の中から、もっともQ値の高い学習エージェントの示す方向を採用するという簡単なルールにも関わらず、物体は、姿勢制御を繰り返し、遠回りをして、ゴールに到達しているのがわかる。よって、このピアノ問題を実環境で行うことができれば、キャスターつきの動かしたい物体に動力部をつけることによって、その物体が自律的に障害物を回避し、指定された場所まで勝手に移動するということが可能となり、引越しや部屋の模様替えの際に、人が重い物を運ぶ必要がなくなる。 FIG. 7 and FIG. 8 show the results of a simulation experiment performed according to the algorithm of the flowchart described below on the premise of the setting conditions regarding each agent as described above. In the simulated environment, there are several routes from the start to the goal, so the route with a shorter distance cannot be passed due to the narrow width of the road, and the environment is prepared so that it cannot reach the goal unless it makes a detour. .
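FIGS. 7A to 7C show the case where rule (1), indicating a direction that gives priority to the judgment of the education agent, is adopted, and FIGS. 7D to 7F show the case where rule (2), taking the vector sum of the directions indicated by the learning agent and the education agent, is adopted. Here, episode is the number of trials.
Here, rule (1) reaches the goal more easily.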
For reference, FIG. 8 shows a comparison with the case where the agent structure of the educational agent is changed to a method different from the TD learning of the present embodiment. FIG. 8 shows a route comparing the case where the educational agent performs TD learning (left) and the case where the route search by tree (right) is performed.
FIGS. 9-A and 9-B show how the route search improves. In FIG. 9-A, the educational agent finds the narrow path on the left side of FIG. 9-A (path 913), which is the shorter route from the start to the goal, but when the learning agent 901 actually acts and tries to pass, the path is too narrow to pass through.
Therefore, the educational agent performs the route search again. In FIG. 9-B, the portion that could not be passed in FIG. 9-A is learned as an “impassable area” 909; from the next time onward it is excluded from the search, and the agent reaches the goal 907.
Next, FIG. 10-A is likewise referred to for reference. It explains the case where a search tree is used as the internal mechanism of the educational agent, as in FIGS. 8(B) and 8(C) (unlike this embodiment, in which the educational agent performs TD learning). Here, the search tree is extended toward the obstacle 1005; where it hits the obstacle 1005, that location is excluded from subsequent searches, and the goal 1003 is reached. Referring to FIG. 10-B, which explains this “exclusion from the search” in detail, when the branch 1010 hits an obstacle, the “impassable” location 1012 is not used as a route. Until a route that reaches the goal is found, “impassable” locations 1012 are also treated as obstacles, and further “impassable” locations 1012 are added. As a result, unnecessary locations are no longer searched.
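The idea of marking a location as impassable and excluding it from later searches can be sketched as below. This is only an illustration under stated assumptions: the patent does not specify the tree search, so a breadth-first search on a small grid is substituted here, and the names `bfs_path`, `search_with_learning`, and the `too_narrow` set (cells that look free to the searcher but cannot actually be passed) are hypothetical.

```python
from collections import deque


def bfs_path(grid, start, goal, blocked):
    """Shortest 4-neighbour path, skipping walls ('#') and cells in blocked."""
    rows, cols = len(grid), len(grid[0])
    prev = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if (0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#"
                    and nxt not in blocked and nxt not in prev):
                prev[nxt] = cell
                queue.append(nxt)
    return None


def search_with_learning(grid, start, goal, too_narrow):
    """Search, try the path, and mark the cell where the body gets stuck."""
    blocked = set()
    while True:
        path = bfs_path(grid, start, goal, blocked)
        if path is None:
            return None, blocked
        stuck = next((cell for cell in path if cell in too_narrow), None)
        if stuck is None:
            return path, blocked
        blocked.add(stuck)  # treated as an obstacle from now on


GRID = ["...",
        "..."]
path, blocked = search_with_learning(GRID, (0, 0), (0, 2), too_narrow={(0, 1)})
```

In this toy grid the direct route through (0, 1) is found first, fails, and is marked; the second search returns the detour through the lower row.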
Returning again to FIG. 7. As shown in (A) to (C) of FIG. 7, even though the intermediate agent uses only the simple rule of adopting, from within the direction indicated by the education agent, the direction indicated by the learning agent with the highest Q value, the object repeatedly adjusts its posture, takes a detour, and reaches the goal. Therefore, if this piano problem can be carried out in a real environment, attaching a power unit to a caster-equipped object makes it possible for the object to avoid obstacles autonomously and move to a designated place on its own, so that people no longer need to carry heavy objects when moving house or rearranging a room.

ここで、以下、ピアノ問題についてのエージェント学習方法を、フローチャートを用いて説明する。
[メインフロー]（図１１）
（１）S1100においてすべてのパラメータが初期化される。ここでのパラメータを以下に示す。
（表２）パラメータの一覧

（２）S1110において、すべてのエージェントがスタート地点に戻る(あるいは設置される)。ここで、スタート地点とは、あらかじめ設定されたある一点のことで、この一点を中心にして、同じ角度に設置される。スタート地点に戻った段階で、ステップ数sはゼロに、エピソード数episodeは1つ加算される。
（３）S1120において学習エージェントはQ-Learningを用いて、教育エージェントはTD-Learningを用いて、それぞれ次に進みたい方向を決定する(意思決定)。
（４）S1130で、学習エージェントがQ-Learningを用いて意思決定した行動、および、教育エージェントがTD−learningを用いて意思決定した行動、をそれぞれ中間エージェントに送信する。つまり、中間エージェントは、学習エージェントと教育エージェントの両方からこれから動きたい方向を受信することになる。
（５）S1140において、中間エージェントが受信した学習エージェントの方向と教育エージェントの方向が一致しているかどうかを照合する。
（６）S1150では、学習エージェントと教育エージェントの行動が一致していない場合、中間エージェントが両エージェントの移動したい方向を受け取り、全体が移動する方向を物理法則から計算し、実際に行動をとる学習エージェントに送信する。ここで言う「物理法則」の意味を簡単に説明する。 Here, an agent learning method for the piano problem will be described below using a flowchart.
[Main flow] (Fig. 11)
(1) All parameters are initialized in S1100. The parameters here are shown below.
(Table 2) List of parameters

(2) In S1110, all agents return to (or are placed at) the start point. Here, the start point is a single preset point, and the agents are placed centered on this point at the same angle each time. On returning to the start point, the step count s is reset to zero and the episode count episode is incremented by one.
(3) In S1120, the learning agent uses Q-Learning and the education agent uses TD-Learning to decide the direction in which each wants to proceed next (decision making).
(4) In S1130, the behavior determined by the learning agent using Q-Learning and the behavior determined by the educational agent using TD-learning are transmitted to the intermediate agent. That is, the intermediate agent receives the direction in which it wants to move from both the learning agent and the educational agent.
(5) In S1140, it is checked whether the direction of the learning agent received by the intermediate agent matches the direction of the education agent.
(6) In S1150, if the actions of the learning agent and the education agent do not match, the intermediate agent receives the directions in which both agents want to move, calculates from the physical laws the direction in which the whole body moves, and sends it to the learning agents that actually take the action. The meaning of “physical laws” here is briefly explained.

図１２を参照頂きたい。本実施例におけるシミュレーションでは、形の変わらないものを移動するので、学習エージェント（1201ａ〜ｄ）が全部中心に向かって行動を選択すると、全体としては動かなくなる。また、学習エージェントがすべて同じ方向に行動選択した場合は、全体としてその方向に移動する。しかし、その中の1つが、違う方向を選択した場合には、全体的に回転しながらちょっと移動する、という行動が取られる。この全体の行動は、シミュレーションで設定した、エージェントの重さ、摩擦、力の合成等による物理法則にのっとって計算しなければ、どのくらい回転してどのくらいの移動距離をとるのかが計算できない。つまり、「物理法則」とは、シミュレーションでエージェントの行動を計算するために用いるエージェントの重さ、摩擦、力の合成等の法則を意味する。 Please refer to FIG. 12. Because the simulation in this embodiment moves an object whose shape does not change, the whole does not move if all the learning agents (1201a to 1201d) select actions directed toward the center. If all learning agents select the same direction, the whole moves in that direction. However, if one of them selects a different direction, the whole moves a little while rotating. Unless this overall behavior is calculated according to the physical laws set in the simulation (the agent's weight, friction, composition of forces, and so on), it is not possible to determine how far the body rotates and how far it travels. In other words, “physical laws” means the laws, such as the agent's weight, friction, and composition of forces, that the simulation uses to calculate the agents' behavior.
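The three cases above (no motion, pure translation, translation with rotation) can be reproduced with a simple composition of forces on a rigid body. The sketch below is a simplification under stated assumptions: it ignores weight and friction, treats each learning agent's action as a unit force applied at a corner of the 20 × 40 rectangle, and uses hypothetical names (`CORNERS`, `compose`, `toward_centre`).

```python
import math

# Corner offsets of the four learning agents, measured from the
# centre of the 20 x 40 rectangle (an assumed layout).
CORNERS = [(-10.0, -20.0), (10.0, -20.0), (10.0, 20.0), (-10.0, 20.0)]


def compose(forces):
    """Return (net_fx, net_fy, torque) for one force vector per corner."""
    net_fx = sum(fx for fx, _ in forces)
    net_fy = sum(fy for _, fy in forces)
    # The 2-D cross product r x F is the turning moment about the centre.
    torque = sum(x * fy - y * fx for (x, y), (fx, fy) in zip(CORNERS, forces))
    return net_fx, net_fy, torque


def toward_centre():
    """Unit forces pointing from each corner to the centre."""
    forces = []
    for x, y in CORNERS:
        norm = math.hypot(x, y)
        forces.append((-x / norm, -y / norm))
    return forces


same_direction = [(1.0, 0.0)] * 4
one_differs = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
```

Under these assumptions, `compose(toward_centre())` gives zero net force and zero torque, `compose(same_direction)` gives a net force with zero torque, and `compose(one_differs)` gives both a net force and a non-zero torque. Turning these into an actual displacement and rotation angle would additionally need the mass, moment of inertia, and friction set in the simulation.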

ここで、図１１に戻る。
（７）S1160で、S1140で中間エージェントが受信した、学習エージェントの方向と、教育エージェントの方向が一致している場合は、中間エージェントは何もせず、学習エージェントが意思決定した方向に移動する。一致していない場合は、中間エージェントが示す方向に移動する。移動距離は、設定された1ステップ分とする。移動後に、ステップ数を１つ加算する。
（８）S1170では、ステップ数が設定された最大ステップ数を超えているかどうかを判定する。
（９）S1180で、ステップ数が最大ステップ数よりも少ない場合各エージェントは環境から報酬あるいはペナルティを獲得することにより、学習を行う。4対の各学習エージェントはQ-Learningを用いてQ値を更新し、1体の教育エージェントはTD-Learningを用いてV値を更新する。
（１０）S1190で、ゴールに到達したかどうかの判定を行う。
（１１）S1195で、エピソードが終了したかどうかの判定を行う。 Returning now to FIG. 11.
(7) In S1160, if the direction of the learning agent and the direction of the education agent received by the intermediate agent in S1140 match, the intermediate agent does nothing and the body moves in the direction decided by the learning agent. If they do not match, it moves in the direction indicated by the intermediate agent. The moving distance is one preset step. After moving, the step count is incremented by one.
(8) In S1170, it is determined whether the number of steps exceeds the set maximum number of steps.
(9) In S1180, if the number of steps is less than the maximum number of steps, each agent learns by acquiring a reward or a penalty from the environment. Each of the four learning agents updates its Q value using Q-Learning, and the single educational agent updates its V value using TD-Learning.
(10) In S1190, it is determined whether the goal has been reached.
(11) In S1195, it is determined whether or not the episode has ended.
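The main flow S1110 to S1190 can be sketched as a single episode loop. This is a hedged outline, not the flowchart of FIG. 11 itself: the exact branching between S1170, S1180, and S1190 is assumed, and the callables (`learners_decide`, `teacher_decide`, `mediate`, `learn`) and the toy `LineEnv` environment are hypothetical placeholders for the components described in the text.

```python
class LineEnv:
    """Toy stand-in for the simulation: positions 0..goal on a line."""

    def __init__(self, goal):
        self.goal = goal
        self.pos = 0

    def reset(self):
        self.pos = 0

    def move(self, direction):
        self.pos = max(0, self.pos + direction)

    def at_goal(self):
        return self.pos == self.goal


def run_episode(env, learners_decide, teacher_decide, mediate, learn, max_step):
    """Run one episode; return True if the goal was reached."""
    env.reset()                                   # S1110: back to the start
    step = 0
    while True:
        learner_dir = learners_decide(env)        # S1120: Q-learning side
        teacher_dir = teacher_decide(env)         # S1120: TD-learning side
        if learner_dir == teacher_dir:            # S1140: compare directions
            move_dir = learner_dir
        else:
            move_dir = mediate(learner_dir, teacher_dir)   # S1150
        env.move(move_dir)                        # S1160: move one step
        step += 1
        if step > max_step:                       # S1170: step limit exceeded
            return False
        learn(env)                                # S1180: update Q and V
        if env.at_goal():                         # S1190: goal check
            return True
```

For example, `run_episode(LineEnv(3), lambda e: 1, lambda e: 1, lambda l, t: t, lambda e: None, 10)` reaches the goal in three steps, while learners that always disagree with the teacher and win the mediation run until the step limit.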

[各学習エージェントが意思決定するフロー：Ｓ1120に対応]（図13）
（１）S1310で、各学習エージェントが設定された観測範囲で、自分達自身が持っているQ-tableのQ値を観測する。
（２）S1330で、方策にしたがって学習エージェントが行動を選択する。ここではε-greedy方策を用いる。
（３）S1350で、各エージェントが4方向の中から１つ行動を選択する。力の合成によって、全体としての移動方向が決定する。 [Flow for each learning agent to make decisions: corresponds to S1120] (Figure 13)
(1) In S1310, each learning agent observes the Q values of its own Q-table within the set observation range.
(2) In S1330, the learning agent selects an action according to the policy. Here, the ε-greedy policy is used.
(3) In S1350, each agent selects one action from the four directions. The composition of forces determines the overall direction of movement.
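The ε-greedy policy named in S1330 can be sketched as follows. The function name `epsilon_greedy` and the random tie-breaking are assumptions for illustration; the text only states that the ε-greedy policy is used and that each agent picks one of four directions.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    best = max(q_values)
    # Break ties at random so an all-zero table does not always pick action 0.
    candidates = [a for a, q in enumerate(q_values) if q == best]
    return rng.choice(candidates)
```

With `epsilon=0.0` the call `epsilon_greedy([0.0, 0.2, 0.9, 0.1], 0.0)` always returns the greedy index 2; with a larger ε the agent occasionally explores one of the other directions.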

[教育エージェントが意思決定するフロー：Ｓ1120に対応]（図14）
（１）S1410で、教育エージェントが設定された観測範囲で、自分達自身が持っているV-tableのV値を観測する。
（２）S1430で、方策にしたがって教育エージェントが行動を選択する。ここではε-greedy方策を用いる。 [Flow of decision making by educational agent: corresponding to S1120] (Fig. 14)
(1) In S1410, the educational agent observes the V values of its own V-table within the set observation range.
(2) In S1430, the educational agent selects an action according to the policy. Here we use the ε-greedy strategy.

[中間エージェントが新たな行動を生成するフロー：Ｓ1150に対応]（図15）
（１）S1510で、ルールに従って行動を生成する。ここでは、一つめに教育エージェント優先型ルールを採用する。教育エージェントは8方向の中から1方向を示す。角度45度の幅の中で、最も高いQ値の学習エージェントの行動を選択する。なお、ここでいうルールというのは、教育エージェント優先方にするのか、ベクトル合成をとるのか、という意味でのルールである。
（２）S1530で、あるエピソード数が経過した後に、ゴールに到達しているかどうかの判定を行う。
（３）S1550で、到達していない場合は、新たなルールを用いる。ここでは、学習エージェントと教育エージェントのベクトル合成を用いる。学習エージェントが示す方向とQ値を１つのベクトルとし、教育エージェントが示す方向とV値をもう１つのベクトルとし、ベクトル合成した方向を選択する。 [Intermediate agent generates a new action: corresponds to S1150] (Figure 15)
(1) In S1510, an action is generated according to a rule. Here, the education-agent-priority rule is adopted first. The educational agent indicates one of eight directions. Within a width of 45 degrees, the action of the learning agent with the highest Q value is selected. The “rule” here refers to the choice between giving priority to the education agent and taking the vector sum.
(2) In S1530, it is determined whether or not the goal has been reached after a certain number of episodes have elapsed.
(3) In S1550, if the goal has not been reached, a new rule is used. Here, the vector sum of the learning agent and the educational agent is used. The direction indicated by the learning agent and its Q value form one vector, the direction indicated by the education agent and its V value form another, and the direction of their vector sum is selected.
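The two rules of the intermediate agent and the switch between them can be sketched as below. Several points are assumptions made for the example: directions are expressed as angles in degrees, the 45-degree width is read as a sector of ±22.5 degrees centred on the education agent's direction, an empty sector falls back to the education agent's direction, and the function names are hypothetical.

```python
import math


def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def teacher_priority(teacher_deg, candidates, width_deg=45.0):
    """Rule (1): highest-Q candidate inside the sector around the teacher.

    candidates is a list of (direction_deg, q_value) pairs.
    """
    inside = [(q, d) for d, q in candidates
              if angle_diff(d, teacher_deg) <= width_deg / 2]
    return max(inside)[1] if inside else teacher_deg


def vector_sum(learner_deg, q_value, teacher_deg, v_value):
    """Rule (2): direction of the sum of the (direction, Q) and (direction, V) vectors."""
    lr, tr = math.radians(learner_deg), math.radians(teacher_deg)
    x = q_value * math.cos(lr) + v_value * math.cos(tr)
    y = q_value * math.sin(lr) + v_value * math.sin(tr)
    return math.degrees(math.atan2(y, x)) % 360.0


def select_rule(episodes_elapsed, goal_reached, switch_after):
    """Switch to rule (2) if the goal is still not reached after switch_after episodes."""
    if episodes_elapsed >= switch_after and not goal_reached:
        return "vector_sum"
    return "teacher_priority"
```

Here `switch_after` plays the role of the switching-timing parameter mentioned in the text; how that parameter is adapted autonomously is not covered by this sketch.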

[各学習エージェントが学習するフロー：Ｓ1180に対応]（図16）
（１）S1610で、移動したあとでゴールに到達したかどうかの判定する。
（２）S1620で、到達していない場合、衝突したどうかの判定を行う。
（３）S1630で、ゴールに到達した場合、報酬1を得る。
（４）S1640で、ゴールに到達しておらず、衝突もしていない場合、報酬0を得る。
（５）S1650で、衝突した場合、報酬 −1/Max_step(ペナルティ)を得る。
（６）S1660で、Q-LearningにおけるQ値の更新式にしたがって、観測した範囲のすべてのQ値を更新する。Ｑ値の更新式は、
（数１）
Q(s,a)←(1-α)Q(s,a)+α[r+γmax_a'Q(s_t+1,a_t+1)]
で表される。
ここで、ｓは状態、ａは行動、Ｑ（ｓ，ａ）は、状態がｓ、行動がａのときのＱ値、αはステップサイズパラメータ、γは割引率、Ｑ（s_t+1,a_t+1）は、状態がｓ_t+1、行動がａ_t+1のときのＱ値を意味する。「状態がｓ_t+1、行動がａ_t+1のとき」とは、次に取りうる状態、行動、の対を意味する。
rは外部環境が与える報酬のことで、ペナルティの場合は通常マイナス、報酬の場合はプラスの値を与る。α、γの値の範囲は0≦α、γ≦1となり、α=γである必要はない。
max_a'は、次に取ることが可能な全ての状態、行動、の対において、最もＱ値が最大になるような行動を選択することを意味する。ｍａｘ_a'Ｑ(s_t+1,a_t+1)は、次のステップで取りうる全ての状態、行動、の対において、Ｑ値が最大となるような行動ａを取ったときのＱ値である。 [Flows learned by each learning agent: corresponding to S1180] (Fig. 16)
(1) In S1610, it is determined whether or not the goal has been reached after moving.
(2) In S1620, if the goal has not been reached, it is determined whether or not a collision has occurred.
(3) In S1630, if the goal has been reached, a reward of 1 is obtained.
(4) In S1640, if the goal has not been reached and no collision has occurred, a reward of 0 is obtained.
(5) In S1650, if a collision has occurred, a reward of -1/Max_step (a penalty) is obtained.
(6) In S1660, all Q values in the observed range are updated according to the Q-value update formula of Q-Learning. The update formula for the Q value is expressed as
(Equation 1)
$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s_{t+1}, a_{t+1})\right]$$
Here, $s$ is a state, $a$ is an action, $Q(s,a)$ is the Q value when the state is $s$ and the action is $a$, $\alpha$ is the step size parameter, $\gamma$ is the discount rate, and $Q(s_{t+1}, a_{t+1})$ is the Q value when the state is $s_{t+1}$ and the action is $a_{t+1}$. “When the state is $s_{t+1}$ and the action is $a_{t+1}$” refers to a state-action pair that can be taken next.
r is the reward given by the external environment; it is usually negative for a penalty and positive for a reward. The values of α and γ lie in the range 0 ≦ α, γ ≦ 1, and α = γ is not required.
$\max_{a'}$ means selecting, among all state-action pairs that can be taken next, the action that gives the largest Q value. $\max_{a'} Q(s_{t+1}, a_{t+1})$ is the Q value obtained when, among all state-action pairs that can be taken in the next step, the action a that maximizes the Q value is taken.
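The reward assignment of S1610 to S1650 and the update of Equation 1 can be sketched together as follows. The dictionary-based Q-table and the function names `reward` and `q_update` are assumptions for illustration; the reward values (1 at the goal, -1/Max_step on collision, 0 otherwise) and the update formula follow the text.

```python
def reward(reached_goal, collided, max_step):
    """Reward scheme of S1610 to S1650."""
    if reached_goal:
        return 1.0
    if collided:
        return -1.0 / max_step   # penalty
    return 0.0


def q_update(q, s, a, r, s_next, actions, alpha, gamma):
    """Apply Q(s,a) <- (1-alpha) Q(s,a) + alpha [r + gamma max_a' Q(s_next, a')].

    q is a dict keyed by (state, action); missing entries count as 0.0.
    """
    best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
    q[(s, a)] = (1 - alpha) * q.get((s, a), 0.0) + alpha * (r + gamma * best_next)
    return q[(s, a)]
```

Starting from an empty table, `q_update(q, "s1", "down", 1.0, "goal", ["up", "down"], 0.5, 0.9)` returns α = 0.5, since both the old value and the best next value are zero.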

ここで、式のＱ値更新過程について詳述する。図１７を参照頂きたい。
(s,a)は状態、行動の種類である。この状態のときにこういう行動をとったときのQ値がいくつである、ということをあらわすのがQ(s,a)である。よって(s,a)には便宜上、番号を振り分けているだけで、具体的な値とは厳密には違う。Q値には具体的な値が入る。
今、例えば、状態sが座標（1701、1703、1705）とし、行動ａが上下の2つ（1707〜1717）だとすると、
（数２）
Q[x0][y0][上]＝０、Q[x0][y0][下]=0
Q[x1][y1][上]＝０、Q[x1][y1][下]=0
Q[x2][y2][上]＝０、Q[x2][y2][下]=0
が初期値となる。
[X0][y0]がスタート、[x2][y2]がゴールとする。まずスタートから移動して次に、[x1][y1]に居て、エージェントが下に行ったときにゴールに到達したとすると(矢印1713)、Q(s,a)は上式で更新されるが、[x1][y1][下]のところの更新は、r=1.0が入るので(ｒは報酬である。ゴールに到達したときの報酬r=1と設定している。次の状態の報酬も観測できるので、下という行動をとれば、報酬r=1がもらえる(ゴールに到達する)、ということがわかる。したがってr=1となる。)、ここのQ値を計算すると、
(1-α)Q(s,a)+α[r+γmax_a'Q(s_t+1,a_t+1)]において、Q(s,a)=0, max_a'Q(s_t+1,a_t+1)=0 なので、単純にQ[x1][y1][下]=αとなり、Q値が上がる。(他はr=0ならば、上記はゼロのまま。)
最初のエピソード終了後の値は、以下のとおりである。
（数３）
Q[x0][y0][上]＝０、Q[x0][y0][下]=0
Q[x1][y1][上]＝０、Q[x1][y1][下]=α
Q[x2][y2][上]＝０、Q[x2][y2][下]=0
Ｑ値を維持したまま、再度スタート地点から試行を開始した場合(つまり次のエピソードにおいて)、[x0][y0]にいるエージェントがまわりのQ値を観測したときに、Q値の一番高い[x1][y1][下]のQ値をmax_a'Q(s_t+1,a_t+1)の項に使う。したがって、[x0][y0][下]の値は、
(1-α)Q(s,a)+α[r+γmax_a'Q(s_t+1,a_t+1)]において、
Q(s,a)=0, r=0(ゴール以外の状態にある報酬はゼロと設定した。もし、衝突した場合はr=−１/Max_stepの値が入るので、Q値は下がる。)、max_a'Q(s_t+1,a_t+1)=αとなるので、Q[x0][y0][下]=α＊αγとなる。
さらに、[x1][y1][下]のときにゴールにまた到達するので、このとき[x1][y1][下]のQ値はαなので、上式でQ(s,a)=αとなる。よって、このエピソード終了時には、
（数４）
Q[x0][y0][上]＝０、Q[x0][y0][下]= α＊αγ
Q[x1][y1][上]＝０、Q[x1][y1][下]=（1-α）＊α+α
Q[x2][y2][上]＝０、Q[x2][y2][下]=0
となり、ゴールに近いところから順にQ値が更新されていく。周りにQ値が全部ゼロの時は、ランダムに行動を選択するしかないが、一度ゴールに到達すると、そこからどんどん学習がすすんで行く。 Here, the Q-value update process of the formula is described in detail. Please refer to FIG. 17.
(s, a) denotes the kind of state and action. Q(s, a) expresses what the Q value is when such an action is taken in such a state. Therefore (s, a) is merely a label assigned for convenience and is, strictly speaking, not a specific value. The Q value holds a specific value.
Now suppose, for example, that the state s is one of the coordinates (1701, 1703, 1705) and the action a is one of two, up and down (1707 to 1717). Then
(Equation 2)
Q [x0] [y0] [Up] = 0, Q [x0] [y0] [Down] = 0
Q [x1] [y1] [Up] = 0, Q [x1] [y1] [Down] = 0
Q [x2] [y2] [Up] = 0, Q [x2] [y2] [Down] = 0
are the initial values.
[x0][y0] is the start and [x2][y2] is the goal. Suppose the agent moves from the start, is next at [x1][y1], and reaches the goal when it moves down (arrow 1713). Q(s, a) is updated by the above formula, and the update at [x1][y1][Down] uses r = 1.0. (r is the reward; the reward for reaching the goal is set to r = 1. Since the reward of the next state can also be observed, the agent knows that taking the action “down” yields the reward r = 1, that is, reaches the goal. Therefore r = 1.) Calculating the Q value here,
in $(1-\alpha)\,Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s_{t+1}, a_{t+1})\right]$ we have $Q(s,a) = 0$ and $\max_{a'} Q(s_{t+1}, a_{t+1}) = 0$, so simply Q[x1][y1][Down] = α, and the Q value rises. (Elsewhere r = 0, so those values remain zero.)
The values after the end of the first episode are as follows:
(Equation 3)
Q [x0] [y0] [Up] = 0, Q [x0] [y0] [Down] = 0
Q [x1] [y1] [Up] = 0, Q [x1] [y1] [Down] = α
Q [x2] [y2] [Up] = 0, Q [x2] [y2] [Down] = 0
When a trial is started again from the start point with the Q values kept (that is, in the next episode), the agent at [x0][y0] observes the surrounding Q values and uses the highest one, the Q value of [x1][y1][Down], for the term $\max_{a'} Q(s_{t+1}, a_{t+1})$. Therefore, for the value of [x0][y0][Down],
in $(1-\alpha)\,Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s_{t+1}, a_{t+1})\right]$
we have $Q(s,a) = 0$, $r = 0$ (the reward in states other than the goal is set to zero; on a collision, r = -1/Max_step is used, so the Q value falls), and $\max_{a'} Q(s_{t+1}, a_{t+1}) = \alpha$, so Q[x0][y0][Down] = α * αγ.
Furthermore, the goal is reached again from [x1][y1][Down]. At this point the Q value of [x1][y1][Down] is α, so Q(s, a) = α in the above formula. Thus, at the end of this episode,
(Equation 4)
Q [x0] [y0] [Up] = 0, Q [x0] [y0] [Down] = α * αγ
Q [x1] [y1] [Up] = 0, Q [x1] [y1] [Down] = (1-α) * α + α
Q [x2] [y2] [Up] = 0, Q [x2] [y2] [Down] = 0
Thus, the Q values are updated in order, starting from the locations closest to the goal. When all the surrounding Q values are zero, the agent has no choice but to select actions at random, but once it has reached the goal, learning progresses steadily from there.
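The two-episode trace above can be checked numerically. The sketch below assumes concrete values α = 0.5 and γ = 0.9 (not given in the text), labels the three states 0, 1, 2 for [x0][y0], [x1][y1], [x2][y2], and assumes the agent always moves down, as in the walkthrough.

```python
ALPHA, GAMMA = 0.5, 0.9          # assumed values for illustration
ACTIONS = ("up", "down")
# States 0, 1, 2 stand for [x0][y0] (start), [x1][y1], [x2][y2] (goal).
q = {(s, a): 0.0 for s in range(3) for a in ACTIONS}


def step_down(s):
    """Move down from state s, apply the Q update, and return the next state."""
    s_next = s + 1
    r = 1.0 if s_next == 2 else 0.0          # reward 1 only on reaching the goal
    best_next = max(q[(s_next, a)] for a in ACTIONS)
    q[(s, "down")] = (1 - ALPHA) * q[(s, "down")] + ALPHA * (r + GAMMA * best_next)
    return s_next


def run_episode():
    s = 0
    while s != 2:
        s = step_down(s)


run_episode()
after_first = dict(q)     # Q[x1][y1][Down] = alpha
run_episode()
after_second = dict(q)    # Q[x0][y0][Down] = alpha * alpha * gamma
```

After the first episode only Q[x1][y1][Down] is non-zero and equals α; after the second, Q[x0][y0][Down] = α·αγ and Q[x1][y1][Down] = (1-α)·α + α, in agreement with Equations 3 and 4.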

[教育エージェントが学習するフロー：Ｓ1180に対応]（図１８）
（１）S1810で、移動したあとでゴールに到達したかどうかの判定する。
（２）S1820で、到達していない場合、衝突したどうかの判定を行う。
（３）S1830で、ゴールに到達した場合、報酬1を得る。
（４）S1840で、ゴールに到達しておらず、衝突もしていない場合、報酬0を得る。
（５）S1850で、衝突した場合、報酬 −1/Max_step(ペナルティ)を得る。
（６）S1860で、TD-LearningにおけるV値の更新式にしたがって、観測した範囲のすべてのV値を更新する。
更新式は、
（数５）
V(s_t)←V(s_t)+α[r_t+γV(s_t+1)-V(s_t)]
で表される。
ここで、ｓ_tは時刻ｔにおける状態ｓ、s_t+1は、時刻ｔ＋１における状態ｓ、Ｖ（ｓ_t）は、状態ｓ_tのときのＶ値、Ｖ（s_t+1）は、状態s_t+1のときのＶ値、αはステップサイズパラメータ、γは割引率、r_tは状態s_tにおける報酬を意味する。
rは外部環境が与える報酬のことで、ペナルティの場合は通常マイナス、報酬の場合はプラスの値を与える。α、γの値の範囲は0≦α、γ≦1となり、α=γである必要は無い。
本実施例では、Ｖ（ｓ）は、Ｖ（x，ｙ）の２次元配列である（（ｘ、ｙ）は、教育エージェントの位置である）。（学習エージェントの座標ではない。教育エージェントは四角。学習エージェントは丸） [Flow of education agent learning: corresponds to S1180] (Fig. 18)
(1) In S1810, it is determined whether or not the goal has been reached after moving.
(2) In S1820, if the goal has not been reached, it is determined whether or not a collision has occurred.
(3) In S1830, if the goal has been reached, a reward of 1 is obtained.
(4) In S1840, if the goal has not been reached and no collision has occurred, a reward of 0 is obtained.
(5) In S1850, if a collision has occurred, a reward of -1/Max_step (a penalty) is obtained.
(6) In S1860, all V values in the observed range are updated according to the V-value update formula of TD-Learning.
The update formula is expressed as
(Equation 5)
$$V(s_t) \leftarrow V(s_t) + \alpha\left[r_t + \gamma V(s_{t+1}) - V(s_t)\right]$$
Here, $s_t$ is the state s at time t, $s_{t+1}$ is the state s at time t+1, $V(s_t)$ is the V value in state $s_t$, $V(s_{t+1})$ is the V value in state $s_{t+1}$, $\alpha$ is the step size parameter, $\gamma$ is the discount rate, and $r_t$ is the reward in state $s_t$.
r is a reward given by the external environment, which is usually negative for penalties and positive for rewards. The range of the values of α and γ is 0 ≦ α and γ ≦ 1, and it is not necessary that α = γ.
In this embodiment, V (s) is a two-dimensional array of V (x, y) ((x, y) is the position of the educational agent). (It is not the coordinates of the learning agent. The educational agent is a square. The learning agent is a circle.)
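The update of Equation 5 over a table indexed by the education agent's position (x, y) can be sketched as below. The dictionary storage and the name `td_update` are assumptions for illustration.

```python
def td_update(v, s, r, s_next, alpha, gamma):
    """Apply V(s) <- V(s) + alpha * (r + gamma * V(s_next) - V(s)) in place.

    v is a dict keyed by (x, y); missing entries count as 0.0.
    """
    v_s = v.get(s, 0.0)
    v[s] = v_s + alpha * (r + gamma * v.get(s_next, 0.0) - v_s)
    return v[s]
```

For instance, with V((2, 2)) = α = 0.5, γ = 0.9, and r = 0, the call `td_update(v, (1, 1), 0.0, (2, 2), 0.5, 0.9)` raises V((1, 1)) from 0 to α·γ·α = 0.225.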

Ｖ値の更新の過程は、上述のＱ値のそれと類似する。Ｑ値の場合には「状態」、「行動」の双方が変数となていたが、Ｖ値の場合には「状態」のみが変数となっているために、Ｑ値の場合のＱ(ｓ、下)のみがＶ(ｓ)であると考えれば、Ｖ値が更新される様子が理解できる。
図１９で、
（数６）
V[x0][y0]=0
V[x1][y1]=0
V[x2][y2]=0
が初期値となる。
[X0][y0]（1901）がスタート、[x2][y2]（1903）がゴールとする。まずスタート1901から移動し、偶然ゴール1903であるV[x2][y2]に到達したとする。V(s)は上式で更新されるが、[x2][y2]のところの更新は、r=1.0が入るので(ｒは報酬である。ゴールに到達したときの報酬r=1と設定している。次の状態の報酬も観測できるので、下という行動をとれば、報酬r=1がもらえる(ゴールに到達する)、ということがわかる。したがってr=1となる。)、ここのV値を計算すると、
V(s_t)←V(s_t)+α[r_t+γV(s_t+1)-V(s_t)]において、V(s)=0, V(s_t+1)=0 なので、単純にV[x2][y2]=αとなり、V値が上がる。(他はr=0ならば、上記はゼロのまま。)
最初のエピソード終了後の値は、以下のとおりである。
（数７）
V[x0][y0]=0
V[x1][y1]=0
V[x2][y2]= α
V値を維持したまま、再度スタート地点（1901）から試行を開始した場合(つまり次のエピソードにおいて)、[x1][y1]（1905）にいるエージェントがまわりのV値を観測したときに、V値の一番高い[x2][y2]（1903）のV値をV(s_t+1)の項に使う。したがって、[x1][y1]の値は、
V(s_t)←V(s_t)+α[r_t+γV(s_t+1)-V(s_t)]において、
V(s)=0, r=0(ゴール以外の状態にある報酬はゼロと設定した。もし、衝突した場合はr=−１/Max_stepの値が入るので、V値は下がる。)、V(s_t+1)=αとなるので、V[x1][y1]=α＊αγとなる。
さらに、[x2][y2]（1903）のときにゴールにまた到達する。このとき[x2][y2]（1903）のV値はαなので、上式でV(s)=αとなる。よって、このエピソード終了時には、
（数８）
V[x0][y0]= 0
V[x1][y1]= α＊αγ
V[x2][y2]= α+α(1+αγ-α)=2α+α(αγ-α)
となり、ゴールに近いところから順にV値が更新されていく。周りにV値が全部ゼロの時は、ランダムに行動を選択するしかないが、一度ゴールに到達すると、そこからどんどん学習がすすんで行く。 The process of updating the V value is similar to that of the Q value described above. For the Q value, both the “state” and the “action” are variables, whereas for the V value only the “state” is a variable. If Q(s, down) in the Q-value case is read simply as V(s), the way the V value is updated can be understood.
In FIG. 19,
(Equation 6)
V [x0] [y0] = 0
V [x1] [y1] = 0
V [x2] [y2] = 0
are the initial values.
[x0] [y0] (1901) is the start and [x2] [y2] (1903) is the goal. First, assume that the agent moves from the start 1901 and happens to reach V [x2] [y2], which is the goal 1903. V(s) is updated by the above formula, and the update at [x2] [y2] uses r = 1.0. (r is the reward; the reward for reaching the goal is set to r = 1. Since the reward of the next state can also be observed, the agent can see that taking the action "down" yields the reward r = 1, that is, reaches the goal. Therefore r = 1.) Calculating the V value here:
In V(s_t) ← V(s_t) + α[r_t + γV(s_{t+1}) - V(s_t)], V(s) = 0 and V(s_{t+1}) = 0, so simply V [x2] [y2] = α, and the V value increases. (Elsewhere r = 0, so the values remain zero.)
The values after the end of the first episode are as follows:
(Equation 7)
V [x0] [y0] = 0
V [x1] [y1] = 0
V [x2] [y2] = α
If the trial is started again from the start point (1901) while keeping the V values (that is, in the next episode), the agent at [x1] [y1] (1905) observes the surrounding V values and uses the V value of [x2] [y2] (1903), which is the highest, for the V(s_{t+1}) term. Therefore, the value of [x1] [y1] is obtained as follows.
In V(s_t) ← V(s_t) + α[r_t + γV(s_{t+1}) - V(s_t)],
V(s) = 0, r = 0 (the reward in any state other than the goal is set to zero; if a collision occurs, r = -1/Max_step is used, so the V value decreases), and V(s_{t+1}) = α, so V [x1] [y1] = α * αγ.
Furthermore, the goal is reached again at [x2] [y2] (1903). At this time the V value of [x2] [y2] (1903) is α, so V(s) = α in the above formula. Therefore, at the end of this episode,
(Equation 8)
V [x0] [y0] = 0
V [x1] [y1] = α * αγ
V [x2] [y2] = α + α (1 + αγ-α) = 2α + α (αγ-α)
The V values are thus updated in order, starting from the cells closest to the goal. While all surrounding V values are zero, the agent has no choice but to select actions at random, but once the goal has been reached, learning progresses steadily from there.
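The arithmetic of Equations 6 to 8 can be checked numerically. The following is a minimal sketch, not the patented implementation: the function name td_update and the values α = 0.5 and γ = 0.9 are assumptions chosen only for illustration, and the next-state values passed in are those used in the worked example above.

```python
def td_update(v: float, r: float, v_next: float, alpha: float, gamma: float) -> float:
    """One TD-learning step: V(s_t) <- V(s_t) + alpha * (r_t + gamma * V(s_{t+1}) - V(s_t))."""
    return v + alpha * (r + gamma * v_next - v)


alpha, gamma = 0.5, 0.9  # illustrative values only

# Episode 1, goal cell [x2][y2]: V = 0, r = 1, next value 0  ->  alpha (Equation 7)
v2_ep1 = td_update(0.0, 1.0, 0.0, alpha, gamma)

# Episode 2, cell [x1][y1]: V = 0, r = 0, next value alpha  ->  alpha * alpha * gamma
v1_ep2 = td_update(0.0, 0.0, v2_ep1, alpha, gamma)

# Episode 2, goal cell again: V = alpha, r = 1, next-state term alpha * gamma (Equation 8)
v2_ep2 = td_update(v2_ep1, 1.0, v2_ep1, alpha, gamma)
```

With these inputs the three results match α, α * αγ, and 2α + α(αγ - α) respectively, which is the progression from the goal outward that the text describes.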

時間的なあいまいさを含む問題例： Kheperaロボットの行動制御
ここでは、個人に密接に適応するマンマシンインタフェースをとりあげる。
時間的なあいまいさを含む問題のタスクの例としては、寝たきりの患者が、限られた空間内にある欲しい物、例えば薬や水などを、学習エージェント（本実施例ではKheperaロボットだが、本発明はこれに限定されない）に取ってきてもらうというものである（図２０）。ここで対象とする患者は、自分の意志で身体を自由に動かすことが難しい人、会話でコミュニケーションが取れない人、どこか一箇所の筋肉を自分の意志で動かすことのできる人(眉、指、片腕など)とする。このため従来手法である音声認識は使用することができない。 Example of problems including temporal ambiguity: Khepera robot behavior control Here, we take up a man-machine interface that closely adapts to individuals.
As an example of a task involving temporal ambiguity, a bedridden patient has a learning agent (a Khepera robot in this embodiment, although the present invention is not limited to this) fetch a desired item, such as medicine or water, located within a limited space (FIG. 20). The target patients here are people who have difficulty moving their bodies freely at will, who cannot communicate by conversation, and who can move some single muscle at will (an eyebrow, a finger, one arm, etc.). For this reason, conventional speech recognition cannot be used.

本実施例では、学習エージェントが、予め設定されたスタートからゴールまで障害物回避を行いながら到達する、という環境を前提とする（図21）。この環境には、通行可能な領域と、通行不可能な障害物が存在する。教育エージェント、中間エージェント、学習エージェント、の全エージェントにとって、ゴールの位置と環境は未知である。オペレータである人（本実施例では教育エージェント）も、ゴールのだいたいの方向は把握しているが、正確な位置は、障害物の陰になって見えないような実験環境で実験を行っている。オペレータも、学習エージェントがゴールに到達したか否かは、環境から教えられる。実際のプログラム上では、ある決められたゴール座標に学習エージェント（Kheperaロボット）が到達したらゴールと報酬を与えるが、このゴールの位置情報はどのエージェントにとっても未知な状態でスタートする。これは環境のみがもっている情報である。欲しいものが何かの影になって、オペレータにとってもはっきりとは確認できない状況が考えられるので、ロボットには自律性が要求される。
教育エージェント（患者）は、学習エージェントの行動を観測し、その行動が不快と感じた時にスイッチを押す。このシステムにおける教育エージェントは患者である。
学習エージェントは、Kheperaロボットであり、内部に強化学習を構築している。
Kheperaロボットは、図２２−Ａに示すように、略円柱形状であり、２個の移動用モータ（同図（ｂ）の2205ａ、ｂ）を持ち、８個の赤外線センサ（同図2203ａ〜ｈ）を有している。移動可能な方向は３６０度方向を４５度ずつ均等に８つに分けた、８方向である（図２２−Ｂの2210）。また、１ステップ当りの移動距離は、Kheperaロボット2215の半径の45％である。１ステップ上方向に移動した場合の位置を2217で示す。
本実施例では、教育エージェントが押すスイッチがそのままオンラインでペナルティ(評価値)として強化学習に与えられ、学習エージェントはすべてを細かく教えてもらわなくても、自分自身の意思決定によって、ゴールに到達できるように、繰り返し学習をおこなう。 In this embodiment, the environment assumed is one in which the learning agent travels from a preset start to a goal while avoiding obstacles (FIG. 21). This environment contains passable areas and impassable obstacles. The goal position and the environment are unknown to all agents: the educational agent, the intermediate agent, and the learning agent. The human operator (the educational agent in this embodiment) knows the general direction of the goal, but the experiment is conducted in an environment where the exact position is hidden behind obstacles. The operator, too, is told by the environment whether the learning agent has reached the goal. In the actual program, when the learning agent (Khepera robot) reaches predetermined goal coordinates, the goal is declared and a reward is given, but every agent starts without knowing this goal position; only the environment holds this information. Because the desired item may be hidden behind something and not clearly visible even to the operator, the robot is required to be autonomous.
The educational agent (patient) observes the behavior of the learning agent and presses the switch when the behavior is uncomfortable. The educational agent in this system is a patient.
The learning agent is a Khepera robot with reinforcement learning built in.
As shown in FIG. 22-A, the Khepera robot is substantially cylindrical and has two drive motors (2205a and 2205b in (b) of the same figure) and eight infrared sensors (2203a to 2203h in the same figure). It can move in eight directions, obtained by dividing 360 degrees equally into 45-degree steps (2210 in FIG. 22-B). The travel distance per step is 45% of the radius of the Khepera robot 2215. The position after moving one step upward is indicated by 2217.
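The eight movement directions and the step length can be illustrated as follows. This is a minimal sketch: the direction numbering (0 = up, increasing clockwise) and the function name are assumptions not given in the source; only the 45-degree spacing and the step length of 45% of the radius come from the text.

```python
import math


def step_displacement(direction: int, radius: float) -> tuple[float, float]:
    """Displacement (dx, dy) of one step in one of the 8 directions.

    Direction 0 is taken as "up" (+y) and the index increases clockwise
    in 45-degree steps (this numbering is an assumption).
    """
    if not 0 <= direction < 8:
        raise ValueError("direction must be in 0..7")
    step = 0.45 * radius                  # step length: 45% of the robot radius
    angle = math.radians(45 * direction)  # measured clockwise from +y
    return (step * math.sin(angle), step * math.cos(angle))
```

For example, step_displacement(0, r) moves the robot 0.45 r straight up, which corresponds to the position marked 2217.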
In this embodiment, each press of the switch by the educational agent is passed directly, online, to the reinforcement learning as a penalty (evaluation value). The learning agent learns repeatedly so that it can reach the goal through its own decision making, without being taught every detail.

しかし、スイッチを押すタイミングの個人差が非常に大きく、どのような行動が不快と感じるかは、人によって違う。また、同じ行動を不快と感じても、ペナルティを与えるタイミングが人によって異なるため、学習エージェント内に構築された学習器にとって、あいまいな評価となる。そのため、全く同じ学習機構で、同じ内部構造(意思決定構造)を持っていても、学習結果に大きな隔たりがおき、ある人に対しては成功するが、別の人に対しては、失敗する、というようなことが起こる。
すなわち、ある人に対しては、学習が収束するが、別の人に対しては、全く学習がすすまない。これは、学習器にとって、人の評価の与え方が時間的にあいまいなためである。シミュレーション上では、同期を取るためこのような時間的なあいまいさが起こらない。しかし、実際のロボットを使うような実環境下では、非同期が必ず存在するため、このような時間的なあいまいさが起こる。
本発明では、この時間的なあいまいさの存在を発見し、これを解決するために、階層型のエージェントシステムを提案する。ここでは、先程まで述べた教育エージェント(人間)と学習エージェント(Kheperaロボット)の仲立ちをする中間層のエージェント(時間的なあいまいさの吸収役)の3層構造とする。具体的には、学習エージェントに強化学習の一つであるQ学習を採用する。Q学習では、Q値を更新することによって、学習がすすむ。
中間エージェントは、個人差によるタイミングの違いを吸収する。具体的には、教育エージェントがペナルティーを与えた時間から、学習エージェントが学習を更新する時間幅を、各個人に対して調整する個人適応型インターフェースの役割を担う。
学習エージェントがＱ学習で用いるQテーブルは、状態sと行動aの組み合わせからなるテーブル（配列）のことで、各テーブルの値がQ値と呼ばれるものである。ここでいう状態ｓとは、スタート位置を原点とした、現在の位置のｘ，ｙ座標であり、行動は上下左右、各方向の斜めの8方向である。Q値の更新には、報酬(マイナスの報酬の場合はペナルティとも呼ぶ)が必要である。ここでは、環境からのペナルティと、人間が与えるインタラクティブなペナルティの２つを採用する。 However, individual differences in the timing of pressing the switch are very large, and what kind of behavior feels uncomfortable varies from person to person. Moreover, even if the same behavior is felt uncomfortable, the timing for giving a penalty varies depending on the person, so it is an ambiguous evaluation for the learning device built in the learning agent. Therefore, even if they have exactly the same learning mechanism and the same internal structure (decision making structure), there is a big gap in the learning results, and it succeeds for one person, but fails for another person. That happens.
That is, learning converges for one person but makes no progress at all for another. This is because, from the learner's point of view, the way a person gives evaluations is ambiguous in time. In simulation, such temporal ambiguity does not arise because everything is synchronized. In a real environment with an actual robot, however, asynchrony is always present, and so this temporal ambiguity arises.
The present invention identifies the existence of this temporal ambiguity and, to resolve it, proposes a hierarchical agent system. Here, a three-layer structure is used: the educational agent (human) and the learning agent (Khepera robot) described above, plus an intermediate-layer agent that mediates between them and absorbs the temporal ambiguity. Specifically, Q-learning, a form of reinforcement learning, is adopted for the learning agent. In Q-learning, learning progresses by updating Q values.
Intermediate agents absorb timing differences due to individual differences. Specifically, it plays the role of a personal adaptive interface that adjusts the time width for the learning agent to update learning from the time when the educational agent gives a penalty.
The Q table used by the learning agent for Q learning is a table (array) composed of a combination of the state s and the action a, and the value of each table is called a Q value. The state s here is the x and y coordinates of the current position with the start position as the origin, and the actions are eight directions, up and down, left and right, and diagonal directions. To update the Q value, a reward (also called a penalty for negative rewards) is required. Here, we adopt two penalties: an environmental penalty and an interactive penalty given by humans.

上述のように、Kheperaロボットは、赤外線センサを用いて環境を観測する。環境からのペナルティは、Kheperaロボットの赤外線センサより得る。Kheperaロボットには赤外線センサが８個ついており、障害物の有無を、数値データとして障害物までの距離を観測可能である。しかし、赤外線データは誤差が非常に大きいため、数値データをガウス分布によって障害物の存在確率とし、その確率を環境からのペナルティとしてQ学習の更新を行う。
人間が与えるペナルティは、人がKheperaロボットの行動を観測し、不快に感じた時に、マウスをクリックする。一度クリックすると一定の値(-0.5)のペナルティが与えられ、Q値を更新する。Q値の更新範囲には、空間的な範囲と時間的な範囲がある。空間的な更新範囲は、Kheperaロボットと同じ大きさとし、中心をペナルティの最大値としたガウス分布で更新を行う。これにより、繰り返し学習を行う時の、空間的なずれ（すなわち、実環境下では、１回目の学習に用いた位置と2回目の位置が全く同じとは限らない）を補正することが可能である。時間的な更新範囲は、現在の時間tと、一ステップ前のt-1、現在と同じ行動をとると仮定したときの次のステップt+1の３つを用意し、(t-1、t)、(t、t+1)、(t-1、t、t+1)の３つの組み合わせの中から、個人に適応した更新時間を選択する。例えば、（ｔ−１、ｔ）の組合せを選んだ場合は、時刻tと時刻t-1の両方のQ値を更新する、ということになる。(t-1、t)、(t、t+1)、(t-1、t、t+1)の３つのうち、(t-1、t)は時刻t-1とtのQ値を更新し、(t、t+1)は時刻ｔとt+1のQ値を更新し、(t-1、t、t+1)はｔ−１とｔとｔ＋１の３つの時刻のQ値を同時に更新することを意味する。この３つの更新ルールのうち、どれかを選ぶ。この更新ルールの選択を行うのが中間層のエージェントである。「(t-1、t、t+1)はｔ−１とｔとｔ＋１の３つの時刻のQ値を同時に更新する」とは、ｔ−１とｔとｔ＋１の３つの時刻で、Ｑ値をそれぞれ、所定の式（例えば後述の（数９））にしたがって一斉に更新する、という意味である。 As described above, the Khepera robot observes the environment using an infrared sensor. Penalties from the environment are obtained from the infrared sensor of the Khepera robot. The Khepera robot is equipped with eight infrared sensors, and it is possible to observe the distance to the obstacle as numerical data, indicating the presence or absence of the obstacle. However, since the infrared data has a very large error, the numerical data is set as an obstacle existence probability by a Gaussian distribution, and Q learning is updated with the probability as a penalty from the environment.
The penalty given by a human works as follows: a person observes the behavior of the Khepera robot and clicks the mouse when the behavior feels uncomfortable. Each click gives a penalty of a fixed value (-0.5), and the Q value is updated. The update range of the Q value has a spatial extent and a temporal extent. The spatial update range is the same size as the Khepera robot, and the update follows a Gaussian distribution whose center carries the maximum penalty. This makes it possible to correct spatial deviations during repeated learning (in a real environment, the position used in the first learning run and the position in the second are not necessarily exactly the same). For the temporal update range, three times are prepared: the current time t, the previous step t-1, and the next step t+1 under the assumption that the same action as now is taken. From the three combinations (t-1, t), (t, t+1), and (t-1, t, t+1), the update time suited to the individual is selected. Here (t-1, t) updates the Q values at times t-1 and t, (t, t+1) updates the Q values at times t and t+1, and (t-1, t, t+1) updates the Q values at the three times t-1, t, and t+1 simultaneously. One of these three update rules is chosen, and it is the intermediate-layer agent that makes this choice. Updating the three times "simultaneously" means that the Q values at t-1, t, and t+1 are each updated together according to a predetermined formula (for example, Equation 9 described later).
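The three temporal update rules can be written down as sets of step offsets relative to the time t of the click. The following is a minimal sketch; the dictionary keys and the function name are hypothetical, and dropping steps before the start of the episode is an assumption made here for safety, not something stated in the source.

```python
PENALTY = -0.5  # fixed penalty value per click, as stated in the text

# The three candidate update rules, as step offsets relative to the click time t.
UPDATE_RULES = {
    "prev_now": (-1, 0),          # (t-1, t)
    "now_next": (0, 1),           # (t, t+1)
    "prev_now_next": (-1, 0, 1),  # (t-1, t, t+1)
}


def penalized_steps(t: int, rule: str) -> list[int]:
    """Time steps whose Q values receive the human penalty clicked at step t."""
    # Steps before the start of the episode (negative indices) are skipped.
    return [t + offset for offset in UPDATE_RULES[rule] if t + offset >= 0]
```

The intermediate-layer agent would then only need to switch the rule key to change which time steps a click affects.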

Kheperaロボットは、各学習の開始時(エピソード開始時)に、決められたスタート位置につく。ロボットの中に構築されているQテーブルのうち、最も高いQ値を持つ行動を選択する。最初のエピソードの時は、すべてのQ値が同じ値(ここではゼロ)であるため、前進するものとする。前に障害物が現れるか、人によるインタラクション(ペナルティ)が入るまで、前進を続ける。
センサ値とインタラクションによって、Q値が更新されると、Q値の最も高い行動を次に選択する。Q値の観測を繰り返し、ゴールまで到達すると、１エピソード終了である。
Kheperaロボットは、ゴールの位置を知らない。環境からそこがゴールだと知らされる。これをプログラミングで実装するときには、ゴールと設定した座標にKheperaロボットが到達したときに「ゴールした」と教えて、報酬を与える。
１エピソード内でゴールに到達しない場合は、中間層のエージェントが、更新ルールを変更する。適用する更新ルールの違いによる人間が与えるインタラクションの回数と学習した経路を結果として示す。適用するルールによっては、学習が全く進まない、あるいは逆に悪くなっていることがわかる。逆に悪くなっている例として、図２５−Ｂ、および、図２６−Ｂ、Ｃを参照して欲しい。これらの図中、丸又は四角形の中に「Ｇ」の文字が付された部分がゴール位置である。また、小さい白丸が環境からペナルティーが与えられた箇所である。そして、実線が学習エージェントの軌跡である。
図２５−Ａでは、エピソード(試行)の１回目から、１０回目、まで全てのエピソードでゴールに到達しており、更に、エピソード数が増す程、ペナルティーを受けなくなっている。
これに対して、図２５−Ｂでは、１回目は成功しているが、２回目の試行では失敗している。 The Khepera robot takes a fixed starting position at the start of each learning (at the start of an episode). Select the action with the highest Q value from among the Q tables built in the robot. At the time of the first episode, all Q values are the same value (here, zero), so it is assumed to move forward. Continue moving forward until an obstacle appears before or there is a human interaction (penalty).
When the Q value is updated by the sensor value and the interaction, the action having the highest Q value is selected next. When the Q value is observed repeatedly and the goal is reached, one episode is over.
The Khepera robot does not know the position of the goal; the environment informs it that it has arrived. When this is implemented in a program, the robot is told that it has reached the goal, and is given a reward, when it arrives at the coordinates set as the goal.
If the goal is not reached within one episode, the intermediate-layer agent changes the update rule. The results show the number of interactions given by the human and the learned route for each applied update rule. Depending on the rule applied, learning either does not progress at all or actually gets worse. For examples of learning getting worse, see FIG. 25-B and FIGS. 26-B and 26-C. In these figures, the circle or square marked with the letter "G" is the goal position, the small white circles are places where the environment gave a penalty, and the solid line is the trajectory of the learning agent.
In FIG. 25-A, the goal is reached in every episode (trial) from the first to the tenth, and the more episodes are run, the fewer penalties are received.
In contrast, in FIG. 25-B, the first attempt succeeds, but the second attempt fails.

中間エージェント内部では、教育エージェントがペナルティを与えるためにスイッチを押したタイミングと、そのペナルティによって学習したKheperaロボットの行動を観測し、ゴールに到達しない場合に、Kheperaロボットが学習を更新する範囲を自律的に調整する。
ここでは、人間のインタラクションを用いたロボット制御をタスクとしてあげたが、人間が直接オンラインで制御するものに対して、本階層型エージェントシステムの適用が可能である。また、ここでは、ロボットの障害物回避を例として説明してきたが、マニピュレータや、学習を組み込んである他の制御システムへの適用が可能である。この階層型エージェントシステムを適用する際に、教育エージェント、学習エージェント機構をできるだけ簡単にし、容易に変更可能なものとすると、中間層のエージェントの機構に重点を置くことで、自律適応型の制御システムの設計が容易となる。 Inside the intermediate agent, observe the timing when the educational agent pressed the switch to give a penalty and the behavior of the Khepera robot learned by the penalty, and if the goal is not reached, the range in which the Khepera robot updates learning is autonomous To adjust.
Here, robot control using human interaction was taken as the task, but this hierarchical agent system can be applied to anything that a human controls directly online. Likewise, although robot obstacle avoidance has been used as the example, the system can also be applied to manipulators and to other control systems that incorporate learning. When applying this hierarchical agent system, if the educational agent and learning agent mechanisms are kept as simple as possible and easy to change, the design effort can be concentrated on the intermediate-layer agent mechanism, which makes an autonomously adaptive control system easy to design.

実験環境１では、障害物１および２を配置し（図２３−Ａ Environment 1）、環境２では障害物２のみを配置した（図２３−Ｂ Environment ２）。それぞれの環境において、スタート場所からゴールまでの、Kheperraロボットの動作を観察した結果が図２４−Ａ〜Ｃである。この図中で、図２４−Ａ、および図２４−Ｂが環境1の実験結果であり、図２４−Ｃが、実験環境1で学習済みのQ値を用いて、環境だけを実験環境2に変更した１エピソード目の結果である。図２４−Ａ〜Ｃ中、2401がスタート位置、2403がゴール位置、2407（図２４−Ａ）、2409（図２４−Ｂ）、2411（図２４−Ｃ）は、センサからの情報で壁と判断した部分である。
これから分かるように、実験環境が変化しても別の環境で学習済みのQ値を用いてもゴールに到達できたことから、提案手法が環境へのロバスト性に優れていることがわかる。（-通常の強化学習では、環境が変化すると再学習が必要となったり、スイープ(これまで学習してきたQ値やV値をすべて初期状態に戻すこと)が必要となり、最初のエピソードでゴールに到達することはほとんど無い。）なお、この結果は、時間的あいまいさを補正している、図２５−Ａで示している成功した例の被験者Aの場合である。 In experimental environment 1, obstacles 1 and 2 were placed (FIG. 23-A, Environment 1), and in environment 2 only obstacle 2 was placed (FIG. 23-B, Environment 2). FIGS. 24-A to 24-C show the results of observing the motion of the Khepera robot from the start to the goal in each environment. FIGS. 24-A and 24-B are the experimental results for environment 1, and FIG. 24-C is the result of the first episode after changing only the environment to experimental environment 2 while using the Q values learned in experimental environment 1. In FIGS. 24-A to 24-C, 2401 is the start position, 2403 is the goal position, and 2407 (FIG. 24-A), 2409 (FIG. 24-B), and 2411 (FIG. 24-C) are the parts judged to be walls from the sensor information.
As can be seen, the goal was reached even though the experimental environment changed and the Q values had been learned in a different environment, which shows that the proposed method is highly robust to the environment. (In ordinary reinforcement learning, a change of environment requires relearning or a sweep, that is, resetting all Q and V values learned so far to their initial state, and the goal is almost never reached in the first episode.) Note that this result is for subject A, the successful case shown in FIG. 25-A, for whom the temporal ambiguity was corrected.

Kheperaロボットについてのエージェント学習方法を、フローチャートを用いて説明する。
[メインフロー]（図２７）
（１）S2710において、すべてのパラメータを初期化する。パラメータを以下に示す。
（表３）パラメータの一覧

（２）S2713で、Kheperaロボットがスタート地点に戻る。ここで、スタート地点とは、あらかじめ設定されたある一点のことで、この一点を中心にして、前回の試行（エピソード）と同じ角度に向けて設置される。スタート地点に戻った段階で、ステップ数sはゼロに、エピソード数episodeは1つ加算される。
（３）S2715においてKheperaロボットは赤外線センサを用いて、環境を観測する。
（４）S2717で、センサに反応があるかどうかで、前方に障害物があるかどうかの判定をする。
（５）S2719では、センサに反応がある場合、センサ値をもとに確率分布から報酬(マイナス)を更新する。ここでは、環境から報酬を受ける。
（６）S2721においてQ-Learningによる学習を行う。
（７）S2723では、移動する方向を、8方向の中から決定する。
（８）S2725で、設定された移動量で、1ステップ移動する。
（９）S2727では、教育エージェントからペナルティを受けたかどうかの判定をする。
（１０）S2729において、ペナルティを受けた場合、設定された値で報酬(マイナス)を更新する。ここでは、Ｓ2719とは異なり、人間から報酬を受ける。
（１１）S2731で、Q-Learningによる学習を行う。
（１２）S2733で、最大ステップ数を超えたかどうかの判定をする。
（１３）S2735で、最大ステップ数に到達した場合、ゴールに到達したかの判定を行う。
（１４）S2737で、最大ステップ数に到達していない場合、ゴールに到達したかどうかの判定を行う。
（１５）S2739で、ゴールに到達した場合、エピソードが設定された最大エピソード数を超えたかどうかの判定を行う。
（１６）S2741において、最大ステップ数を超えてもゴールに到達できなかった場合、中間エージェントがQ値を更新する時間的な範囲を変更し、変更ルールをKheperaロボットに送信する。
（１７）S2743で、Kheperaロボットが更新ルールを更新する。
（１８）S2745で、あらたな更新ルールに従って、履歴から再学習を行う。
（１９）S2747で、教育エージェントが自分の目でKheperaロボットの行動を観測する。
（２０）S2749で、Kheperaロボットの行動が自分自身にとって不快かどうかを直感的に判定する。
（２１）S2751で、Kheperaロボットの行動が不快な場合、スイッチを押すことによってペナルティを与える。
（２２）S2753で、Kheperaロボットの行動と教育エージェントのペナルティを与えるタイミングを観測する。 The agent learning method for the Khepera robot will be described using a flowchart.
[Main flow] (Fig. 27)
(1) In S2710, all parameters are initialized. The parameters are shown below.
(Table 3) List of parameters

(2) In S2713, the Khepera robot returns to the start point. The start point is a single preset point, and the robot is placed centered on this point and facing the same angle as in the previous trial (episode). On returning to the start point, the step count s is reset to zero and the episode count episode is incremented by one.
(3) In S2715, the Khepera robot observes the environment using an infrared sensor.
(4) In S2717, whether there is an obstacle ahead is determined from whether the sensors respond.
(5) In S2719, if the sensors respond, the (negative) reward is updated from the probability distribution based on the sensor values. Here the reward comes from the environment.
(6) In S2721, learning by Q-Learning is performed.
(7) In S2723, the moving direction is determined from eight directions.
(8) In S2725, move by one step with the set amount of movement.
(9) In S2727, it is determined whether a penalty has been received from the educational agent.
(10) In S2729, if a penalty has been received, the (negative) reward is updated with the set value. Here, unlike in S2719, the reward comes from a human.
(11) In S2731, learning by Q-Learning is performed.
(12) In S2733, it is determined whether the maximum number of steps has been exceeded.
(13) If the maximum number of steps is reached in S2735, it is determined whether the goal has been reached.
(14) If the maximum number of steps has not been reached in S2737, it is determined whether or not the goal has been reached.
(15) If the goal is reached in S2739, it is determined whether the number of episodes exceeds the set maximum number of episodes.
(16) In S2741, if the goal could not be reached even after the maximum number of steps was exceeded, the intermediate agent changes the temporal range over which Q values are updated and sends the changed rule to the Khepera robot.
(17) In S2743, the Khepera robot updates the update rule.
(18) In S2745, relearning is performed from the history according to a new update rule.
(19) In S2747, the educational agent observes the behavior of the Khepera robot with his or her own eyes.
(20) In S2749, the educational agent judges intuitively whether the behavior of the Khepera robot is uncomfortable to him or her.
(21) In S2751, if the behavior of the Khepera robot is uncomfortable, the educational agent gives a penalty by pressing the switch.
(22) In S2753, the behavior of the Khepera robot and the timing at which the educational agent gives penalties are observed.

[学習エージェントが意思決定するフロー：Ｓ2723に対応]（図28）
（１）S2810で、学習エージェントが設定された観測範囲で、自分達自身が持っているQ-tableのQ値を観測する。
（２）S2830で、方策にしたがって学習エージェントが行動を選択する。ここではgreedy方策を用いるため、最もQ値の高い行動のみを選択する。 [Learning Agent Decision-making Flow: Corresponds to S2723] (Figure 28)
(1) In S2810, the learning agent observes, within its set observation range, the Q values of its own Q-table.
(2) In S2830, the learning agent selects an action according to the policy. Here, since the greedy policy is used, only the action with the highest Q value is selected.
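The greedy selection in S2830 can be sketched as below. This is a minimal sketch under assumptions: the function name, the list representation of the Q values for one state, and the index used for "forward" are hypothetical. The fallback to moving forward when all Q values are equal follows the description of the first episode given earlier.

```python
def greedy_action(q_values: list[float], forward: int = 0) -> int:
    """Return the index of the action with the highest Q value.

    If every Q value is identical (for example all zero in the first
    episode), the robot simply moves forward.
    """
    best = max(q_values)
    if all(q == best for q in q_values):
        return forward
    return q_values.index(best)
```

Because the policy is purely greedy, exploration comes only from the penalties that lower the Q values of actions already tried.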

[学習エージェントが行動を取るフロー：Ｓ2725に対応]（図29）
（１）S2910で、最大回転角度を180度とし、入力された行動の方向に左右どちらか、回転角度が小さな方向に一定角度で、回転する。
（２）S2930で、回転が終了後、設定された1ステップの移動量を進む
（３）S2950で、ステップ数パラメータ sを1つ加算する。 [Flow for learning agent to take action: corresponds to S2725] (Figure 29)
(1) In S2910, with the maximum rotation angle set to 180 degrees, the robot rotates at a constant angular step toward the input action direction, turning left or right, whichever requires the smaller rotation.
(2) In S2930, after the rotation is complete, the robot advances by the set one-step travel distance.
(3) In S2950, the step count parameter s is incremented by one.
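Choosing the turning side with the smaller rotation, capped at 180 degrees, amounts to computing a signed shortest angle. The following is a minimal sketch; the function name and the sign convention (positive means turning toward increasing heading angle) are assumptions not specified in the source.

```python
def shortest_rotation(current_deg: float, target_deg: float) -> float:
    """Signed rotation in degrees, in the range (-180, 180], from current to target heading."""
    diff = (target_deg - current_deg) % 360.0
    if diff > 180.0:
        diff -= 360.0  # turning the other way is shorter
    return diff
```

For instance, going from heading 0 to heading 270 gives -90 rather than +270, so the rotation never exceeds 180 degrees in magnitude.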

[学習エージェントが学習するフロー：Ｓ2721およびＳ2731に対応]（図30）
Ｓ3010で、Q値をt_p<t<t_nの範囲で更新する。上述の概説においては、（t-1,t）、（t,t+1）、（t-1,t,t+1）の３つの場合に限定して説明したが、一般的化すると、このように、t_p<t<t_nの任意の時間範囲におけるＱ値を更新することも可能である。特に、お年寄り等であれば、スイッチを押すタイミングが大きく遅れてしまうため、１ステップ前後ではなく、もっと広い範囲でＱ値を更新する必要がある。
ここで、tは学習エージェントが現在（時刻ｔ）に存在する位置、t_p は現在いる位置からt_p時間前の位置、t_nは現在いる位置から、現在と同じ行動を取った場合にt_n時間後にいる位置である。これらの時間の学習エージェントの位置を中心に、設定された観測範囲でQ値を、t_pから順番に更新する。行使は、以下の式による。
（数９）
Q(s,a)←(1-α)Q(s,a)+α[r+γmax_a'Q(s_t+1,a_t+1)]
ここで、ｓは状態、ａは動作、Ｑ（ｓ，ａ）は、状態がｓ、動作がａのときのＱ値、αはステップサイズパラメータ、γは割引率、Ｑ（s_t+1,a_t+1）は、状態がｓ_t+1、動作がａ_t+1のときのＱ値を意味する。「状態がｓ_t+1、動作がａ_t+1のとき」とは、次に取りうる状態、行動、の対を意味する。
rは外部環境が与える報酬のことで、ペナルティの場合は通常マイナス、報酬の場合はプラスの値を与える。α、γの値の範囲は0≦α、γ≦1となり、α=γである必要は無い。
max_a'は、次に取ることが可能な全ての状態、状態対において最もＱ値が最大になるような行動を選択することを意味する。max_a'Q(s_t+1,a_t+1)は、次のステップで取りうる全ての状態、行動、の対において、Ｑ値が最大となるような行動ａを取ったときのＱ値である。
Ｑ値の更新過程については、実施例１での説明を参照頂きたい。
本実施例では、Q[agent_num]([s0][s1][s2][s3][s4][other],[action_num]) =Ｑ(s,a)、のｓ（状態）は、自身の中心座標であり、行動は自分の行動であり、[action_num]に相当する。つまり、Ｑ（[x],[y]，[action_num]）の３次元配列である。
agent_numはKheperaロボットのＤＣモータ付きのタイヤ、action_numは、action_numは４方向なので、いずれも４つの値を取る。
（同じＱ-Ｌｅａｒｎｉｎｇを使っているが、状態、行動、の取り方はKheperaロボットとピアノ問題では大きく異なる） [Flow of learning agent learning: corresponds to S2721 and S2731] (Fig. 30)
In S3010, the Q value is updated in the range t_p < t < t_n. The overview above was limited to the three cases (t-1, t), (t, t+1), and (t-1, t, t+1), but in general the Q values in an arbitrary time range t_p < t < t_n can be updated in this way. In particular, for elderly users and the like, the timing of pressing the switch is considerably delayed, so the Q values need to be updated over a wider range than one step before and after.
Here, t is the position where the learning agent is at present (time t), t_p is the position a time t_p before the current position, and t_n is the position the agent will be at a time t_n later if it keeps taking the same action as now. Centered on the positions of the learning agent at these times, the Q values within the set observation range are updated in order starting from t_p. The update uses the following formula.
(Equation 9)
Q(s, a) ← (1-α)Q(s, a) + α[r + γ max_{a'} Q(s_{t+1}, a_{t+1})]
Here, s is the state, a is the action, Q(s, a) is the Q value when the state is s and the action is a, α is the step size parameter, γ is the discount rate, and Q(s_{t+1}, a_{t+1}) is the Q value when the state is s_{t+1} and the action is a_{t+1}. "When the state is s_{t+1} and the action is a_{t+1}" refers to the state-action pair that can be taken next.
r is the reward given by the external environment; it is usually negative for a penalty and positive for a reward. Both α and γ lie in the range 0 ≤ α, γ ≤ 1, and α need not equal γ.
max_{a'} means selecting the action that maximizes the Q value over all state-action pairs that can be taken next. max_{a'} Q(s_{t+1}, a_{t+1}) is the Q value obtained when taking the action a that maximizes the Q value over all state-action pairs possible in the next step.
For the update process of the Q value, refer to the description in the first embodiment.
In this embodiment, in Q [agent_num] ([s0] [s1] [s2] [s3] [s4] [other], [action_num]) = Q (s, a), s (the state) is the agent's own center coordinates, and the action is the agent's own action, which corresponds to [action_num]. That is, Q is a three-dimensional array Q ([x], [y], [action_num]).
agent_num refers to the DC-motor-driven wheels of the Khepera robot, and action_num covers four directions, so each takes four values.
(The same Q-Learning is used, but the way states and actions are defined differs greatly between the Khepera robot and the piano problem.)
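The update of Equation 9 on a three-dimensional table Q([x], [y], [action_num]) can be sketched as follows. This is a minimal sketch, not the patented implementation: the function names, the nested-list layout, the default of eight actions (taken from the eight movement directions described earlier), and the default values of α and γ are all assumptions.

```python
def make_q_table(width: int, height: int, n_actions: int = 8) -> list:
    """Create a Q table indexed as q[x][y][action], initialised to zero."""
    return [[[0.0] * n_actions for _ in range(height)] for _ in range(width)]


def q_update(q, x, y, action, reward, next_x, next_y, alpha=0.1, gamma=0.9):
    """Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s', a'))."""
    best_next = max(q[next_x][next_y])  # max over actions in the next state
    q[x][y][action] = (1 - alpha) * q[x][y][action] + alpha * (reward + gamma * best_next)
    return q[x][y][action]
```

A penalty is applied by calling q_update with a negative reward, for example the fixed human penalty of -0.5 mentioned above.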

[中間エージェントが更新ルールを学習エージェントに送るフロー：Ｓ2741に対応]（図31）
（１）S3110で、Q値の時間的更新範囲t_p、t_nを照合する。
（２）S3130で、パラメータt_p、t_nを加減することによって、更新範囲を変更する。 [Flow for intermediate agent sending update rule to learning agent: Corresponding to S2741] (Fig. 31)
(1) In S3110, the Q value temporal update ranges t _p and t _n are collated.
(2) In S3130, the update range is changed by adjusting the parameters t _p and t _n .

[学習エージェントが履歴から学習するフロー：Ｓ2745に対応]（図32）
（１）S3210で、ペナルティが入った時の状態、行動、の対、時刻tを照合する。ここで、照合とは、履歴から学習するときは、失敗したときに行うが、Kheperaロボットの場合は、同時にいくつかのｔ（時刻）に対してＱ値を更新する。このｔ値の取り方次第で、オペレータによって全く結果が異なる。一度行った学習を、一旦クリアして、違うｔの幅でＱ値を更新するときに、どのｔの時に、オペレータが不快と感じたかを知る必要がある。オペレータが不快と感じた時刻ｔをリストアップする、というのが、照合に相当する。
一度更新したＱ値をきれいに、なかったことにして、そのエピソードのスタート状態に戻して、実際にはいつペナルティーが入っていたのかをチェックして、そのペナルティーが入った時刻に、別のｔの幅でＱ値を再更新する。
（２）S3230で、現在のエピソードで更新されたQ値を1つ前のエピソード開始時の状態に戻す。
本ステップにおいては、学習エージェントが一度もゴールに到達しなかったかを、中間エージェントが判断し、一度もゴールに到達しなかったと判断された場合に、中間エージェントが、現在の試行で更新された前記学習エージェントが持つＱ値を、一つ前の試行開始時の状態に戻す。
一方、過去にゴールに到達したと判断された場合には、中間エージェントが、現在の試行で更新された学習エージェントの持つＱ値を、直近にゴールに到達した試行終了後のQ値の状態に戻してから、上記Ｓ2743で更新された、新たな更新ルールで、次の試行のＱ値を更新する。
以上を更に詳説する。、図3３を参照頂きたい。
Q値は、ステップごとに更新される。ただし、Q値がゼロの場合は、いつまで更新式に代入した所でゼロのままである。Q値に値が入るのは、ゴールに到達した場合の報酬による値とペナルティによる値で、今回はペナルティのみによって報酬が入る。よって、センサが障害物を感知した時と人からのペナルティが与えられるとQ値に値が入る。
一度値が入ると、そのまわりのQ値がどんどん更新されて行く（3301）。しかし、更新ルールと人とのタイミングが合ってない場合、ゴールに到達しない。この間に更新されてしまったQ値（3305）を、前回ゴールに到達した状態(初期状態の可能性もある)（3307）に一度戻し（これが直近にゴールに到達した試行終了後のＱ値）、失敗したエピソードでの更新（3305）をなかったことにする。
ただし、どのタイミングでペナルティが入ったかは、記録にとってあり、あらたに適用した更新ルール（Ｓ2743）で、もう一度ペナルティの入ったタイミングにおいてQ値の更新する、ということである。 [Flow for learning agent to learn from history: corresponds to S2745] (Figure 32)
(1) In S3210, the state-action pair and the time t at which each penalty was given are looked up. Learning from the history is performed after a failure; with the Khepera robot, the Q values are updated for several times t at once, and depending on how this range of t is chosen, the result differs completely from operator to operator. When learning that has already been done is cleared and the Q values are updated again with a different width of t, it is necessary to know at which times t the operator felt uncomfortable. Listing the times t at which the operator felt uncomfortable is what is meant by this lookup.
The Q values that were updated are wiped clean as if the updates had never happened and returned to the state at the start of the episode; the times at which penalties were actually given are then checked, and at those times the Q values are updated again with a different width of t.
(2) In S3230, the Q value updated in the current episode is returned to the state at the start of the previous episode.
In this step, the intermediate agent determines whether the learning agent has never reached the goal, and if it is determined that the learning agent has never reached the goal, the intermediate agent is updated in the current trial. The Q value of the learning agent is returned to the state at the start of the previous trial.
On the other hand, when it is determined that the goal has been reached in the past, the intermediate agent changes the Q value of the learning agent updated in the current trial to the state of the Q value after the end of the trial that reached the goal most recently. After returning, the Q value of the next trial is updated with the new update rule updated in S2743.
The above will be described in further detail. Please refer to Figure 33.
The Q value is updated at every step. However, a Q value that is zero remains zero until a nonzero value enters the update formula. The values that enter the Q table come from the reward given when the goal is reached and from penalties; in this example, values are entered only through penalties. Therefore, a value is entered into the Q table when the sensor detects an obstacle and when a penalty is given by the person.
Once a value is entered, the Q values around it are updated progressively (3301). However, if the timing of the update rule does not match the timing of the person, the goal is not reached. The Q values updated during this period (3305) are returned to the state in which the goal was last reached, that is, the Q values after the end of the trial that most recently reached the goal (this may be the initial state) (3307), as if the updates (3305) in the failed episode had not occurred.
However, the times at which penalties were entered are kept on record, and the Q values are updated again at those times according to the newly applied update rule (S2743).
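
The rollback described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the class name `QHistory`, its method names, and the use of a dictionary keyed by `(x, y, action_num)` for the Q table are all hypothetical. The sketch also assumes that, while the goal has never been reached, every failed trial is rolled back, so the state at the start of the current trial is the same as the state at the start of the previous one.

```python
class QHistory:
    """Hypothetical bookkeeping for the rollback; names are not from the patent text."""

    def __init__(self, q_table):
        self.q = q_table                  # dict mapping (x, y, action_num) -> Q value
        self.trial_start = dict(q_table)  # snapshot taken when the current trial began
        self.last_success = None          # snapshot after the latest trial that reached the goal
        self.penalty_times = []           # times at which penalties were entered (kept on record)

    def begin_trial(self):
        self.trial_start = dict(self.q)
        self.penalty_times = []

    def record_penalty(self, t):
        self.penalty_times.append(t)

    def end_trial(self, reached_goal):
        if reached_goal:
            # Remember the Q values of the most recent successful trial.
            self.last_success = dict(self.q)
            return
        # Failed trial: discard this trial's updates, as if they had not occurred.
        # Never reached the goal -> restore the trial-start snapshot;
        # otherwise -> restore the snapshot of the latest successful trial.
        snapshot = self.trial_start if self.last_success is None else self.last_success
        self.q.clear()
        self.q.update(snapshot)
```

After a failed trial, `penalty_times` still holds the recorded penalty times, so the Q values can be updated again at those times under the new update rule before the next call to `begin_trial`.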

再度図３２に戻る。
（３）S3250で、新たに更新された更新ルールの範囲t_p<t<t_nでQ値を更新する。更新は、以下の式による。
（数１０）
Q(s,a)←(1-α)Q(s,a)+α[r+γmax_a'Q(s_t+1,a_t+1)]
ここで、ｓは状態、ａは動作、Ｑ（ｓ，ａ）は、状態がｓ、動作がａのときのＱ値、αはステップサイズパラメータ、γは割引率、Ｑ（s_t+1,a_t+1）は、状態がｓ_t+1、動作がａ_t+1のときのＱ値を意味する。「状態がｓ_t+1、動作がａ_t+1のとき」とは、次に取りうる状態、行動、の対を意味する。
rは外部環境が与える報酬のことで、ペナルティの場合は通常マイナス、報酬の場合はプラスの値を与える。α、γの値の範囲は0≦α、γ≦1となり、α=γである必要は無い。
max_a'は、次に取ることが可能な全ての状態、行動、の対において最もＱ値が最大になるような行動を選択することを意味する。max_a'Q(s_t+1,a_t+1)は、次のステップで取りうる全ての状態、行動、の対においてＱ値が最大となるような行動ａを取ったときのＱ値である。
本実施例では、Q（[x],[y],[action_num]) =Ｑ(s,a)、のｓ（状態）は自身の中心座標であり、行動は自分の行動であり、[action_num]に相当する。つまり、Q（[ｘ]，[ｙ],[action_num])の３次元配列である。
agent_numはKheperaロボットのＤＣモータ付きのタイヤ、action_numは４方向なので、いずれも４つの値を取る。
同じＱ-Learningを使っているが、状態、行動の取り方は、Kheperaとピアノ問題で大分違う。 Returning again to FIG. 32.
(3) In S3250, the Q value is updated within the range $t_p < t < t_n$ of the newly changed update rule. The update uses the following formula.
(Equation 10)
$$Q(s, a) \leftarrow (1-\alpha)\,Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s_{t+1}, a_{t+1})\right]$$
Here, $s$ is the state, $a$ is the action, $Q(s, a)$ is the Q value when the state is $s$ and the action is $a$, $\alpha$ is the step size parameter, $\gamma$ is the discount rate, and $Q(s_{t+1}, a_{t+1})$ is the Q value when the state is $s_{t+1}$ and the action is $a_{t+1}$. "When the state is $s_{t+1}$ and the action is $a_{t+1}$" refers to the state-action pair that can be taken next.
$r$ is the reward given by the external environment; it is usually negative for a penalty and positive for a reward. The values of $\alpha$ and $\gamma$ lie in the range 0 ≦ α, γ ≦ 1, and $\alpha$ need not equal $\gamma$.
$\max_{a'}$ means selecting the action that maximizes the Q value among all state-action pairs that can be taken next. $\max_{a'} Q(s_{t+1}, a_{t+1})$ is the Q value obtained when the action $a$ that maximizes the Q value among all state-action pairs available in the next step is taken.
In this embodiment, with Q([x], [y], [action_num]) = Q(s, a), the state s is the agent's own center coordinates, and the action is the agent's own action, which corresponds to [action_num]. That is, Q is a three-dimensional array Q([x], [y], [action_num]).
agent_num corresponds to the DC-motor-driven tires of the Khepera robot and action_num to the four directions, so each takes four values.
Although the same Q-learning is used, the way states and actions are defined differs considerably between the Khepera problem and the piano problem.
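
As a rough illustration of Equation 10 applied to the three-dimensional array Q([x], [y], [action_num]), the following sketch performs a single update. The function names `make_q_table` and `q_update`, the nested-list representation, and the default values of `alpha` and `gamma` are assumptions chosen for the example; the text only requires 0 ≦ α, γ ≦ 1.

```python
def make_q_table(width, height, n_actions=4):
    """Build Q[x][y][action_num], initialized to zero (four directions by default)."""
    return [[[0.0] * n_actions for _ in range(height)] for _ in range(width)]


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply Equation 10 once and return the new Q value.

    state and next_state are (x, y) center coordinates; action is action_num.
    alpha and gamma are illustrative values, not taken from the source.
    """
    x, y = state
    nx, ny = next_state
    # max over the actions available in the next state
    best_next = max(q[nx][ny])
    q[x][y][action] = (1 - alpha) * q[x][y][action] + alpha * (reward + gamma * best_next)
    return q[x][y][action]


q = make_q_table(3, 3)
unchanged = q_update(q, (1, 1), 0, 0.0, (1, 2))   # zero reward on an all-zero table
penalized = q_update(q, (0, 0), 1, -1.0, (0, 1))  # a penalty enters a nonzero value
```

With an all-zero table and zero reward the entry stays at zero, which is why values enter the table only when a reward or a penalty is given.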

（実施例１および２の比較）
以上の実施例１（ピアノ問題）および実施例２（Kheperaロボット）の比較を示したものが、以下の表である。
（表４）比較表

このように、両者の最も大きな違いは、ピアノ問題では、人間が介在せず、時間的なあいまいさが存在しないのに対し、Kheperaロボットでは、教育エージェントが人間であるため、時間的あいまいさが存在する点にある。
そして、ピアノ問題において、空間的あいまいさを補正するためには、各エージェントが過去に行った学習(例えば、この地点で障害物にぶつかったためにペナルティーを受けたこと)によって、次の試行を、より成功し易くする、ことが必要である。
一方、Kheperaロボットで、時間的あいまいさを補正するためには、各エージェントが過去に行った学習をそのままダイレクトに次の試行で利用するのではなく、過去の試行で受けた報酬やペナルティーが与えられた時刻が誤っている可能性を考え、その時刻をずらしてみた場合の報酬・ペナルティーがどのようなものであったかを再評価し、その「ずらし」が最も適切な値となるように調整することが必要である。 (Comparison of Examples 1 and 2)
The following table shows a comparison between Example 1 (piano problem) and Example 2 (Khepera robot).
(Table 4) Comparison table

Thus, the biggest difference between the two is that the piano problem involves no human and therefore no temporal ambiguity, whereas with the Khepera robot the educational agent is a human, so temporal ambiguity exists.
In the piano problem, correcting the spatial ambiguity requires that the learning each agent performed in the past (for example, having received a penalty for hitting an obstacle at a given point) make the next trial more likely to succeed.
With the Khepera robot, on the other hand, correcting the temporal ambiguity requires more than reusing each agent's past learning directly in the next trial. The times at which rewards and penalties were given in past trials may be wrong, so it is necessary to re-evaluate what the rewards and penalties would have been had those times been shifted, and to adjust the shift to the most appropriate value.
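
The idea of shifting the time to which a penalty is attributed can be sketched with the update range $t_p < t < t_n$ and the rule change $t_p \leftarrow t_p - i$, $t_n \leftarrow t_n + j$ that appears in the claims. The sketch below is illustrative only: the function names, the `(t, state, action)` tuple layout of the history, and the choice `i = j = 1` are assumptions, and how the intermediate agent actually chooses i and j is not specified in this passage.

```python
def steps_in_window(history, t_p, t_n):
    """Return the recorded (t, state, action) steps whose time satisfies t_p < t < t_n."""
    return [step for step in history if t_p < step[0] < t_n]


def widen_window(t_p, t_n, i=1, j=1):
    """Change the update rule: t_p <- t_p - i, t_n <- t_n + j (i, j positive integers)."""
    if i <= 0 or j <= 0:
        raise ValueError("i and j must be positive integers")
    return t_p - i, t_n + j


# Hypothetical history of ten steps; the penalty was entered at t = 5.
history = [(t, (t, 0), 0) for t in range(10)]
t_p, t_n = 4, 6
before = [step[0] for step in steps_in_window(history, t_p, t_n)]
t_p, t_n = widen_window(t_p, t_n)
after = [step[0] for step in steps_in_window(history, t_p, t_n)]
```

Here the penalty is first attributed only to the step at t = 5; after the rule change it is also attributed to the neighboring steps, which allows for a person having pressed the penalty slightly early or late.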

以上、本発明を、具体的な実施例に基づいて説明したが、本発明の詳細な説明の範囲はこれに限定されず、特許請求の範囲内で種々の変更、追加、均等物との置換等が可能である。 Although the present invention has been described based on specific embodiments, the scope of the detailed description of the present invention is not limited to them, and various modifications, additions, substitutions with equivalents, and the like are possible within the scope of the claims.

本発明の実施の形態における、学習エージェント、教育エージェント、および中間エージェントの役割分担。 Division of roles among the learning agent, the educational agent, and the intermediate agent in the embodiment of the present invention.
本発明の実施の形態における、空間形状におけるあいまいさの説明。 Explanation of ambiguity in spatial shape in the embodiment of the present invention.
本発明の実施の形態における、時間的なあいまいさの説明。 Explanation of temporal ambiguity in the embodiment of the present invention.
本発明の実施例１における、通行可能な領域と通行不可能な障害物の説明。 Explanation of passable areas and impassable obstacles in Example 1 of the present invention.
本発明の実施例１における、学習エージェントと教育エージェントの概説。 Outline of the learning agents and the educational agent in Example 1 of the present invention.
本発明の実施例１における、各エージェントの学習モデル。 Learning model of each agent in Example 1 of the present invention.
本発明の実施例１における、中間エージェントの判断基準の相違による経路探索結果の比較。 Comparison of route search results under different judgment criteria of the intermediate agent in Example 1 of the present invention.
本発明の実施例１における、教育エージェントの学習手法の相違による経路探索結果の比較。 Comparison of route search results under different learning methods of the educational agent in Example 1 of the present invention.
本発明の実施例１における、経路探索が改善される様子。 How the route search improves in Example 1 of the present invention.
本発明の実施例１における、経路探索が改善される様子。 How the route search improves in Example 1 of the present invention.
本発明の実施例１における、探索木による経路探索の説明。 Explanation of route search using a search tree in Example 1 of the present invention.
本発明の実施例１における、探索木による経路探索の説明。 Explanation of route search using a search tree in Example 1 of the present invention.
本発明の実施例１における、全体フローチャート。 Overall flowchart in Example 1 of the present invention.
本発明の実施例１における、学習エージェントの移動方向。 Movement directions of the learning agent in Example 1 of the present invention.
本発明の実施例１における、各学習エージェントが意思決定するフローのフローチャート。 Flowchart of the flow in which each learning agent makes a decision in Example 1 of the present invention.
本発明の実施例１における、教育エージェントが意思決定するフローのフローチャート。 Flowchart of the flow in which the educational agent makes a decision in Example 1 of the present invention.
本発明の実施例１における、中間エージェントが新たな行動を生成するフローのフローチャート。 Flowchart of the flow in which the intermediate agent generates a new action in Example 1 of the present invention.
本発明の実施例１における、各学習エージェントが学習するフローのフローチャート。 Flowchart of the flow in which each learning agent learns in Example 1 of the present invention.
本発明の実施例１における、Ｑ値更新の説明。 Explanation of the Q value update in Example 1 of the present invention.
本発明の実施例１における、教育エージェントが学習するフローのフローチャート。 Flowchart of the flow in which the educational agent learns in Example 1 of the present invention.
本発明の実施例１における、Ｖ値更新の説明。 Explanation of the V value update in Example 1 of the present invention.
本発明の実施例２における、時間的なあいまいさを含む問題のタスクの例。 Example of a task in a problem involving temporal ambiguity in Example 2 of the present invention.
本発明の実施例２における、時間的なあいまいさを含む問題のタスクの例。 Example of a task in a problem involving temporal ambiguity in Example 2 of the present invention.
本発明の実施例２における、Kheperaロボット。 The Khepera robot in Example 2 of the present invention.
本発明の実施例２における、Kheperaロボットの移動範囲。 Movement range of the Khepera robot in Example 2 of the present invention.
本発明の実施例２における、実験環境１。 Experimental environment 1 in Example 2 of the present invention.
本発明の実施例２における、実験環境２。 Experimental environment 2 in Example 2 of the present invention.
本発明の実施例２における、Kheperaロボットの動作を観察した結果。 Results of observing the operation of the Khepera robot in Example 2 of the present invention.
本発明の実施例２における、Kheperaロボットの動作を観察した結果。 Results of observing the operation of the Khepera robot in Example 2 of the present invention.
本発明の実施例２における、Kheperaロボットの動作を観察した結果。 Results of observing the operation of the Khepera robot in Example 2 of the present invention.
本発明の実施例２における、教育エージェントによる経路探索結果の相違。 Differences in route search results depending on the educational agent in Example 2 of the present invention.
本発明の実施例２における、教育エージェントによる経路探索結果の相違。 Differences in route search results depending on the educational agent in Example 2 of the present invention.
本発明の実施例２における、中間エージェントによる経路探索結果の相違。 Differences in route search results depending on the intermediate agent in Example 2 of the present invention.
本発明の実施例２における、中間エージェントによる経路探索結果の相違。 Differences in route search results depending on the intermediate agent in Example 2 of the present invention.
本発明の実施例２における、中間エージェントによる経路探索結果の相違。 Differences in route search results depending on the intermediate agent in Example 2 of the present invention.
本発明の実施例２における、全体フローチャート。 Overall flowchart in Example 2 of the present invention.
本発明の実施例２における、学習エージェントが意思決定するフローのフローチャート。 Flowchart of the flow in which the learning agent makes a decision in Example 2 of the present invention.
本発明の実施例２における、学習エージェントが行動を取るフローのフローチャート。 Flowchart of the flow in which the learning agent takes an action in Example 2 of the present invention.
本発明の実施例２における、学習エージェントが学習するフローのフローチャート。 Flowchart of the flow in which the learning agent learns in Example 2 of the present invention.
本発明の実施例２における、中間エージェントが更新ルールを学習エージェントに送るフローのフローチャート。 Flowchart of the flow in which the intermediate agent sends the update rule to the learning agent in Example 2 of the present invention.
本発明の実施例２における、学習エージェントが履歴から学習するフローのフローチャート。 Flowchart of the flow in which the learning agent learns from the history in Example 2 of the present invention.
本発明の実施例２における、Ｑ値更新。 Q value update in Example 2 of the present invention.

Explanation of symbols

41 スタート地点
43 ゴール
45 教育エージェント
47ａ学習エージェント
47ｂ学習エージェント
47ｃ学習エージェント
47ｄ学習エージェント
51 教育エージェント
53ａ学習エージェント
53ｂ学習エージェント
53ｃ学習エージェント
53ｄ学習エージェント
2201 センサ
2203 センサ
2205 移動用モータ
2217 １ステップ上方向に移動した場合の位置
41 Starting point
43 Goal
45 Educational agent
47a Learning agent
47b Learning agent
47c Learning agent
47d Learning agent
51 Educational agent
53a Learning agent
53b Learning agent
53c Learning agent
53d Learning agent
2201 Sensor
2203 Sensor
2205 Motor for movement
2217 Position after moving one step upward

Claims

In a system that includes at least an educational agent, a learning agent, and an intermediate agent implemented by a computer, and in which a discrepancy exists between (a) the position recognized by each agent at a predetermined time, the size of each agent as recognized by itself, and the recognized road width that the learning agent can pass, and (b) the objective position of each agent, the objective size of each agent, and the objective road width that the learning agent can pass, a method for controlling the learning agent from a start position to a goal position by having each agent learn autonomously, comprising:
A learning agent decision process (1120) in which the learning agent decides what action to take,
An educational agent decision-making process (1120) in which an educational agent decides what action to take,
An action transmission step (130) in which the learning agent and the educational agent send the decision making of the learning agent and the educational agent to the intermediate agent, respectively,
A decision step (1140) in which an intermediate agent judges whether the decision making of the learning agent and the decision making of the education agent are the same;
An action instruction step (1150) in which, when the decision of the learning agent is the same as the decision of the educational agent, the intermediate agent sends the action decided by the learning agent to the learning agent as the action to be taken, and when the two decisions differ, the intermediate agent sends to the learning agent a new action to be taken that the intermediate agent has determined according to rules;
An action execution step (1160) in which the learning agent executes the action to be taken;
A goal attainment determining step (1190) in which each of the learning agent, the intermediate agent, and the educational agent determines whether the learning agent has reached a goal;
A learning agent reinforcement learning execution step (1180) and an educational agent reinforcement learning execution step (1180) in which, when the goal has not been reached within the designated number of steps, the learning agent and the educational agent respectively perform reinforcement learning;
a hierarchical agent learning method that includes the above steps.

In the learning agent decision making step (1120),
A learning agent Q-value table value observing step in which each learning agent observes a value of a Q-value table that the learning agent has within a set observation range;
An action selection step in which the learning agent selects an action direction having the highest Q value with a probability (1-ε), and selects a random action direction with a probability ε;
An action synthesis step in which the intermediate agent synthesizes the direction of action selected by each learning agent;
Included,
The hierarchical agent learning method according to claim 1.

In the educational agent decision making process (1120),
An educational agent V-value table value observing step in which the educational agent observes a value of a V-value table that the educational agent has within a set observation range;
An action selection step in which the educational agent selects an action direction having the highest V value with probability (1-ε), and selects a random action direction with probability ε;
Included,
The hierarchical agent learning method according to claim 1.

A first action generation step in which the intermediate agent selects the action of the learning agent having the highest Q value among the actions indicated by the education agent in the action instruction step (1150);
If the goal is not reached after a predetermined number of trials, the direction and Q value indicated by the learning agent are set as a first vector, the direction and V value indicated by the education agent are set as a second vector, and the intermediate agent A second action generating step of selecting a direction in which the first and second vectors are combined;
Included,
The hierarchical agent learning method according to claim 1.

In the learning agent reinforcement learning execution step (1180),
A first reward granting step in which, when each learning agent reaches the goal, an agent other than each learning agent and the external environment give each learning agent a first predetermined reward;
A second reward granting step in which, when each learning agent has not reached the goal and has not collided with an obstacle, an agent other than each learning agent and the external environment give each learning agent a second predetermined reward;
A third reward granting step in which, when each learning agent has not reached the goal and has collided with an obstacle, each learning agent and the external environment give each learning agent a third predetermined reward;
A Q value updating step in which, after each reward granting step, each learning agent updates all observed Q values according to the respective reward value;
Included,
The hierarchical agent learning method according to claim 1.

In the educational agent reinforcement learning execution step (1180),
A first reward granting step in which the external environment gives the education agent a first predetermined reward when each learning agent reaches the goal;
A second reward granting step in which the external environment gives the education agent a second predetermined reward when each learning agent does not reach the goal and does not collide with an obstacle;
A third reward granting step in which the external environment gives the education agent a third predetermined reward when each learning agent does not reach the goal and collides with an obstacle;
A V value updating step in which, after each reward granting step, the educational agent updates all observed V values according to the respective reward value;
Included,
The hierarchical agent learning method according to claim 1.

In the Q value update process,
$$Q(s, a) \leftarrow (1-\alpha)\,Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s_{t+1}, a_{t+1})\right]$$
(Here, $s$ is the state, $a$ is the action, $Q(s, a)$ is the Q value when the state is $s$ and the action is $a$, $\alpha$ is the step size parameter, and $\gamma$ is the discount rate; the values of $\alpha$ and $\gamma$ lie in the range 0 ≦ α, γ ≦ 1, and $\alpha$ need not equal $\gamma$. $r$ is the reward given by the external environment, usually a negative value for a penalty and a positive value for a reward. $Q(s_{t+1}, a_{t+1})$ is the Q value when the state is $s_{t+1}$ and the action is $a_{t+1}$, where "the state is $s_{t+1}$ and the action is $a_{t+1}$" refers to the state-action pair that can be taken next. $\max_{a'}$ means selecting the action that maximizes the Q value among all state-action pairs that can be taken next, and $\max_{a'} Q(s_{t+1}, a_{t+1})$ is the Q value obtained when the action $a$ that maximizes the Q value among all state-action pairs available in the next step is taken.)
A first Q value updating step, in which the learning agent updates the Q value according to the above formula, is included.
The hierarchical agent learning method according to claim 5.

In the V value update process,
$$V(s_t) \leftarrow V(s_t) + \alpha\left[r_t + \gamma V(s_{t+1}) - V(s_t)\right]$$
(Here, $s_t$ is the state $s$ at time $t$, $s_{t+1}$ is the state $s$ at time $t+1$, $V(s_t)$ is the V value in state $s_t$, $V(s_{t+1})$ is the V value in state $s_{t+1}$, $\alpha$ is the step size parameter, and $\gamma$ is the discount rate; the values of $\alpha$ and $\gamma$ lie in the range 0 ≦ α, γ ≦ 1, and $\alpha$ need not equal $\gamma$. $r_t$ is the reward given by the external environment, usually a negative value for a penalty and a positive value for a reward. In this embodiment, $V(s)$ is a two-dimensional array $V(x, y)$, where $(x, y)$ is the center coordinate of the educational agent.)
A first V value updating step, in which the educational agent updates the V value according to the above formula, is included.
The hierarchical agent learning method according to claim 6.

In a system that includes at least an educational agent that is a human or a robot, and a learning agent and an intermediate agent implemented by a computer, and in which a discrepancy exists between the objective time and the time recognized by each agent at a given spatial location, a method for controlling the learning agent from a start position to a goal position by having each agent learn autonomously, in which:
An environment observation step (2715) in which the learning agent observes the environment;
An obstacle confirmation step (2717) in which the learning agent determines whether there is an obstacle ahead;
A first reward update step (2719) in which, when it is determined in the obstacle confirmation step that an obstacle is present, an external environment or an educational agent gives a negative reward to the learning agent;
A first reinforcement learning execution step (2721) in which the learning agent performs reinforcement learning, after the first reward update step if it is determined in the obstacle confirmation step that an obstacle exists, and otherwise after the obstacle confirmation step;
A decision step (2723) for determining a direction in which the learning agent moves;
An action execution step (2725) in which the learning agent takes an action with the amount of movement determined in the decision making step;
A penalty determining step (2727) in which the learning agent determines whether, as a result of the action, it has received a negative reward from the educational agent;
A second reward update step (2729) that, when the learning agent has received the negative reward, prompts the educational agent, which is a person or robot recognized as part of the environment, to input the negative reward value to be added to the reward value already set for the learning agent;
A second reinforcement learning execution step (2731) in which the learning agent performs reinforcement learning, after the second reward update step if the learning agent received a negative reward in the penalty determining step, and otherwise after the penalty determining step;
A goal attainment determining step (2737, 2735) for determining whether or not the learning agent has reached a goal;
A reward update rule changing step (2741) in which, if the learning agent has not reached the goal, the above steps are repeated a predetermined number of times, and if the goal has still not been reached, the intermediate agent changes the timing at which the reward is given to the learning agent in the first reward update step and the second reward update step; and
A history learning step (2745) in which the learning agent learns from the history;
are included,
Hierarchical agent learning method.

In the decision making process (2723)
A Q table observation process in which the learning agent observes the value of the Q table;
an action selection process in which the learning agent selects only the action with the highest Q value according to the greedy strategy;
Included,
The hierarchical agent learning method according to claim 9.

In the action execution step (2725),
A rotation process for rotating the learning agent by a predetermined angle;
A movement step of causing the learning agent to advance by a predetermined movement amount after the end of the rotation;
are included,
The hierarchical agent learning method according to claim 9 or 10.

In the first reinforcement learning execution step (2721) and / or the second reinforcement learning execution step (2731),
the Q value is updated, in the range $t_p < t < t_n$, according to the formula
$$Q(s, a) \leftarrow (1-\alpha)\,Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s_{t+1}, a_{t+1})\right]$$
(Here, $s$ is the state, $a$ is the action, $Q(s, a)$ is the Q value when the state is $s$ and the action is $a$, $\alpha$ is the step size parameter, and $\gamma$ is the discount rate; the values of $\alpha$ and $\gamma$ lie in the range 0 ≦ α, γ ≦ 1, and $\alpha$ need not equal $\gamma$. $r$ is the reward given by the external environment, usually a negative value for a penalty and a positive value for a reward. $Q(s_{t+1}, a_{t+1})$ is the Q value when the state is $s_{t+1}$ and the action is $a_{t+1}$, where "the state is $s_{t+1}$ and the action is $a_{t+1}$" refers to the state-action pair that can be taken next. $\max_{a'}$ means selecting the action that maximizes the Q value among all state-action pairs that can be taken next, and $\max_{a'} Q(s_{t+1}, a_{t+1})$ is the Q value obtained when the action $a$ that maximizes the Q value among all state-action pairs available in the next step is taken.)
The hierarchical agent learning method according to claim 9.

In reward update rule change process (2741),
A temporal update range matching step of matching the temporal update range of the Q value, given by $t_p$ and $t_n$;
An update range changing step of changing the update range by adjusting the parameters $t_p$ and $t_n$ based on the formula
$$t_p \leftarrow t_p - i, \qquad t_n \leftarrow t_n + j$$
(here, $i$ and $j$ are positive integers, and $t_p < t_n$ is satisfied);
are included,
The hierarchical agent learning method according to claim 9.

In the history learning step (2745),
A time collating step in which the intermediate agent collates the time $t$ of the state-action pair at which the negative reward for the learning agent was entered;
A goal attainment determining step in which the intermediate agent determines whether the learning agent has never reached the goal;
A first Q value return step in which, when it is determined in the goal attainment determining step that the goal has never been reached, the intermediate agent returns the Q value of the learning agent updated in the current trial to its state at the start of the previous trial;
A second Q value return step in which, when it is determined in the goal attainment determining step that the goal has been reached in the past, the intermediate agent returns the Q value of the learning agent updated in the current trial to the state of the Q value after the end of the trial that most recently reached the goal, and the Q value of the next trial is then updated with the new update rule changed in the reward update rule changing step;
A Q value updating step in which, after the first or second Q value return step, the learning agent updates the Q value in the updated range $t_p < t < t_n$ according to the formula
$$Q(s, a) \leftarrow (1-\alpha)\,Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s_{t+1}, a_{t+1})\right]$$
(Here, $s$ is the state, $a$ is the action, $Q(s, a)$ is the Q value when the state is $s$ and the action is $a$, $\alpha$ is the step size parameter, and $\gamma$ is the discount rate; the values of $\alpha$ and $\gamma$ lie in the range 0 ≦ α, γ ≦ 1, and $\alpha$ need not equal $\gamma$. $r$ is the reward given by the external environment, usually a negative value for a penalty and a positive value for a reward. $Q(s_{t+1}, a_{t+1})$ is the Q value when the state is $s_{t+1}$ and the action is $a_{t+1}$, where "the state is $s_{t+1}$ and the action is $a_{t+1}$" refers to the state-action pair that can be taken next. $\max_{a'}$ means selecting the action that maximizes the Q value among all state-action pairs that can be taken next, and $\max_{a'} Q(s_{t+1}, a_{t+1})$ is the Q value obtained when the action $a$ that maximizes the Q value among all state-action pairs available in the next step is taken.);
are included,
The hierarchical agent learning method according to claim 9.

A system that includes at least an educational agent, a learning agent, and an intermediate agent implemented by a computer, in which a discrepancy exists between (a) the position recognized by each agent at a predetermined time, the size of each agent as recognized by itself, and the recognized road width that the learning agent can pass, and (b) the objective position of each agent, the objective size of each agent, and the objective road width that the learning agent can pass, and which controls the learning agent from a start position to a goal position by having each agent learn autonomously, in which:
The learning agent is
A learning agent decision-making tool that decides what action to take,
Learning agent action transmission means for the learning agent to send a decision of the learning agent to the intermediate agent;
The educational agent
An educational agent decision-making tool that decides what action to take,
Action transmitting means for sending the educational agent's decision to the intermediate agent by the educational agent;
Have
The intermediate agent is
Decision making means for judging whether the decision of the learning agent and the decision of the education agent are the same;
Action instruction means for sending to the learning agent, when the decision of the learning agent and the decision of the educational agent are the same, the action to be taken as decided by the learning agent, and, when the two decisions differ, a new action to be taken that the intermediate agent has determined according to rules;
Have
The learning agent further includes action execution means for executing the action to be taken,
The learning agent, the intermediate agent, and the education agent each further have a goal attainment judging means for judging whether or not the learning agent has reached a goal,
The learning agent and the educational agent respectively further have learning agent reinforcement learning execution means and educational agent reinforcement learning execution means for performing reinforcement learning when the goal has not been reached within the specified number of steps,
Hierarchical agent learning system.

A system that includes at least an educational agent that is a human or a robot, and a learning agent and an intermediate agent implemented by a computer, in which a discrepancy exists between the objective time and the time recognized by each agent at a given spatial location, and which controls the learning agent from a start position to a goal position by having each agent learn autonomously, in which:
The learning agent is
Environmental observation means for observing the environment;
Obstacle confirmation means for judging whether there is an obstacle ahead,
A first reward updating means for receiving a negative reward from the external environment or an educational agent when the obstacle confirmation means determines that an obstacle is present;
First reinforcement learning execution means by which the learning agent performs reinforcement learning, after receiving the first reward when the obstacle confirmation means determines that an obstacle exists, and otherwise after the confirmation by the obstacle confirmation means;
A decision-making means for determining the direction of movement;
An action execution means for taking an action with the amount of movement determined by the decision making means;
Penalty judging means for judging whether or not a negative reward has been received from the educational agent as a result of the action,
Second reward update means for prompting, when the learning agent has received the negative reward, the educational agent, which is a person or robot recognized as part of the environment, to add the negative reward value to the reward value already set for the learning agent;
Second reinforcement learning execution means by which the learning agent performs reinforcement learning, after the negative reward value is added by the second reward update means when the penalty determining means determines that a negative reward has been received, and otherwise after the determination by the penalty determining means;
Goal attainment judging means for judging whether or not the learning agent has reached the goal;
Have
The educational agent
Has reward update rule changing means by which, when the learning agent has not reached the goal, the control by the above means is repeated a predetermined number of times, and when the goal has still not been reached, the intermediate agent changes the timing at which the reward is given to the learning agent in the first reward update means and the second reward update means,
Furthermore, the intermediate agent
Has history update preparation means for preparing the Q value to be used in the next trial, according to a determination as to whether the learning agent has reached the goal in the past,
The learning agent is
Has history update means for updating the Q value using the Q value obtained by the history update preparation means;
A hierarchical agent learning system.