Policy¶
Policy.py - abstract class for all policies¶
Copyright CUED Dialogue Systems Group 2015 - 2017
See also
CUED Imports/Dependencies:
import utils.Settings
import utils.DiaAct
import utils.ContextLogger
import ontology.OntologyUtils
import policy.SummaryAction
- class policy.Policy.Action(action)¶
Dummy class representing one action. Used for recording and may be overridden by sub-class.
- class policy.Policy.Episode(dstring=None)¶
An episode encapsulates the state-action-reward triplets which may be used for learning. Every entry represents one turn. The last entry should contain TerminalState and TerminalAction.
- check()¶
Checks whether the lengths of the internal state, action, and reward lists are equal.
- getWeightedReward()¶
Returns the reward weighted by normalised accumulated weights. Used for multiagent learning in committee.
- Returns
the reward weighted by normalised accumulated weights
- record(state, action, reward, ma_weight=None)¶
Stores the state, action, and reward in the internal lists.
- tostring()¶
Prints state, action, and reward lists to screen.
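A rough usage sketch, assuming the PyDial policy package is importable; the domain tag and belief dictionaries are placeholders, and the dummy wrapper classes documented below are used for the entries (whether raw or encapsulated states are stored depends on the calling policy):

    from policy.Policy import Episode, State, Action, TerminalState, TerminalAction

    episode = Episode(dstring='CamRestaurants')   # illustrative domain tag
    # one entry per turn: belief state, system action, turn reward
    episode.record(State({'area': {'north': 0.8}}), Action('request(food)'), reward=-1)
    episode.record(State({'area': {'north': 0.9}}), Action('inform(food="chinese")'), reward=-1)
    # the last entry carries the terminal markers and the final reward
    episode.record(TerminalState(), TerminalAction(), reward=20)
    episode.check()      # state/action/reward lists must have equal length
    episode.tostring()   # print the recorded lists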
- class policy.Policy.EpisodeStack(block_size=100)¶
A handler for episodes. Required if the stack is to become very large: we may not want to hold all episodes in memory, but write them out to file.
- add_episode(domain_episodes)¶
Items on the stack are dictionaries of episodes for each domain (since with BCM we can learn from two or more domains if a multi-domain dialogue occurs).
- retrieve_episode(episode_key)¶
NB: this should probably be an iterator, using yield, rather than return
- class policy.Policy.Policy(domainString, learning=False, specialDomain=False)¶
Interface class for a single domain policy. Responsible for selecting the next system action and handling the learning of the policy.
To create your own policy model or to change the state representation, derive from this class.
- act_on(state)¶
Main policy method: mapping of belief state to system action.
This method is automatically invoked by the agent at each turn after tracking the belief state.
May initially return ‘hello()’ as a hardcoded action. Keeps track of the last system action and the last belief state.
- Parameters
state (DialogueState) – the belief state to act on
hyps (list) – n-best list of semantic interpretations
- Returns
the next system action of type DiaAct
- convertStateAction(state, action)¶
Converts the given state and action to policy-specific representations.
By default, the generic classes State and Action are used. To change this, override this method in a sub-class.
- Parameters
state (anything) – the state to be encapsulated
action (anything) – the action to be encapsulated
- finalizeRecord(reward, domainInControl=None)¶
Records the final reward along with the terminal system action and terminal state. To change the type of state/action, override convertStateAction(). This method is automatically executed by the agent at the end of each dialogue.
- Parameters
reward (int) – the final reward
domainInControl (str) – used by committee: the unique domain string identifier of the domain this dialogue originates in (optional)
- Returns
None
- nextAction(beliefstate)¶
Interface method for selecting the next system action. Should be overridden by sub-class.
This method is automatically executed by act_on() and is thus called at each turn.
- Parameters
beliefstate (dict) – the state the policy acts on
- Returns
the next system action
- record(reward, domainInControl=None, weight=None, state=None, action=None)¶
Records the current turn reward along with the last system action and belief state.
This method is automatically executed by the agent at the end of each turn.
To change the type of state/action, override convertStateAction(). By default, the last master action is recorded. If you want another action to be recorded, e.g., a summary action, assign the respective object to self.actToBeRecorded in a derived class.
- Parameters
reward (int) – the turn reward to be recorded
domainInControl (str) – the domain string unique identifier of the domain the reward originates in
weight (float) – used by committee: the weight of the reward in case of multiagent learning
state (dict) – used by committee: the belief state to be recorded
action (str) – used by committee: the action to be recorded
- Returns
None
- restart()¶
Restarts the policy. Resets internal variables.
This method is automatically executed by the agent at the end/beginning of each dialogue.
- savePolicy(FORCE_SAVE=False)¶
Saves the learned policy model to file. Should be overridden by sub-class.
This method is automatically executed by the agent either at certain intervals or at least before shutting down the agent.
- Parameters
FORCE_SAVE (bool) – used to force cleaning up of any learning and saving when we are powering off an agent.
- train()¶
Interface method for initiating the training. Should be overridden by sub-class.
This method is automatically executed by the agent at the end of each dialogue if learning is True.
This method is called at the end of each dialogue by PolicyManager if learning is enabled for the given domain policy.
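A schematic sub-class sketch of the interface described above, assuming the PyDial policy package is importable; the returned act and the method bodies are placeholders, only the method names follow this documentation:

    import policy.Policy as Policy

    class MyEchoPolicy(Policy.Policy):
        """Toy policy illustrating which methods a sub-class typically overrides."""

        def __init__(self, domainString, learning=False):
            super(MyEchoPolicy, self).__init__(domainString, learning)

        def nextAction(self, beliefstate):
            # called by act_on() at every turn; map the belief to the next system action
            # (a real policy would inspect `beliefstate` here)
            return 'request(area)'

        def train(self):
            # called at the end of each dialogue if learning is enabled;
            # update the model from the recorded episodes here
            pass

        def savePolicy(self, FORCE_SAVE=False):
            # called at intervals and before shutdown; persist the model to file here
            pass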
- class policy.Policy.State(state)¶
Dummy class representing one state. Used for recording and may be overridden by sub-class.
- class policy.Policy.TerminalAction¶
Dummy class representing one terminal action. Used for recording and may be overridden by sub-class.
- class policy.Policy.TerminalState¶
Dummy class representing one terminal state. Used for recording and may be overridden by sub-class.
PolicyManager.py - container for all policies¶
Copyright CUED Dialogue Systems Group 2015 - 2017
See also
CUED Imports/Dependencies:
import utils.Settings
import utils.ContextLogger
import ontology.Ontology
import ontology.OntologyUtils
- class policy.PolicyManager.PolicyManager¶
The policy manager manages the policies for all domains.
It provides the interface to get the next system action based on the current belief state in act_on() and to initiate the learning of the policy in train().
- _check_committee(committee)¶
Safety tool - should check some logical requirements on the list of domains given by the config
- Parameters
committee (PolicyCommittee) – the committee to be checked
- _load_committees()¶
Loads and instantiates the committee as configured in config file. The new object is added to the internal dictionary.
- _load_domains_policy(domainString=None)¶
Loads and instantiates the respective policy as configured in config file. The new object is added to the internal dictionary.
Default is ‘hdc’.
- Parameters
domainString (str) – the domain the policy will work on. Default is None.
- Returns
the new policy object
- act_on(dstring, state)¶
Main policy method which maps the provided belief to the next system action. This is called at each turn by DialogueAgent.
- Parameters
dstring (str) – the domain string unique identifier.
state (DialogueState) – the belief state the policy should act on
- Returns
the next system action as DiaAct
- bootup(domainString)¶
Loads a policy for a given domain.
- finalizeRecord(domainRewards)¶
Records the final rewards of all domains. In case of a committee, the recording is delegated.
This method is called once at the end of each dialogue by the DialogueAgent. (One dialogue may contain multiple domains.)
- Parameters
domainRewards (dict) – a dictionary mapping from domains to final rewards
- Returns
None
- getLastSystemAction(domainString)¶
Returns the last system action of the specified domain.
- Parameters
domainString (str) – the domain string unique identifier.
- Returns
the last system action of the given domain or None
- printEpisodes()¶
Prints the recorded episode of the current dialogue.
- record(reward, domainString)¶
Records the current turn reward for the given domain. In case of a committee, the recording is delegated.
This method is called each turn by the DialogueAgent.
- Parameters
reward (int) – the turn reward to be recorded
domainString (str) – the domain string unique identifier of the domain the reward originates in
- Returns
None
- restart()¶
Restarts all policies of all domains and resets internal variables.
- savePolicy(FORCE_SAVE=False)¶
Initiates saving of the policies of all domains.
- Parameters
FORCE_SAVE (bool) – used to force cleaning up of any learning and saving when we are powering off an agent.
- train(training_vec=None)¶
Initiates the training for the policies of all domains. This is called at the end of each dialogue by DialogueAgent.
PolicyCommittee.py - implementation of the Bayesian committee machine for dialogue management¶
Copyright CUED Dialogue Systems Group 2015 - 2017
See also
CUED Imports/Dependencies:
import utils.Settings
import utils.ContextLogger
import utils.DiaAct
- class policy.PolicyCommittee.CommitteeMember¶
Base class defining the interface methods which are needed in addition to the basic functionality provided by Policy. Committee members should derive from this class.
- abstract_actions(actions)¶
Converts a list of domain acts to their abstract form
- Parameters
actions (list of actions) – the actions to be abstracted
- getMeanVar_for_executable_actions(belief, abstracted_currentstate, nonExecutableActions)¶
Computes the mean and variance of the Q value based on the abstracted belief state for each executable action.
- Parameters
belief (dict) – the unabstracted current domain belief
abstracted_currentstate (State or subclass) – the abstracted current belief
nonExecutableActions (list) – actions which are not selected for execution based on a heuristic
- getPriorVar(belief, act)¶
Returns prior variance for a given belief and action
- Parameters
belief (dict) – the unabstracted current domain belief state
act (str) – the unabstracted action
- get_Action(action)¶
Converts the unabstracted domain action into an abstracted action to be used for multiagent learning.
- Parameters
action (str) – the last system action
- get_State(beliefstate, keep_none=False)¶
Converts the unabstracted domain state into an abstracted belief state to be used with getMeanVar_for_executable_actions().
- Parameters
beliefstate (dict) – the unabstracted belief state
- unabstract_action(actions)¶
Converts a list of abstract acts to their domain form
- Parameters
actions (list of actions) – the actions to be unabstracted
- class policy.PolicyCommittee.PolicyCommittee(policyManager, committeeMembers, learningmethod)¶
Manages everything related to the policy committee. All committee members must inherit from Policy and CommitteeMember.
- _bayes_committee_calculator(domainQs, priors, domainInControl, scale)¶
Given means and variances of committee members - forms the Bayesian committee distribution for each action, draws sample from each, returns act with highest sample.
Note
this implementation is probably slow – domainQs could be reformatted and this redone via matrices and slicing
- Parameters
domainQs (dict of domains and dict of actions and dict of variance/mu and values) – the means and variances of all Q-value estimates of all domains
priors (dict of actions and values) – the prior of the Q-value
domainInControl (str) – the domain the dialogue is in
scale (float) – a scaling factor used to control exploration during learning
- Returns
the next abstract system action
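A stand-alone sketch of that calculation using the standard Bayesian committee machine fusion for scalar Q-values; the exact way the scale parameter and the priors enter _bayes_committee_calculator may differ, and the toy input below is illustrative only:

    import numpy as np

    def bayes_committee_choice(domainQs, priors, scale=1.0, rng=np.random):
        """Per action, fuse the member Gaussians N(mu_i, var_i) into one Gaussian,
        sample a Q-value from it, and return the action with the highest sample."""
        actions = next(iter(domainQs.values())).keys()
        samples = {}
        for a in actions:
            members = [domainQs[d][a] for d in domainQs]   # [{'mu': .., 'variance': ..}, ...]
            m = len(members)
            # BCM fusion: combined precision = sum of member precisions - (m-1) * prior precision
            inv_var = sum(1.0 / q['variance'] for q in members) - (m - 1) / priors[a]
            var = 1.0 / inv_var
            mu = var * sum(q['mu'] / q['variance'] for q in members)
            samples[a] = rng.normal(mu, scale * np.sqrt(max(var, 0.0)))
        return max(samples, key=samples.get)

    # toy input: two domains, two abstract actions
    domainQs = {'D1': {'request_slot0': {'mu': 1.0, 'variance': 0.5},
                       'inform':        {'mu': 0.2, 'variance': 0.3}},
                'D2': {'request_slot0': {'mu': 0.8, 'variance': 0.4},
                       'inform':        {'mu': 0.5, 'variance': 0.6}}}
    priors = {'request_slot0': 5.0, 'inform': 5.0}
    print(bayes_committee_choice(domainQs, priors, scale=1.0))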
- _set_multi_agent_learning_weights(comm_meansVars, chosen_act)¶
Set reward scalings for each committee member. Implements NAIVE approach from “Multi-agent learning in multi-domain spoken dialogue systems”, Milica Gasic et al. 2015.
- Parameters
comm_meansVars (dict of domains and dict of actions and dict of variance/mu and values) – the means and variances of all committee members
chosen_act (str) – the abstract system action to be executed
- Returns
None
- act_on(domainInControl, state)¶
Provides the next system action based on the domain in control and the belief state.
The belief state is mapped to an abstract representation which is used for all committee members.
- Parameters
domainInControl (str) – the domain unique identifier string of the domain in control
state (DialogueState) – the belief state to act on
- Returns
the next system action
- finalizeRecord(reward, domainInControl)¶
Records, for each committee member, the reward and the domain the dialogue has been in.
- Parameters
reward (int) – the final reward to be recorded
domainInControl (str) – the domain the reward was achieved in
- record(reward, domainInControl)¶
Records the turn reward for all committee members. In the case of multiagent learning, the information held in the committee is used along with the reward to record (b, a) and r.
- Parameters
reward (int) – the turn reward to be recorded
domainInControl (str) – the domain the reward was achieved in
- Returns
None
HDCPolicy.py - Handcrafted dialogue manager¶
Copyright CUED Dialogue Systems Group 2015 - 2017
See also
CUED Imports/Dependencies:
import policy.Policy
import policy.PolicyUtils
import policy.SummaryUtils
import utils.Settings
import utils.ContextLogger
- class policy.HDCPolicy.HDCPolicy(domainString)¶
The handcrafted policy derives from the Policy base class. Based on the slots defined in the ontology and fixed thresholds, it defines a rule-based policy.
If no information is provided by the user, the system will always ask for the slot information in the same order, based on the ontology.
GPPolicy.py - Gaussian Process policy¶
Copyright CUED Dialogue Systems Group 2015 - 2017
Relevant Config variables [Default values]:
[gppolicy]
kernel = polysort
thetafile = ''
See also
CUED Imports/Dependencies:
import policy.GPLib
import policy.Policy
import policy.PolicyCommittee
import ontology.Ontology
import utils.Settings
import utils.ContextLogger
- class policy.GPPolicy.GPPolicy(domainString, learning, sharedParams=None)¶
An implementation of the dialogue policy based on a Gaussian process and the GPSARSA algorithm to optimise actions, where states are GPState and actions are GPAction.
The class implements the public interfaces from Policy and CommitteeMember.
- class policy.GPPolicy.Kernel(kernel_type, theta, der=None, action_kernel_type='delta', action_names=None, domainString=None)¶
The Kernel class defining the kernel for the GPSARSA algorithm.
The kernel is usually divided into a belief part, where a dot product or an RBF kernel is used, and an action part, where either the delta function or a handcrafted or distributed kernel is used.
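A stand-alone sketch of such a factored kernel, assuming plain belief vectors and string action names; the actual Kernel class operates on GPState/GPAction objects, and the configured 'polysort' kernel may differ from the simple linear form shown here:

    import numpy as np

    def belief_kernel(b1, b2, kernel_type='linear', sigma=1.0):
        """Belief part: dot product or RBF kernel over belief vectors."""
        b1, b2 = np.asarray(b1, float), np.asarray(b2, float)
        if kernel_type == 'rbf':
            return float(np.exp(-np.sum((b1 - b2) ** 2) / (2.0 * sigma ** 2)))
        return float(np.dot(b1, b2))

    def action_kernel(a1, a2):
        """Action part: delta function (1 if the actions are identical, else 0)."""
        return 1.0 if a1 == a2 else 0.0

    def state_action_kernel(b1, a1, b2, a2, kernel_type='linear'):
        # the combined kernel factorises into a belief part times an action part
        return belief_kernel(b1, b2, kernel_type) * action_kernel(a1, a2)

    print(state_action_kernel([0.7, 0.2, 0.1], 'request_area',
                              [0.6, 0.3, 0.1], 'request_area'))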
- class policy.GPPolicy.GPAction(action, numActions, replace={})¶
Definition of summary action used for GP-SARSA.
- class policy.GPPolicy.GPState(belief, keep_none=False, replace={}, domainString=None)¶
Definition of the state representation needed for the GP-SARSA algorithm. The main requirement is the ability to compute the kernel function over two states.
- class policy.GPPolicy.TerminalGPAction¶
Class representing the action object recorded in the (b,a) pair along with the final reward.
- class policy.GPPolicy.TerminalGPState¶
Basic object to explicitly denote the terminal state. The dialogue always transitions into this state at completion.
GPLib.py - Gaussian Process SARSA algorithm¶
Copyright CUED Dialogue Systems Group 2015 - 2017
This module encapsulates all classes and functionality which implement the GPSARSA algorithm for dialogue learning.
Relevant Config variables [Default values]. X is the domain tag:
[gpsarsa_X]
saveasprior = False
random = False
learning = False
gamma = 1.0
sigma = 5.0
nu = 0.001
scale = -1
numprior = 0
See also
CUED Imports/Dependencies:
import utils.Settings
import utils.ContextLogger
import policy.PolicyUtils
- class policy.GPLib.GPSARSA(in_policyfile, out_policyfile, domainString=None, learning=False, sharedParams=None)¶
Derives from GPSARSAPrior
Implements the GPSARSA algorithm, where the mean can have a predefined value: self._num_prior specifies the number of means and self._prior specifies the prior; if not specified, a zero mean is assumed.
Parameters needed to estimate the GP posterior:
self._K_tida_inv – inverse of the Gram matrix of dictionary state-action pairs
self.sharedParams['_C_tilda'] – covariance function needed to estimate the final variance of the posterior
self.sharedParams['_c_tilda'] – vector needed to calculate self.sharedParams['_C_tilda']
self.sharedParams['_alpha_tilda'] – vector needed to estimate the mean of the posterior
self.sharedParams['_d'] and self.sharedParams['_s'] – sufficient statistics needed for the iterative estimation of the posterior
Parameters needed for the policy selection:
self._random – random policy choice
self._scale – scaling of the standard deviation when sampling the Q-value; if -1, the mean is taken
self.learning – if true, the policy is in learning mode
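The role of these quantities can be illustrated with a stand-alone sketch: for a state-action pair with kernel vector k against the dictionary points, the posterior mean is k·alpha_tilda and the posterior variance is k(x,x) - k·C_tilda·k; a Q-value is then sampled per action (or the mean is taken if scale is -1) and the best one chosen. This is illustrative only and glosses over the dictionary and sparsification machinery:

    import numpy as np

    def sample_q(k_vec, k_xx, alpha_tilda, C_tilda, scale=3.0, rng=np.random):
        """Posterior Q for one state-action pair given its kernel vector against
        the dictionary points (k_vec) and its self-kernel value (k_xx)."""
        mean = float(k_vec @ alpha_tilda)
        var = max(float(k_xx - k_vec @ C_tilda @ k_vec), 0.0)
        if scale == -1:                    # scale = -1: act greedily on the mean
            return mean
        return rng.normal(mean, scale * np.sqrt(var))

    # toy posterior over a dictionary of 3 state-action pairs
    alpha_tilda = np.array([0.4, -0.1, 0.7])
    C_tilda = 0.1 * np.eye(3)
    candidates = {'request_area': (np.array([0.9, 0.1, 0.0]), 1.0),
                  'inform':       (np.array([0.2, 0.5, 0.3]), 1.0)}
    samples = {a: sample_q(k, kxx, alpha_tilda, C_tilda, scale=3.0)
               for a, (k, kxx) in candidates.items()}
    print(max(samples, key=samples.get))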
- class policy.GPLib.GPSARSAPrior(in_policyfile, out_policyfile, numPrior=- 1, learning=False, domainString=None, sharedParams=None)¶
Defines the GP prior. Derives from LearnerInterface.
- class policy.GPLib.LearnerInterface¶
This class defines the basic interface for the GPSARSA algorithm.
Specifies the policy files:
self._inputDictFile – input dictionary file
self._inputParamFile – input parameter file
self._outputDictFile – output dictionary file
self._outputParamFile – output parameter file
The self.initial and self.terminal flags are needed for learning to specify the initial and terminal states in the episode.
HDCTopicManager.py - policy for the front end topic manager¶
Copyright CUED Dialogue Systems Group 2015 - 2017
See also
CUED Imports/Dependencies:
import policy.Policy
import utils.Settings
import utils.ContextLogger
- class policy.HDCTopicManager.HDCTopicManagerPolicy(dstring=None, learning=None)¶
Manages the dialogue while the topic/domain of the conversation is being determined.
At the current stage, this only happens at the beginning of the dialogue, so this policy has to take care of welcoming the user as well as creating actions which disambiguate/clarify the topic of the interaction.
It allows the system to hang up if the topic could not be identified after a specified number of attempts.
WikipediaTools.py - basic tools to access wikipedia¶
Copyright CUED Dialogue Systems Group 2015 - 2017
See also
CUED Imports/Dependencies:
import policy.Policy
import utils.Settings
import utils.ContextLogger
- class policy.WikipediaTools.WikipediaDM¶
Dialogue Manager interface to Wikipedia – development state.
SummaryAction.py - Mapping between summary and master actions¶
Copyright CUED Dialogue Systems Group 2015 - 2017
See also
CUED Imports/Dependencies:
import policy.SummaryUtils
import ontology.Ontology
import utils.ContextLogger
import utils.Settings
- class policy.SummaryAction.SummaryAction(domainString, empty=False, confreq=False)¶
The summary action class encapsulates the functionality of a summary action along with the conversion from summary to master actions.
Note
The list of all possible summary actions is defined in this class.
SummaryUtils.py - summarises dialogue events for mapping from master to summary belief¶
Copyright CUED Dialogue Systems Group 2015 - 2017
- Basic Usage:
>>> import SummaryUtils
Note
No classes; collection of utility methods
Local module variables:
global_summary_features: (list) global actions/methods
REQUESTING_THRESHOLD: (float) 0.5 – the minimum value to consider a slot requested
See also
CUED Imports/Dependencies:
import ontology.Ontology
import utils.Settings
import utils.ContextLogger
PolicyUtils.py - Utility Methods for Policies¶
Copyright CUED Dialogue Systems Group 2015 - 2017
Note
PolicyUtils.py is a collection of utility functions only (No classes).
Local/file variables:
ZERO_THRESHOLD: unused
REQUESTING_THRESHOLD: affects getRequestedSlots() method
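As a rough illustration of how such a threshold is typically applied (the belief layout and helper name below are hypothetical; the real getRequestedSlots() reads PyDial's belief structure):

    REQUESTING_THRESHOLD = 0.5

    def get_requested_slots(requested_beliefs):
        """Return the slots whose 'requested' probability exceeds the threshold.
        `requested_beliefs` is a hypothetical {slot: probability} dict."""
        return [slot for slot, prob in requested_beliefs.items()
                if prob > REQUESTING_THRESHOLD]

    print(get_requested_slots({'phone': 0.8, 'area': 0.2, 'postcode': 0.55}))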
See also
CUED Imports/Dependencies:
import ontology.Ontology
import utils.DiaAct
import utils.Settings
import policy.SummaryUtils
import utils.ContextLogger
- policy.PolicyUtils.REQUESTING_THRESHOLD = 0.5¶
Methods for global action.
- policy.PolicyUtils.add_venue_count(input, belief, domainString)¶
Add venue count.
- Parameters
input – String input act.
belief – Belief state
domainString (str) – domain tag like ‘SFHotels’
- Returns
act with venue count.
- policy.PolicyUtils.checkDirExistsAndMake(fullpath)¶
Used when saving a policy – if the directory doesn't exist, it is created.
- policy.PolicyUtils.getGlobalAction(belief, globalact, domainString)¶
Method for global action: returns action
- Parameters
belief (dict) – full belief state
globalact (str) – the global action name, e.g. ‘INFORM_REQUESTED’
domainString (str) – domain tag
- Returns
(str) action
- policy.PolicyUtils.getInformAcceptedSlotsAboutEntity(acceptanceList, ent, numFeats)¶
Method for the global inform action: returns a filled-out inform() string. Needs to be cleaned up (Dongho).
- Parameters
acceptanceList (dict) – of slots with value:prob mass pairs
ent (dict) – slot:value properties for this entity
numFeats (int) – result of globalOntology.entity_by_features(acceptedValues)
- Returns
(str) filled out inform() act
- policy.PolicyUtils.getInformAction(numAccepted, belief, domainString)¶
Method for the global inform action: returns an inform act via the getInformExactEntity() method, or null() if not enough slots are accepted.
- Parameters
belief (dict) – full belief state
numAccepted (int) – number of slots with > 80% probability mass
domainString (str) – domain tag
- Returns
getInformExactEntity(acceptanceList,numAccepted)
- policy.PolicyUtils.getInformExactEntity(acceptanceList, numAccepted, domainString)¶
Method for global inform action: creates inform act with none or an entity
- Parameters
acceptanceList (dict) – of slots with value:prob mass pairs
numAccepted (int) – number of accepted slots (> 80% probability mass)
domainString (str) – domain tag
- Returns
getInformNoneVenue() or getInformAcceptedSlotsAboutEntity() as appropriate
BCM_Tools.py - Script for creating slot abstraction mapping files¶
Copyright CUED Dialogue Systems Group 2015 - 2017
Note
Collection of utility classes and methods
See also
CUED Imports/Dependencies:
import ontology.Ontology
import utils.Settings
import utils.ContextLogger
This script is used to create a mapping from slot names to abstract slots (like slot0, slot1, etc.), ordered from highest entropy to lowest. The mapping is written to a JSON file.
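A minimal sketch of that idea, with hypothetical value counts standing in for the ontology/database statistics the real script reads:

    import json
    import math

    def slot_entropy(value_counts):
        """Shannon entropy (in bits) of the value distribution of one slot."""
        total = float(sum(value_counts.values()))
        probs = [c / total for c in value_counts.values() if c > 0]
        return -sum(p * math.log(p, 2) for p in probs)

    # hypothetical value counts for three slots
    slots = {'food':       {'chinese': 30, 'indian': 25, 'italian': 20, 'thai': 15},
             'area':       {'north': 40, 'south': 35, 'centre': 25},
             'pricerange': {'cheap': 60, 'expensive': 40}}

    # highest entropy gets slot0, next gets slot1, and so on
    ranked = sorted(slots, key=lambda s: slot_entropy(slots[s]), reverse=True)
    mapping = {slot: 'slot%d' % i for i, slot in enumerate(ranked)}
    print(json.dumps(mapping, indent=2))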
DeepRL Policies¶
A2CPolicy.py - Advantage Actor-Critic policy¶
Copyright CUED Dialogue Systems Group 2015 - 2017
The implementation of the advantage actor-critic with the temporal difference as an approximation of the advantage function. The network is defined in DRL.a2c.py. You can turn on importance sampling through the parameter A2CPolicy.importance_sampling.
The details of the implementation can be found here: https://arxiv.org/abs/1707.00130
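In this formulation the advantage of a turn is approximated by the one-step temporal-difference error; a stand-alone sketch of the advantage and critic targets (network code, the loss terms themselves and the optional importance-sampling correction are omitted):

    import numpy as np

    def a2c_targets(rewards, values, gamma=0.99):
        """TD error as advantage estimate: A_t = r_t + gamma * V(s_{t+1}) - V(s_t).
        `values` holds V(s_0..s_T) with V(s_T) = 0 for the terminal state."""
        rewards, values = np.asarray(rewards, float), np.asarray(values, float)
        advantages = rewards + gamma * values[1:] - values[:-1]
        # actor loss ~ -sum_t A_t * log pi(a_t|s_t); critic regresses V(s_t) towards these targets
        critic_targets = rewards + gamma * values[1:]
        return advantages, critic_targets

    adv, targets = a2c_targets(rewards=[-1, -1, 20], values=[0.5, 1.0, 4.0, 0.0])
    print(adv, targets)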
ACERPolicy.py - Sample Efficient Actor Critic with Experience Replay¶
Copyright CUED Dialogue Systems Group 2015 - 2017
The implementation of the sample efficient actor critic with truncated importance sampling with bias correction, the trust region policy optimization method and RETRACE-like multi-step estimation of the value function. The parameters ACERPolicy.c, ACERPolicy.alpha, ACERPolicy. The details of the implementation can be found here: https://arxiv.org/abs/1802.03753
See also: https://arxiv.org/abs/1611.01224 https://arxiv.org/abs/1606.02647
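A stand-alone sketch of two of the ingredients named above, truncated importance weights and a Retrace-like multi-step target; the trust-region update and all network code are omitted, and the variable names are illustrative:

    import numpy as np

    def retrace_targets(rewards, q_sa, v_s, rho, c=1.0, gamma=0.99):
        """Retrace-like multi-step Q targets, computed backwards:
        Q_ret(t) = r_t + gamma * [min(c, rho_{t+1}) * (Q_ret(t+1) - Q(s_{t+1}, a_{t+1})) + V(s_{t+1})]
        where rho_t = pi(a_t|s_t) / mu(a_t|s_t) is the importance weight."""
        T = len(rewards)
        q_ret = np.zeros(T)
        next_q_ret = 0.0                       # terminal state: Q_ret = 0
        for t in reversed(range(T)):
            if t == T - 1:
                q_ret[t] = rewards[t]          # no bootstrap past the end of the episode
            else:
                trunc = min(c, rho[t + 1])     # truncated importance weight
                q_ret[t] = rewards[t] + gamma * (trunc * (next_q_ret - q_sa[t + 1]) + v_s[t + 1])
            next_q_ret = q_ret[t]
        return q_ret

    print(retrace_targets(rewards=[-1, -1, 20], q_sa=[1.0, 2.0, 5.0],
                          v_s=[0.8, 1.5, 4.0], rho=[1.0, 0.7, 1.3]))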
DQNPolicy.py - deep Q network policy¶
Copyright CUED Dialogue Systems Group 2015 - 2017
Warning
Documentation not done.
ENACPolicy.py - Episodic Natural Actor-Critic policy¶
Copyright CUED Dialogue Systems Group 2015 - 2017
The implementation of episodic natural actor-critic. The vanilla gradients are computed in DRL/enac.py using Tensorflow, and then the natural gradient is obtained through the function train. You can turn on importance sampling through the parameter ENACPolicy.importance_sampling.
The details of implementation can be found here: https://arxiv.org/abs/1707.00130 See also: https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2007-125.pdf
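A stand-alone sketch of the episodic natural actor-critic idea: per episode, the summed score function and the episode return are collected, and the natural gradient is the least-squares solution of a regression of returns on those features (a constant column absorbs the baseline). This is a schematic illustration, not the code in DRL/enac.py:

    import numpy as np

    def enac_natural_gradient(score_sums, returns):
        """score_sums[i] = sum_t grad_theta log pi(a_t|s_t) for episode i (shape [N, d]),
        returns[i] = total (discounted) return of episode i.
        Solve [score_sums, 1] @ [w; b] ~= returns; w approximates the natural gradient."""
        score_sums = np.asarray(score_sums, float)
        returns = np.asarray(returns, float)
        X = np.hstack([score_sums, np.ones((len(returns), 1))])   # add baseline column
        sol, *_ = np.linalg.lstsq(X, returns, rcond=None)
        return sol[:-1], sol[-1]                                  # natural gradient w, baseline b

    w, b = enac_natural_gradient(score_sums=[[0.2, -0.1], [0.5, 0.3], [-0.4, 0.1]],
                                 returns=[10.0, 12.0, 3.0])
    print(w, b)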
TRACERPolicy.py - Trust region advantage Actor-Critic policy with experience replay¶
Copyright CUED Dialogue Systems Group 2015 - 2017
The implementation of the actor-critic algorithm with off-policy learning and a trust region constraint for stable training. The definition of the network and the approximation of the natural gradient are computed in DRL.na2c.py. You can turn on importance sampling through the parameter TRACERPolicy.importance_sampling.
The details of the implementation can be found here: https://arxiv.org/abs/1707.00130
See also: https://arxiv.org/abs/1611.01224 https://pdfs.semanticscholar.org/c79d/c0bdb138e5ca75445e84e1118759ac284da0.pdf
FeudalGainPolicy.py - Information Gain for FeudalRL policies¶
Copyright 2019-2021 HHU Dialogue Systems and Machine Learning Group
The implementation of the FeudalGain algorithm that incorporates information gain as intrinsic reward in order to update a Feudal policy. Information gain is defined as the change in probability distributions between consecutive turns in the belief state. The distribution change is measured using the Jensen-Shannon divergence. FeudalGain builds upon the Feudal Dialogue Management architecture and optimises the information-seeking policy to maximise information gain. If the information-seeking policy for instance requests the area of a restaurant, the information gain reward is calculated by the Jensen-Shannon divergence of the value distributions for area before and after the request.
The details can be found here: https://arxiv.org/abs/2109.07129
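A stand-alone sketch of that intrinsic reward, using hypothetical value distributions for one slot before and after a turn:

    import numpy as np

    def jensen_shannon(p, q, eps=1e-12):
        """JSD(p, q) = 0.5*KL(p||m) + 0.5*KL(q||m) with m = (p + q)/2, in bits."""
        p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
        p, q = p / p.sum(), q / q.sum()
        m = 0.5 * (p + q)
        kl = lambda a, b: np.sum(a * np.log2(a / b))
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    # value distribution of the 'area' slot before and after requesting it
    before = [0.25, 0.25, 0.25, 0.25]   # uninformed
    after = [0.70, 0.10, 0.10, 0.10]    # user answered "north"
    information_gain_reward = jensen_shannon(before, after)
    print(information_gain_reward)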
FeudalRL Policies¶
Traditional Reinforcement Learning algorithms fail to scale to large domains due to the curse of dimensionality. A novel Dialogue Management architecture based on Feudal RL decomposes the decision into two steps: a first step where a master policy selects a subset of primitive actions, and a second step where a primitive action is chosen from the selected subset. The structural information included in the domain ontology is used to abstract the dialogue state space, taking the decisions at each step using different parts of the abstracted state. This, combined with an information sharing mechanism between slots, increases the scalability to large domains.
For more information, please look at the paper Feudal Reinforcement Learning for Dialogue Management in Large Domains.
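A schematic sketch of the two-step decision (the master and worker policies here are random stand-ins; in the actual architecture they are trained and operate on the abstracted, slot-level state):

    import random

    # hypothetical action sets: the master policy first picks a subset,
    # then a worker policy picks a primitive action from that subset
    ACTION_SUBSETS = {
        'slot_independent': ['inform', 'request_more', 'bye'],
        'slot_dependent':   ['request_area', 'confirm_area', 'request_food', 'confirm_food'],
    }

    def master_policy(abstract_state):
        # step 1: choose which subset of primitive actions to consider
        return random.choice(list(ACTION_SUBSETS))

    def worker_policy(abstract_state, subset_name):
        # step 2: choose a primitive action from the selected subset,
        # using only the part of the abstracted state relevant to that subset
        return random.choice(ACTION_SUBSETS[subset_name])

    state = {'slot_beliefs': {'area': [0.7, 0.3], 'food': [0.5, 0.5]}}
    subset = master_policy(state)
    print(subset, '->', worker_policy(state, subset))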