Policy¶
Policy.py - abstract class for all policies¶
Copyright CUED Dialogue Systems Group 2015 - 2017
See also
CUED Imports/Dependencies:
import utils.Settings
import utils.DiaAct
import utils.ContextLogger
import ontology.OntologyUtils
import policy.SummaryAction
- class policy.Policy.Action(action)¶
Dummy class representing one action. Used for recording and may be overridden by sub-class.
- class policy.Policy.Episode(dstring=None)¶
An episode encapsulates the state-action-reward triplets which may be used for learning. Every entry represents one turn. The last entry should contain TerminalState and TerminalAction.
- check()¶
Checks whether the lengths of the internal state, action, and reward lists are equal.
- getWeightedReward()¶
Returns the reward weighted by normalised accumulated weights. Used for multiagent learning in committee.
- Returns
the reward weighted by normalised accumulated weights
- record(state, action, reward, ma_weight=None)¶
Stores the state, action, and reward in the internal lists.
- tostring()¶
Prints state, action, and reward lists to screen.
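A rough usage sketch, assuming the PyDial policy package is importable; the domain tag and belief dictionaries are placeholders, and the dummy wrapper classes documented below are used for the entries (whether raw or encapsulated states are stored depends on the calling policy):

    from policy.Policy import Episode, State, Action, TerminalState, TerminalAction

    episode = Episode(dstring='CamRestaurants')   # illustrative domain tag
    # one entry per turn: belief state, system action, turn reward
    episode.record(State({'area': {'north': 0.8}}), Action('request(food)'), reward=-1)
    episode.record(State({'area': {'north': 0.9}}), Action('inform(food="chinese")'), reward=-1)
    # the last entry carries the terminal markers and the final reward
    episode.record(TerminalState(), TerminalAction(), reward=20)
    episode.check()      # state/action/reward lists must have equal length
    episode.tostring()   # print the recorded lists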
- class policy.Policy.EpisodeStack(block_size=100)¶
A handler for episodes. Required if the stack is to become very large: we may not want to hold all episodes in memory, but write them out to file.
- add_episode(domain_episodes)¶
Items on the stack are dictionaries of episodes for each domain (since with BCM we can learn from two or more domains if a multi-domain dialogue occurs).
- retrieve_episode(episode_key)¶
NB: this should probably be an iterator, using yield, rather than return
- class policy.Policy.Policy(domainString, learning=False, specialDomain=False)¶
Interface class for a single domain policy. Responsible for selecting the next system action and handling the learning of the policy.
To create your own policy model or to change the state representation, derive from this class.
- act_on(state)¶
Main policy method: mapping of belief state to system action.
This method is automatically invoked by the agent at each turn after tracking the belief state.
May initially return ‘hello()’ as a hardcoded action. Keeps track of the last system action and the last belief state.
- Parameters
state (DialogueState) – the belief state to act on
hyps (list) – n-best list of semantic interpretations
- Returns
the next system action of type DiaAct
- convertStateAction(state, action)¶
Converts the given state and action to policy-specific representations.
By default, the generic classes State and Action are used. To change this, override this method in a sub-class.
- Parameters
state (anything) – the state to be encapsulated
action (anything) – the action to be encapsulated
- finalizeRecord(reward, domainInControl=None)¶
Records the final reward along with the terminal system action and terminal state. To change the type of state/action, override convertStateAction(). This method is automatically executed by the agent at the end of each dialogue.
- Parameters
reward (int) – the final reward
domainInControl (str) – used by committee: the unique domain string identifier of the domain this dialogue originates in (optional)
- Returns
None
- nextAction(beliefstate)¶
Interface method for selecting the next system action. Should be overridden by sub-class.
This method is automatically executed by act_on() and is thus called at each turn.
- Parameters
beliefstate (dict) – the state the policy acts on
- Returns
the next system action
- record(reward, domainInControl=None, weight=None, state=None, action=None)¶
Records the current turn reward along with the last system action and belief state.
This method is automatically executed by the agent at the end of each turn.
To change the type of state/action, override convertStateAction(). By default, the last master action is recorded. If you want another action to be recorded, e.g., a summary action, assign the respective object to self.actToBeRecorded in a derived class.
- Parameters
reward (int) – the turn reward to be recorded
domainInControl (str) – the domain string unique identifier of the domain the reward originates in
weight (float) – used by committee: the weight of the reward in case of multiagent learning
state (dict) – used by committee: the belief state to be recorded
action (str) – used by committee: the action to be recorded
- Returns
None
- restart()¶
Restarts the policy. Resets internal variables.
This method is automatically executed by the agent at the end/beginning of each dialogue.
- savePolicy(FORCE_SAVE=False)¶
Saves the learned policy model to file. Should be overridden by sub-class.
This method is automatically executed by the agent either at certain intervals or at least before shutting down the agent.
- Parameters
FORCE_SAVE (bool) – used to force cleaning up of any learning and saving when we are powering off an agent.
- train()¶
Interface method for initiating the training. Should be overridden by sub-class.
This method is automatically executed by the agent at the end of each dialogue if learning is True.
This method is called at the end of each dialogue by PolicyManager if learning is enabled for the given domain policy.
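A schematic sub-class sketch of the interface described above, assuming the PyDial policy package is importable; the returned act and the method bodies are placeholders, only the method names follow this documentation:

    import policy.Policy as Policy

    class MyEchoPolicy(Policy.Policy):
        """Toy policy illustrating which methods a sub-class typically overrides."""

        def __init__(self, domainString, learning=False):
            super(MyEchoPolicy, self).__init__(domainString, learning)

        def nextAction(self, beliefstate):
            # called by act_on() at every turn; map the belief to the next system action
            # (a real policy would inspect `beliefstate` here)
            return 'request(area)'

        def train(self):
            # called at the end of each dialogue if learning is enabled;
            # update the model from the recorded episodes here
            pass

        def savePolicy(self, FORCE_SAVE=False):
            # called at intervals and before shutdown; persist the model to file here
            pass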
- class policy.Policy.State(state)¶
Dummy class representing one state. Used for recording and may be overridden by sub-class.
- class policy.Policy.TerminalAction¶
Dummy class representing one terminal action. Used for recording and may be overridden by sub-class.
- class policy.Policy.TerminalState¶
Dummy class representing one terminal state. Used for recording and may be overridden by sub-class.
PolicyManager.py - container for all policies¶
Copyright CUED Dialogue Systems Group 2015 - 2017
See also
CUED Imports/Dependencies:
import utils.Settings
import utils.ContextLogger
import ontology.Ontology
import ontology.OntologyUtils
- class policy.PolicyManager.PolicyManager¶
The policy manager manages the policies for all domains.
It provides the interface to get the next system action based on the current belief state in act_on() and to initiate the learning of the policy in train().
- _check_committee(committee)¶
Safety tool - should check some logical requirements on the list of domains given by the config
- Parameters
committee (PolicyCommittee) – the committee to be checked
- _load_committees()¶
Loads and instantiates the committee as configured in config file. The new object is added to the internal dictionary.
- _load_domains_policy(domainString=None)¶
Loads and instantiates the respective policy as configured in config file. The new object is added to the internal dictionary.
Default is ‘hdc’.
- Parameters
domainString (str) – the domain the policy will work on. Default is None.
- Returns
the new policy object
- act_on(dstring, state)¶
Main policy method which maps the provided belief to the next system action. This is called at each turn by DialogueAgent.
- Parameters
dstring (str) – the domain string unique identifier.
state (DialogueState) – the belief state the policy should act on
- Returns
the next system action as DiaAct
- bootup(domainString)¶
Loads a policy for a given domain.
- finalizeRecord(domainRewards)¶
Records the final rewards of all domains. In case of a committee, the recording is delegated.
This method is called once at the end of each dialogue by the DialogueAgent. (One dialogue may contain multiple domains.)
- Parameters
domainRewards (dict) – a dictionary mapping from domains to final rewards
- Returns
None
- getLastSystemAction(domainString)¶
Returns the last system action of the specified domain.
- Parameters
domainString (str) – the domain string unique identifier.
- Returns
the last system action of the given domain or None
- printEpisodes()¶
Prints the recorded episode of the current dialogue.
- record(reward, domainString)¶
Records the current turn reward for the given domain. In case of a committee, the recording is delegated.
This method is called each turn by the DialogueAgent.
- Parameters
reward (int) – the turn reward to be recorded
domainString (str) – the domain string unique identifier of the domain the reward originates in
- Returns
None
- restart()¶
Restarts all policies of all domains and resets internal variables.
- savePolicy(FORCE_SAVE=False)¶
Initiates saving of the policies of all domains.
- Parameters
FORCE_SAVE (bool) – used to force cleaning up of any learning and saving when we are powering off an agent.
- train(training_vec=None)¶
Initiates the training for the policies of all domains. This is called at the end of each dialogue by DialogueAgent.
PolicyCommittee.py - implementation of the Bayesian committee machine for dialogue management¶
Copyright CUED Dialogue Systems Group 2015 - 2017
See also
CUED Imports/Dependencies:
import utils.Settings
import utils.ContextLogger
import utils.DiaAct
- class policy.PolicyCommittee.CommitteeMember¶
Base class defining the interface methods which are needed in addition to the basic functionality provided by Policy. Committee members should derive from this class.
- abstract_actions(actions)¶
Converts a list of domain acts to their abstract form
- Parameters
actions (list of actions) – the actions to be abstracted
- getMeanVar_for_executable_actions(belief, abstracted_currentstate, nonExecutableActions)¶
Computes the mean and variance of the Q value based on the abstracted belief state for each executable action.
- Parameters
belief (dict) – the unabstracted current domain belief
abstracted_currentstate (State or subclass) – the abstracted current belief
nonExecutableActions (list) – actions which are not selected for execution based on a heuristic
- getPriorVar(belief, act)¶
Returns prior variance for a given belief and action
- Parameters
belief (dict) – the unabstracted current domain belief state
act (str) – the unabstracted action
- get_Action(action)¶
Converts the unabstracted domain action into an abstracted action to be used for multiagent learning.
- Parameters
action (str) – the last system action
- get_State(beliefstate, keep_none=False)¶
Converts the unabstracted domain state into an abstracted belief state to be used with getMeanVar_for_executable_actions().
- Parameters
beliefstate (dict) – the unabstracted belief state
- unabstract_action(actions)¶
Converts a list of abstract acts to their domain form
- Parameters
actions (list of actions) – the actions to be unabstracted
- class policy.PolicyCommittee.PolicyCommittee(policyManager, committeeMembers, learningmethod)¶
Manages everything related to the policy committee. All committee members must inherit from Policy and CommitteeMember.
- _bayes_committee_calculator(domainQs, priors, domainInControl, scale)¶
Given means and variances of committee members - forms the Bayesian committee distribution for each action, draws sample from each, returns act with highest sample.
Note
this implementation is probably slow – domainQs could be reformatted and this redone via matrices and slicing
- Parameters
domainQs (dict of domains and dict of actions and dict of variance/mu and values) – the means and variances of all Q-value estimates of all domains
priors (dict of actions and values) – the prior of the Q-value
domainInControl (str) – the domain the dialogue is in
scale (float) – a scaling factor used to control exploration during learning
- Returns
the next abstract system action
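A stand-alone sketch of that calculation using the standard Bayesian committee machine fusion for scalar Q-values; the exact way the scale parameter and the priors enter _bayes_committee_calculator may differ, and the toy input below is illustrative only:

    import numpy as np

    def bayes_committee_choice(domainQs, priors, scale=1.0, rng=np.random):
        """Per action, fuse the member Gaussians N(mu_i, var_i) into one Gaussian,
        sample a Q-value from it, and return the action with the highest sample."""
        actions = next(iter(domainQs.values())).keys()
        samples = {}
        for a in actions:
            members = [domainQs[d][a] for d in domainQs]   # [{'mu': .., 'variance': ..}, ...]
            m = len(members)
            # BCM fusion: combined precision = sum of member precisions - (m-1) * prior precision
            inv_var = sum(1.0 / q['variance'] for q in members) - (m - 1) / priors[a]
            var = 1.0 / inv_var
            mu = var * sum(q['mu'] / q['variance'] for q in members)
            samples[a] = rng.normal(mu, scale * np.sqrt(max(var, 0.0)))
        return max(samples, key=samples.get)

    # toy input: two domains, two abstract actions
    domainQs = {'D1': {'request_slot0': {'mu': 1.0, 'variance': 0.5},
                       'inform':        {'mu': 0.2, 'variance': 0.3}},
                'D2': {'request_slot0': {'mu': 0.8, 'variance': 0.4},
                       'inform':        {'mu': 0.5, 'variance': 0.6}}}
    priors = {'request_slot0': 5.0, 'inform': 5.0}
    print(bayes_committee_choice(domainQs, priors, scale=1.0))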
- _set_multi_agent_learning_weights(comm_meansVars, chosen_act)¶
Set reward scalings for each committee member. Implements NAIVE approach from “Multi-agent learning in multi-domain spoken dialogue systems”, Milica Gasic et al. 2015.
- Parameters
comm_meansVars (dict of domains and dict of actions and dict of variance/mu and values) – the means and variances of all committee members
chosen_act (str) – the abstract system action to be executed
- Returns
None
- act_on(domainInControl, state)¶
Provides the next system action based on the domain in control and the belief state.
The belief state is mapped to an abstract representation which is used for all committee members.
- Parameters
domainInControl (str) – the domain unique identifier string of the domain in control
state (DialogueState) – the belief state to act on
- Returns
the next system action
- finalizeRecord(reward, domainInControl)¶
Records, for each committee member, the reward and the domain the dialogue has been in.
- Parameters
reward (int) – the final reward to be recorded
domainInControl (str) – the domain the reward was achieved in
- record(reward, domainInControl)¶
Records the turn reward for all committee members. In the case of multiagent learning, the information held in the committee is used along with the reward to record (b, a) and r.
- Parameters
reward (int) – the turn reward to be recorded
domainInControl (str) – the domain the reward was achieved in
- Returns
None
HDCPolicy.py - Handcrafted dialogue manager¶
Copyright CUED Dialogue Systems Group 2015 - 2017
See also
CUED Imports/Dependencies:
import policy.Policy
import policy.PolicyUtils
import policy.SummaryUtils
import utils.Settings
import utils.ContextLogger
- class policy.HDCPolicy.HDCPolicy(domainString)¶
The handcrafted policy derives from the Policy base class. Based on the slots defined in the ontology and fixed thresholds, it defines a rule-based policy.
If no information is provided by the user, the system will always ask for the slot information in the same order, based on the ontology.
GPPolicy.py - Gaussian Process policy¶
Copyright CUED Dialogue Systems Group 2015 - 2017
Relevant Config variables [Default values]:
[gppolicy]
kernel = polysort
thetafile = ''
See also
CUED Imports/Dependencies:
import policy.GPLib
import policy.Policy
import policy.PolicyCommittee
import ontology.Ontology
import utils.Settings
import utils.ContextLogger
- class policy.GPPolicy.GPPolicy(domainString, learning, sharedParams=None)¶
An implementation of the dialogue policy based on a Gaussian process and the GPSARSA algorithm to optimise actions, where states are GPState and actions are GPAction.
The class implements the public interfaces from Policy and CommitteeMember.
- class policy.GPPolicy.Kernel(kernel_type, theta, der=None, action_kernel_type='delta', action_names=None, domainString=None)¶
The Kernel class defining the kernel for the GPSARSA algorithm.
The kernel is usually divided into a belief part, where a dot product or an RBF kernel is used, and an action part, where either the delta function or a handcrafted or distributed kernel is used.
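A stand-alone sketch of such a factored kernel, assuming plain belief vectors and string action names; the actual Kernel class operates on GPState/GPAction objects, and the configured 'polysort' kernel may differ from the simple linear form shown here:

    import numpy as np

    def belief_kernel(b1, b2, kernel_type='linear', sigma=1.0):
        """Belief part: dot product or RBF kernel over belief vectors."""
        b1, b2 = np.asarray(b1, float), np.asarray(b2, float)
        if kernel_type == 'rbf':
            return float(np.exp(-np.sum((b1 - b2) ** 2) / (2.0 * sigma ** 2)))
        return float(np.dot(b1, b2))

    def action_kernel(a1, a2):
        """Action part: delta function (1 if the actions are identical, else 0)."""
        return 1.0 if a1 == a2 else 0.0

    def state_action_kernel(b1, a1, b2, a2, kernel_type='linear'):
        # the combined kernel factorises into a belief part times an action part
        return belief_kernel(b1, b2, kernel_type) * action_kernel(a1, a2)

    print(state_action_kernel([0.7, 0.2, 0.1], 'request_area',
                              [0.6, 0.3, 0.1], 'request_area'))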
- class policy.GPPolicy.GPAction(action, numActions, replace={})¶
Definition of summary action used for GP-SARSA.
- class policy.GPPolicy.GPState(belief, keep_none=False, replace={}, domainString=None)¶
Definition of the state representation needed for the GP-SARSA algorithm. The main requirement is the ability to compute the kernel function over two states.
- class policy.GPPolicy.TerminalGPAction¶
Class representing the action object recorded in the (b,a) pair along with the final reward.
- class policy.GPPolicy.TerminalGPState¶
Basic object to explicitly denote the terminal state. The dialogue always transitions into this state at completion.
GPLib.py - Gaussian Process SARSA algorithm¶
Copyright CUED Dialogue Systems Group 2015 - 2017
This module encapsulates all classes and functionality which implement the GPSARSA algorithm for dialogue learning.
Relevant Config variables [Default values]. X is the domain tag:
[gpsarsa_X]
saveasprior = False
random = False
learning = False
gamma = 1.0
sigma = 5.0
nu = 0.001
scale = -1
numprior = 0
See also
CUED Imports/Dependencies:
import utils.Settings
import utils.ContextLogger
import policy.PolicyUtils
- class policy.GPLib.GPSARSA(in_policyfile, out_policyfile, domainString=None, learning=False, sharedParams=None)¶
Derives from GPSARSAPrior
Implements the GPSARSA algorithm, where the mean can have a predefined value: self._num_prior specifies the number of means and self._prior specifies the prior; if not specified, a zero mean is assumed.
Parameters needed to estimate the GP posterior:
self._K_tida_inv – inverse of the Gram matrix of dictionary state-action pairs
self.sharedParams['_C_tilda'] – covariance function needed to estimate the final variance of the posterior
self.sharedParams['_c_tilda'] – vector needed to calculate self.sharedParams['_C_tilda']
self.sharedParams['_alpha_tilda'] – vector needed to estimate the mean of the posterior
self.sharedParams['_d'] and self.sharedParams['_s'] – sufficient statistics needed for the iterative estimation of the posterior
Parameters needed for the policy selection:
self._random – random policy choice
self._scale – scaling of the standard deviation when sampling the Q-value; if -1, the mean is taken
self.learning – if true, the policy is in learning mode
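The role of these quantities can be illustrated with a stand-alone sketch: for a state-action pair with kernel vector k against the dictionary points, the posterior mean is k·alpha_tilda and the posterior variance is k(x,x) - k·C_tilda·k; a Q-value is then sampled per action (or the mean is taken if scale is -1) and the best one chosen. This is illustrative only and glosses over the dictionary and sparsification machinery:

    import numpy as np

    def sample_q(k_vec, k_xx, alpha_tilda, C_tilda, scale=3.0, rng=np.random):
        """Posterior Q for one state-action pair given its kernel vector against
        the dictionary points (k_vec) and its self-kernel value (k_xx)."""
        mean = float(k_vec @ alpha_tilda)
        var = max(float(k_xx - k_vec @ C_tilda @ k_vec), 0.0)
        if scale == -1:                    # scale = -1: act greedily on the mean
            return mean
        return rng.normal(mean, scale * np.sqrt(var))

    # toy posterior over a dictionary of 3 state-action pairs
    alpha_tilda = np.array([0.4, -0.1, 0.7])
    C_tilda = 0.1 * np.eye(3)
    candidates = {'request_area': (np.array([0.9, 0.1, 0.0]), 1.0),
                  'inform':       (np.array([0.2, 0.5, 0.3]), 1.0)}
    samples = {a: sample_q(k, kxx, alpha_tilda, C_tilda, scale=3.0)
               for a, (k, kxx) in candidates.items()}
    print(max(samples, key=samples.get))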
- class policy.GPLib.GPSARSAPrior(in_policyfile, out_policyfile, numPrior=- 1, learning=False, domainString=None, sharedParams=None)¶
Defines the GP prior. Derives from LearnerInterface.
- class policy.GPLib.LearnerInterface¶
This class defines the basic interface for the GPSARSA algorithm.
Specifies the policy files:
self._inputDictFile – input dictionary file
self._inputParamFile – input parameter file
self._outputDictFile – output dictionary file
self._outputParamFile – output parameter file
The self.initial and self.terminal flags are needed for learning to specify the initial and terminal states in the episode.
HDCTopicManager.py - policy for the front end topic manager¶
Copyright CUED Dialogue Systems Group 2015 - 2017
See also
CUED Imports/Dependencies:
import policy.Policy
import utils.Settings
import utils.ContextLogger
- class policy.HDCTopicManager.HDCTopicManagerPolicy(dstring=None, learning=None)¶
Manages the dialogue while the topic/domain of the conversation is being determined.
At the current stage, this only happens at the beginning of the dialogue, so this policy has to take care of welcoming the user as well as creating actions which disambiguate/clarify the topic of the interaction.
It allows the system to hang up if the topic could not be identified after a specified number of attempts.
WikipediaTools.py - basic tools to access wikipedia¶
Copyright CUED Dialogue Systems Group 2015 - 2017
See also
CUED Imports/Dependencies:
import policy.Policy
import utils.Settings
import utils.ContextLogger
- class policy.WikipediaTools.WikipediaDM¶
Dialogue Manager interface to Wikipedia – development state.
SummaryAction.py - Mapping between summary and master actions¶
Copyright CUED Dialogue Systems Group 2015 - 2017
See also
CUED Imports/Dependencies:
import policy.SummaryUtils
import ontology.Ontology
import utils.ContextLogger
import utils.Settings
- class policy.SummaryAction.SummaryAction(domainString, empty=False, confreq=False)¶
The summary action class encapsulates the functionality of a summary action along with the conversion from summary to master actions.
Note
The list of all possible summary actions is defined in this class.
SummaryUtils.py - summarises dialogue events for mapping from master to summary belief¶
Copyright CUED Dialogue Systems Group 2015 - 2017
- Basic Usage:
>>> import SummaryUtils
Note
No classes; collection of utility methods
Local module variables:
global_summary_features: (list) global actions/methods
REQUESTING_THRESHOLD: (float) 0.5 – the minimum value to consider a slot requested
See also
CUED Imports/Dependencies:
import ontology.Ontology
import utils.Settings
import utils.ContextLogger
PolicyUtils.py - Utility Methods for Policies¶
Copyright CUED Dialogue Systems Group 2015 - 2017
Note
PolicyUtils.py is a collection of utility functions only (No classes).
Local/file variables:
ZERO_THRESHOLD: unused
REQUESTING_THRESHOLD: affects getRequestedSlots() method
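As a rough illustration of how such a threshold is typically applied (the belief layout and helper name below are hypothetical; the real getRequestedSlots() reads PyDial's belief structure):

    REQUESTING_THRESHOLD = 0.5

    def get_requested_slots(requested_beliefs):
        """Return the slots whose 'requested' probability exceeds the threshold.
        `requested_beliefs` is a hypothetical {slot: probability} dict."""
        return [slot for slot, prob in requested_beliefs.items()
                if prob > REQUESTING_THRESHOLD]

    print(get_requested_slots({'phone': 0.8, 'area': 0.2, 'postcode': 0.55}))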
See also
CUED Imports/Dependencies:
import ontology.Ontology
import utils.DiaAct
import utils.Settings
import policy.SummaryUtils
import utils.ContextLogger
- policy.PolicyUtils.REQUESTING_THRESHOLD = 0.5¶
Methods for global action.
- policy.PolicyUtils.add_venue_count(input, belief, domainString)¶
Add venue count.
- Parameters
input – String input act.
belief – Belief state
domainString (str) – domain tag like ‘SFHotels’
- Returns
act with venue count.
- policy.PolicyUtils.checkDirExistsAndMake(fullpath)¶
Used when saving a policy – if the directory doesn't exist, it is created.
- policy.PolicyUtils.getGlobalAction(belief, globalact, domainString)¶
Method for global action: returns action
- Parameters
belief (dict) – full belief state
globalact (str) – the global action name, e.g. ‘INFORM_REQUESTED’
domainString (str) – domain tag
- Returns
(str) action
- policy.PolicyUtils.getInformAcceptedSlotsAboutEntity(acceptanceList, ent, numFeats)¶
Method for the global inform action: returns a filled-out inform() string. Needs to be cleaned up (Dongho).
- Parameters
acceptanceList (dict) – of slots with value:prob mass pairs
ent (dict) – slot:value properties for this entity
numFeats (int) – result of globalOntology.entity_by_features(acceptedValues)
- Returns
(str) filled out inform() act
- policy.PolicyUtils.getInformAction(numAccepted, belief, domainString)¶
Method for the global inform action: returns an inform act via the getInformExactEntity() method, or null() if not enough slots are accepted.
- Parameters
belief (dict) – full belief state
numAccepted (int) – number of slots with > 80% probability mass
domainString (str) – domain tag
- Returns
getInformExactEntity(acceptanceList,numAccepted)
- policy.PolicyUtils.getInformExactEntity(acceptanceList, numAccepted, domainString)¶
Method for global inform action: creates inform act with none or an entity
- Parameters
acceptanceList (dict) – of slots with value:prob mass pairs
numAccepted (int) – number of accepted slots (> 80% probability mass)
domainString (str) – domain tag
- Returns
getInformNoneVenue() or getInformAcceptedSlotsAboutEntity() as appropriate
BCM_Tools.py - Script for creating slot abstraction mapping files¶
Copyright CUED Dialogue Systems Group 2015 - 2017
Note
Collection of utility classes and methods
See also
CUED Imports/Dependencies:
import ontology.Ontology
import utils.Settings
import utils.ContextLogger
This script is used to create a mapping from slot names to abstract slots (like slot0, slot1, etc.), ordered from highest entropy to lowest. The mapping is written to a JSON file.
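A minimal sketch of that idea, with hypothetical value counts standing in for the ontology/database statistics the real script reads:

    import json
    import math

    def slot_entropy(value_counts):
        """Shannon entropy (in bits) of the value distribution of one slot."""
        total = float(sum(value_counts.values()))
        probs = [c / total for c in value_counts.values() if c > 0]
        return -sum(p * math.log(p, 2) for p in probs)

    # hypothetical value counts for three slots
    slots = {'food':       {'chinese': 30, 'indian': 25, 'italian': 20, 'thai': 15},
             'area':       {'north': 40, 'south': 35, 'centre': 25},
             'pricerange': {'cheap': 60, 'expensive': 40}}

    # highest entropy gets slot0, next gets slot1, and so on
    ranked = sorted(slots, key=lambda s: slot_entropy(slots[s]), reverse=True)
    mapping = {slot: 'slot%d' % i for i, slot in enumerate(ranked)}
    print(json.dumps(mapping, indent=2))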
DeepRL Policies¶
A2CPolicy.py - Advantage Actor-Critic policy¶
Copyright CUED Dialogue Systems Group 2015 - 2017
The implementation of the advantage actor-critic with the temporal difference as an approximation of the advantage function. The network is defined in DRL.a2c.py. You can turn on importance sampling through the parameter A2CPolicy.importance_sampling.
The details of the implementation can be found here: https://arxiv.org/abs/1707.00130
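In this formulation the advantage of a turn is approximated by the one-step temporal-difference error; a stand-alone sketch of the advantage and critic targets (network code, the loss terms themselves and the optional importance-sampling correction are omitted):

    import numpy as np

    def a2c_targets(rewards, values, gamma=0.99):
        """TD error as advantage estimate: A_t = r_t + gamma * V(s_{t+1}) - V(s_t).
        `values` holds V(s_0..s_T) with V(s_T) = 0 for the terminal state."""
        rewards, values = np.asarray(rewards, float), np.asarray(values, float)
        advantages = rewards + gamma * values[1:] - values[:-1]
        # actor loss ~ -sum_t A_t * log pi(a_t|s_t); critic regresses V(s_t) towards these targets
        critic_targets = rewards + gamma * values[1:]
        return advantages, critic_targets

    adv, targets = a2c_targets(rewards=[-1, -1, 20], values=[0.5, 1.0, 4.0, 0.0])
    print(adv, targets)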
ACERPolicy.py - Sample Efficient Actor Critic with Experience Replay¶
Copyright CUED Dialogue Systems Group 2015 - 2017
The implementation of the sample efficient actor critic with truncated importance sampling with bias correction, the trust region policy optimization method and RETRACE-like multi-step estimation of the value function. The parameters ACERPolicy.c, ACERPolicy.alpha, ACERPolicy. The details of the implementation can be found here: https://arxiv.org/abs/1802.03753
See also: https://arxiv.org/abs/1611.01224 https://arxiv.org/abs/1606.02647
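A stand-alone sketch of two of the ingredients named above, truncated importance weights and a Retrace-like multi-step target; the trust-region update and all network code are omitted, and the variable names are illustrative:

    import numpy as np

    def retrace_targets(rewards, q_sa, v_s, rho, c=1.0, gamma=0.99):
        """Retrace-like multi-step Q targets, computed backwards:
        Q_ret(t) = r_t + gamma * [min(c, rho_{t+1}) * (Q_ret(t+1) - Q(s_{t+1}, a_{t+1})) + V(s_{t+1})]
        where rho_t = pi(a_t|s_t) / mu(a_t|s_t) is the importance weight."""
        T = len(rewards)
        q_ret = np.zeros(T)
        next_q_ret = 0.0                       # terminal state: Q_ret = 0
        for t in reversed(range(T)):
            if t == T - 1:
                q_ret[t] = rewards[t]          # no bootstrap past the end of the episode
            else:
                trunc = min(c, rho[t + 1])     # truncated importance weight
                q_ret[t] = rewards[t] + gamma * (trunc * (next_q_ret - q_sa[t + 1]) + v_s[t + 1])
            next_q_ret = q_ret[t]
        return q_ret

    print(retrace_targets(rewards=[-1, -1, 20], q_sa=[1.0, 2.0, 5.0],
                          v_s=[0.8, 1.5, 4.0], rho=[1.0, 0.7, 1.3]))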
DQNPolicy.py - deep Q network policy¶
Copyright CUED Dialogue Systems Group 2015 - 2017
Warning
Documentation not done.
ENACPolicy.py - Episodic Natural Actor-Critic policy¶
Copyright CUED Dialogue Systems Group 2015 - 2017
The implementation of episodic natural actor-critic. The vanilla gradients are computed in DRL/enac.py using Tensorflow, and then the natural gradient is obtained through the function train. You can turn on importance sampling through the parameter ENACPolicy.importance_sampling.
The details of implementation can be found here: https://arxiv.org/abs/1707.00130 See also: https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2007-125.pdf
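A stand-alone sketch of the episodic natural actor-critic idea: per episode, the summed score function and the episode return are collected, and the natural gradient is the least-squares solution of a regression of returns on those features (a constant column absorbs the baseline). This is a schematic illustration, not the code in DRL/enac.py:

    import numpy as np

    def enac_natural_gradient(score_sums, returns):
        """score_sums[i] = sum_t grad_theta log pi(a_t|s_t) for episode i (shape [N, d]),
        returns[i] = total (discounted) return of episode i.
        Solve [score_sums, 1] @ [w; b] ~= returns; w approximates the natural gradient."""
        score_sums = np.asarray(score_sums, float)
        returns = np.asarray(returns, float)
        X = np.hstack([score_sums, np.ones((len(returns), 1))])   # add baseline column
        sol, *_ = np.linalg.lstsq(X, returns, rcond=None)
        return sol[:-1], sol[-1]                                  # natural gradient w, baseline b

    w, b = enac_natural_gradient(score_sums=[[0.2, -0.1], [0.5, 0.3], [-0.4, 0.1]],
                                 returns=[10.0, 12.0, 3.0])
    print(w, b)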
TRACERPolicy.py - Trust region advantage Actor-Critic policy with experience replay¶
Copyright CUED Dialogue Systems Group 2015 - 2017
The implementation of the actor-critic algorithm with off-policy learning and a trust region constraint for stable training. The definition of the network and the approximation of the natural gradient are computed in DRL.na2c.py. You can turn on importance sampling through the parameter TRACERPolicy.importance_sampling.
The details of the implementation can be found here: https://arxiv.org/abs/1707.00130
See also: https://arxiv.org/abs/1611.01224 https://pdfs.semanticscholar.org/c79d/c0bdb138e5ca75445e84e1118759ac284da0.pdf
FeudalGainPolicy.py - Information Gain for FeudalRL policies¶
Copyright 2019-2021 HHU Dialogue Systems and Machine Learning Group
The implementation of the FeudalGain algorithm that incorporates information gain as intrinsic reward in order to update a Feudal policy. Information gain is defined as the change in probability distributions between consecutive turns in the belief state. The distribution change is measured using the Jensen-Shannon divergence. FeudalGain builds upon the Feudal Dialogue Management architecture and optimises the information-seeking policy to maximise information gain. If the information-seeking policy for instance requests the area of a restaurant, the information gain reward is calculated by the Jensen-Shannon divergence of the value distributions for area before and after the request.
The details can be found here: https://arxiv.org/abs/2109.07129
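A stand-alone sketch of that intrinsic reward, using hypothetical value distributions for one slot before and after a turn:

    import numpy as np

    def jensen_shannon(p, q, eps=1e-12):
        """JSD(p, q) = 0.5*KL(p||m) + 0.5*KL(q||m) with m = (p + q)/2, in bits."""
        p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
        p, q = p / p.sum(), q / q.sum()
        m = 0.5 * (p + q)
        kl = lambda a, b: np.sum(a * np.log2(a / b))
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    # value distribution of the 'area' slot before and after requesting it
    before = [0.25, 0.25, 0.25, 0.25]   # uninformed
    after = [0.70, 0.10, 0.10, 0.10]    # user answered "north"
    information_gain_reward = jensen_shannon(before, after)
    print(information_gain_reward)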
FeudalRL Policies¶
Traditional Reinforcement Learning algorithms fail to scale to large domains due to the curse of dimensionality. A novel Dialogue Management architecture based on Feudal RL decomposes the decision into two steps: a first step where a master policy selects a subset of primitive actions, and a second step where a primitive action is chosen from the selected subset. The structural information included in the domain ontology is used to abstract the dialogue state space, taking the decisions at each step using different parts of the abstracted state. This, combined with an information sharing mechanism between slots, increases the scalability to large domains.
For more information, please look at the paper Feudal Reinforcement Learning for Dialogue Management in Large Domains.
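A schematic sketch of the two-step decision (the master and worker policies here are random stand-ins; in the actual architecture they are trained and operate on the abstracted, slot-level state):

    import random

    # hypothetical action sets: the master policy first picks a subset,
    # then a worker policy picks a primitive action from that subset
    ACTION_SUBSETS = {
        'slot_independent': ['inform', 'request_more', 'bye'],
        'slot_dependent':   ['request_area', 'confirm_area', 'request_food', 'confirm_food'],
    }

    def master_policy(abstract_state):
        # step 1: choose which subset of primitive actions to consider
        return random.choice(list(ACTION_SUBSETS))

    def worker_policy(abstract_state, subset_name):
        # step 2: choose a primitive action from the selected subset,
        # using only the part of the abstracted state relevant to that subset
        return random.choice(ACTION_SUBSETS[subset_name])

    state = {'slot_beliefs': {'area': [0.7, 0.3], 'food': [0.5, 0.5]}}
    subset = master_policy(state)
    print(subset, '->', worker_policy(state, subset))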