Policy module

One way of describing human-computer dialogue is to view it as a sequence of dialogue turns in which each turn consists of a system utterance followed by a user utterance. In the traditional approach to building spoken dialogue systems, the system utterances are determined by hand-coded rules and depend on the input received from the speech recogniser.

The PyDial library provides an RL environment in which you can easily test your existing policies or train your own system.

Hand-crafted policy

A conventional approach to handling human-computer dialogue is to create a set of rules based on the slots defined in the ontology, together with fixed thresholds. Illustrative global rules look as follows:

if global_summary['GLOBAL_BYCONSTRAINTS'] > 0.5 and global_summary['GLOBAL_COUNT80'] > 3:
    act = PolicyUtils.getGlobalAction(belief, 'INFORM_BYNAME', domainString=self.domainString)
elif global_summary['GLOBAL_BYALTERNATIVES'] > 0.5:
    act = PolicyUtils.getGlobalAction(belief, 'INFORM_ALTERNATIVES', domainString=self.domainString)
elif global_summary['GLOBAL_BYNAME'] > 0.5:
    act = PolicyUtils.getGlobalAction(belief, 'INFORM_REQUESTED', domainString=self.domainString)

In order to test this approach you can just run:

pydial chat config/Tut-hdc-CamInfo.cfg

For a real-world dialogue system the number of such rules can be very large, and a designer is required to implement each of them manually. Moreover, maintaining such a system is very challenging. Thus, in the next sections we will analyse statistical methods for learning a dialogue policy.

Dialogue as a control problem

Dialogue may be seen as a control problem: given a distribution over possible belief states, we need to take an action which determines what the system says to the user. We may apply the reinforcement learning framework to this problem, looking for the optimal policy $\pi : \mathcal{B} \times \mathcal{A} \rightarrow [0,1]$ during dialogues with the user. In the learning procedure we update the $Q$-function, which has the form: $$Q^{\pi}(\mathbf{b}, a) = \text{E}_{\pi} \left\{ \sum_{k=0}^{T-t} \gamma^k r_{t+k} \,\Big|\, b_t = \mathbf{b}, a_t = a \right\},$$ where $r_t$ is the reward at time $t$ and $\gamma > 0$ is a discount factor. The policy can then be obtained by: $$\pi(\mathbf{b}) = \arg \max_a \{Q(\mathbf{b}, a) : a \in \mathcal{A} \}.$$
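As a toy illustration of the greedy rule above (the names here are hypothetical, not the PyDial API), one can pick the action that maximises a tabulated $Q$-function:

```python
# Toy sketch of greedy action selection from a Q-function over
# (belief, action) pairs; names are illustrative, not PyDial's API.
def greedy_policy(belief, actions, q_function):
    """Return the action maximising Q(b, a) over the action set."""
    return max(actions, key=lambda a: q_function(belief, a))

# A hand-specified Q-table over two summary actions for one belief state.
q_table = {("b0", "inform"): 0.8, ("b0", "request"): 0.3}
best = greedy_policy("b0", ["inform", "request"],
                     lambda b, a: q_table[(b, a)])
```

In a real system the belief is a distribution rather than a symbol, but the argmax structure of the policy is the same.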

Gaussian Processes-SARSA algorithm

We may model the $Q$-function using a non-parametric approach via Gaussian processes with zero mean and some kernel function $k(\cdot, \cdot)$, i.e.

$$Q(\mathbf{b},a) \sim \mathcal{GP}\left(0, k((\mathbf{b},a),(\mathbf{b},a)) \right).$$

Gaussian processes follow a pure Bayesian framework which allows us to obtain the posterior given a newly collected pair $(\mathbf{b},a)$. The chosen kernel function $k$ defines the correlation of an observation with the points collected so far, which substantially increases the speed of policy learning. Moreover, the non-parametric approach helps prevent over-fitting. This model is combined with the classic RL method SARSA for policy improvement.
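The GP model of $Q$ can be sketched in a few lines of NumPy. The following is only an illustrative GP-regression posterior with an RBF kernel over toy feature vectors standing in for $(\mathbf{b},a)$ pairs, not the implementation in GPPolicy.py:

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    """Squared-exponential kernel between two sets of feature vectors."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior_mean(X_train, y_train, X_test, noise=1e-2):
    """Posterior mean of the GP at X_test given observed returns y_train."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_test, X_train)
    return K_s @ np.linalg.solve(K, y_train)

# Toy feature vectors standing in for (belief, action) pairs,
# with observed discounted returns as targets.
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 0.0])
mu = gp_posterior_mean(X, y, np.array([[1.0]]))
```

The posterior mean at a previously observed point stays close to its observed return, while the kernel propagates information to nearby points.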

You can find more information about this method in Gasic and Young, 2013.

In order to test this model you can just run:

pydial train config/Tut-gp-Multidomain.cfg

Policy committee

PyDial also enables building multi-domain dialogue systems using the Bayesian committee machine (BCM) approach, which builds a so-called 'committee' to take advantage of available dialogue corpora, even when they come from different domains. Committee members are estimators trained on different datasets. Training is analogous to that described in the previous section. At every turn, their estimated $Q$-values are combined to produce the final $Q$-value estimate using the following formula:

$$\overline{Q} = \Sigma^Q(\mathbf{b},a)\sum_{i=1}^M \Sigma^Q_i(\mathbf{b},a)^{-1}\overline{Q}_i(\mathbf{b},a),\\ \Sigma^Q(\mathbf{b},a)^{-1} = -(M-1)\ast k((\mathbf{b},a),(\mathbf{b},a))^{-1} + \sum_{i=1}^M \Sigma_i^Q (\mathbf{b},a)^{-1}$$

It has been shown that such an approach is especially beneficial for adaptation in multi-domain dialogue systems. In order to produce a generic policy that works across multiple domains, the belief state and the action set are mapped to an abstract representation which is used for all committee members.
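The combination rule above can be sketched for a single $(\mathbf{b}, a)$ pair as follows. This is an illustrative NumPy sketch of the formula, not the PolicyCommittee implementation:

```python
import numpy as np

def bcm_combine(q_means, q_vars, prior_var):
    """Combine per-member Q estimates for one (belief, action) pair.

    q_means, q_vars: mean and variance of each member's Q estimate.
    prior_var: the prior variance k((b,a),(b,a)).
    """
    q_means, q_vars = np.asarray(q_means), np.asarray(q_vars)
    M = len(q_means)
    # Inverse of the combined variance, per the formula above.
    inv_var = -(M - 1) / prior_var + np.sum(1.0 / q_vars)
    combined_var = 1.0 / inv_var
    # Precision-weighted combination of the member means.
    combined_mean = combined_var * np.sum(q_means / q_vars)
    return combined_mean, combined_var

m, v = bcm_combine([0.5], [2.0], prior_var=1.0)
```

With a single member ($M=1$) the rule reduces to that member's own estimate, which is a quick sanity check on the formula.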

You can find more information about this method in Gasic et al., 2016.

In order to test this model you can just run:

pydial train config/Tut-bcm-Multidomain.cfg


In PyDial there is a dedicated module for policy learning under the directory policy. Policy, an abstract class that defines the interface for a single-domain policy, is in the Policy.py script. It has all the functions required to build a generic policy model.

The GP-SARSA dialogue policy is implemented in GPPolicy.py, with additional functionality in GPLib.py. It derives from the Policy class.

The hand-crafted policy is implemented in HDCPolicy.py. It can be run with all provided domains.

The Bayesian committee machine model is implemented in PolicyCommittee.py, with the class CommitteeMember providing interface methods for single domains. The policy for a committee is handled by the PolicyCommittee class.

Configuration file

You can specify detailed parameters of your policy through the configuration file. The general settings are under the section `policy_DOMAIN`:
belieftype = baseline or focus
useconfreq = False or True      # use confirm request ?
learning = False or True        # policy learning?
outpolicyfile = ''  
inpolicyfile = ''           
policytype = hdc or gp           
startwithhello = False or True   

You can choose the belief tracker using belieftype. If you use the HDC policy you can choose to confirm requests using the useconfreq option. Setting learning to True will train a model, and you can specify whether to train from scratch or from a pre-existing model by providing a path in inpolicyfile. The policy can be saved to a given path via outpolicyfile. Finally, startwithhello is a domain-dependent setting; it is overruled and set to True when using the single-domain option.
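As an illustration, a policy section for the CamRestaurants domain might look as follows (the values shown are examples, not defaults):

```ini
[policy_CamRestaurants]
belieftype = focus
useconfreq = False
learning = True
outpolicyfile = policies/CamRestaurants-gp
policytype = gp
startwithhello = False
```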

GP-Policy settings

The specific option settings for the GP-Policy module can be given in the gppolicy section. Here is the list of all options with their possible values (where only a single value is given, it is the default):

kernel = polysort or gausssort 
actionkerneltype = delta or hdc or distributed 
abstractslots = False or True  
unabstractslots = False or True
thetafile = ''              
slotabstractionfile = ''    

This section is relevant only if policytype under policy_DOMAIN is set to gp. kernel and actionkerneltype choose the belief and action kernels respectively. The option abstractslots is set to True if we want to use BCM. If training was performed with BCM and we now train a single domain, unabstractslots should be set to True. thetafile sets a path to a file of belief-kernel hyperparameters. The default way of performing slot abstraction is to use the hard-coded mapping found in policy/slot_abstraction/, but another mapping can be given through slotabstractionfile.

The detailed options for SARSA training are given under the section gpsarsa_DOMAIN:

random = False or True    
scale = 3          
saveasprior = False or True   
numprior = 0    
gamma = 1.0         
sigma = 5.0     
nu = 0.001     

random specifies whether actions are chosen randomly. The scale parameter controls how much exploration we perform, since we sample from a Gaussian with the estimated mean and a standard deviation multiplied by scale. For learning, scale is set above 1 to encourage exploration; for testing it is usually set to 1. saveasprior and numprior control saving and the number of prior means. gamma is a discount factor; it is usually set to $1$ as we perform episodic tasks. nu is the dictionary sparsification threshold and sigma is the residual noise.
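The exploration scheme described here can be sketched as follows (an illustrative sketch, not the GPLib implementation): sample a Q-value per action from a Gaussian with the estimated mean and a scale-inflated standard deviation, then act greedily on the samples.

```python
import numpy as np

def sample_action(q_means, q_stds, scale, rng):
    """Sample Q per action from N(mean, (scale * std)^2); pick the argmax."""
    samples = rng.normal(q_means, np.asarray(q_stds) * scale)
    return int(np.argmax(samples))

rng = np.random.default_rng(0)
# With scale = 0 the sampling collapses to the means, i.e. the greedy choice;
# larger scale makes actions with uncertain Q estimates more likely to win.
a = sample_action([0.2, 0.9, 0.4], [0.1, 0.1, 0.1], scale=0.0, rng=rng)
```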

Committee settings

The specific option settings for a BCM module can be given in policycommittee section. Here is the list of all options with possible values:

bcm = False or True           
pctype = hdc  or configset      
configsetcommittee = list of domains to be in committee if using above configset
learningmethod = singleagent or multiagent      

BCM is used if bcm = True. Note that in that case you also need to set abstractslots to True under [gppolicy_DOMAIN]. pctype chooses whether to use a hand-crafted policy (hdc) or the policy specified by configset; the latter requires configsetcommittee. learningmethod specifies whether we learn an agent in only one domain. The default is single-agent learning, whereby only the domain where actions are actually being taken learns; with multi-agent learning, all members of the committee learn.
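An illustrative committee configuration (the domain names are examples) could look like:

```ini
[policycommittee]
bcm = True
pctype = configset
configsetcommittee = CamRestaurants,SFRestaurants
learningmethod = singleagent
```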