One way of describing human-computer dialogue is to view it as a sequence of dialogue turns in which each turn consists of a system utterance followed by a user utterance. In the traditional approach to building spoken dialogue systems, the system utterances are determined by hand-coded rules and depend on the input that is received from the speech recogniser. We will first look at this rule-based approach and then move on to statistical methods for learning the dialogue policy.
The PyDial library provides an RL environment in which you can easily test your existing policies or train your own system.
A conventional approach to handling human-computer dialogue is to create a set of rules based on the slots defined in the ontology together with fixed thresholds. Illustrative global rules look as follows:
if global_summary['GLOBAL_BYCONSTRAINTS'] > 0.5 and global_summary['GLOBAL_COUNT80'] > 3:
    act = PolicyUtils.getGlobalAction(belief, 'INFORM_BYNAME', domainString=self.domainString)
elif global_summary['GLOBAL_BYALTERNATIVES'] > 0.5:
    act = PolicyUtils.getGlobalAction(belief, 'INFORM_ALTERNATIVES', domainString=self.domainString)
elif global_summary['GLOBAL_BYNAME'] > 0.5:
    act = PolicyUtils.getGlobalAction(belief, 'INFORM_REQUESTED', domainString=self.domainString,
                                      bookinginfo=self.booking_slots_got_value_for)
In order to test this approach you can just run:
pydial chat config/Tut-hdc-CamInfo.cfg
For a real-world dialogue system the number of such rules can be very large, and a designer has to implement each of them manually. Moreover, maintaining such a system is very challenging. Thus, in the next sections we will look at statistical methods for learning the dialogue policy.
The dialogue may be seen as a control problem where, given a distribution over possible belief states, we need to take an action which determines what the system says to the user. We can apply the reinforcement learning framework to this problem, looking for the optimal policy $\pi : \mathcal{B} \times \mathcal{A} \rightarrow [0,1]$ during the dialogues with the user. In the learning procedure we update the $Q$-function, which has the form: $$Q^{\pi}(\mathbf{b}, a) = \text{E}_{\pi} \{ \sum_{k=0}^{T-t} \gamma^k r_{t+k} | b_t = \mathbf{b}, a_t = a \},$$ where $r_t$ is the reward at time $t$ and $\gamma > 0$ is a discount factor. The policy can then be obtained by: $$\pi(\mathbf{b}) = \arg \max_a \{Q(\mathbf{b,a}) : a \in \mathcal{A} \}.$$
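As a toy illustration of the greedy policy above (a sketch only, not PyDial's own implementation), the following snippet picks the action with the highest estimated $Q$-value for a given belief; the belief identifier, action names and $Q$-values are made up:

# Minimal sketch of greedy action selection pi(b) = argmax_a Q(b, a).
# q_function is any callable that scores a (belief, action) pair.
def greedy_action(belief, actions, q_function):
    """Return the action with the highest estimated Q-value for this belief."""
    return max(actions, key=lambda a: q_function(belief, a))

# Toy example with two summary actions and hand-picked Q-values:
toy_q = {('b0', 'inform_byname'): 0.7, ('b0', 'request_area'): 0.4}
best = greedy_action('b0', ['inform_byname', 'request_area'],
                     lambda b, a: toy_q[(b, a)])
print(best)  # -> inform_byname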
We may model the $Q$-function using a non-parametric approach via a Gaussian process with a zero mean and some kernel function $k(\cdot, \cdot)$, i.e.
$$Q(\mathbf{b},a) \sim \mathcal{GP}\left(0, k((\mathbf{b},a),(\mathbf{b},a)) \right).$$
Gaussian processes follow a fully Bayesian framework, which allows us to obtain the posterior after each newly collected pair $(\mathbf{b},a)$. The chosen kernel function $k$ defines the correlation of a new observation with the points collected so far, which substantially speeds up policy learning. Moreover, the non-parametric approach helps prevent over-fitting. This model is combined with a classic RL method, SARSA, for policy improvement.
You can find more information about this method in Gasic and Young, 2013.
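To illustrate the idea of placing a Gaussian process over $Q$-values, here is a small scikit-learn sketch (an illustration only, not the GP-SARSA code in PyDial); the belief-action feature vectors and returns are made up:

# Illustrative sketch: model Q(b, a) as a zero-mean GP over joint
# belief-action feature vectors using scikit-learn (not PyDial's GPLib).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Hypothetical features: a belief vector concatenated with a one-hot action;
# targets are observed returns.
X = np.array([[0.9, 0.1, 1, 0],
              [0.2, 0.8, 0, 1],
              [0.5, 0.5, 1, 0]])
y = np.array([1.0, -1.0, 0.2])

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-3)
gp.fit(X, y)

# Posterior mean and uncertainty for a new (belief, action) pair; GP-SARSA
# exploits this uncertainty for exploration by sampling Q rather than
# always taking the mean.
mean, std = gp.predict(np.array([[0.8, 0.2, 0, 1]]), return_std=True)
print(mean, std)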
In order to test this model you can just run:
pydial train config/Tut-gp-Multidomain.cfg
PyDial also enables building a multi-domain dialogue system using the Bayesian committee machine (BCM) approach, which builds a so-called 'committee' to take advantage of dialogue corpora even when they come from different domains. The committee members are estimators trained on different datasets, and training is analogous to that described in the previous section. At every turn their estimated $Q$-values are combined into a final $Q$-value estimate using the following formula:
$$\overline{Q}(\mathbf{b},a) = \Sigma^Q(\mathbf{b},a)\sum_{i=1}^M \Sigma^Q_i(\mathbf{b},a)^{-1}\overline{Q}_i(\mathbf{b},a),\\ \Sigma^Q(\mathbf{b},a)^{-1} = -(M-1)\, k((\mathbf{b},a),(\mathbf{b},a))^{-1} + \sum_{i=1}^M \Sigma_i^Q (\mathbf{b},a)^{-1}$$
This approach has been shown to be especially beneficial for adaptation in multi-domain dialogue systems. In order to produce a generic policy that works across multiple domains, the belief states and actions are mapped to an abstract representation which is used by all committee members.
You can find more information about this method in Gasic et al., 2016.
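To make the combination rule concrete, here is a small numpy sketch that applies it to toy per-member estimates for a single $(\mathbf{b},a)$ point; all numbers are purely illustrative:

# Sketch of the Bayesian committee combination for one (b, a) point.
# q_means[i] and q_vars[i] are the i-th member's Q estimate and its variance;
# prior_var stands for the prior kernel value k((b,a),(b,a)).
import numpy as np

q_means = np.array([0.6, 0.4, 0.8])   # per-domain Q estimates (toy values)
q_vars = np.array([0.2, 0.5, 0.3])    # per-domain variances (toy values)
prior_var = 1.0                       # k((b,a),(b,a))
M = len(q_means)

# Combined precision: -(M-1) * k^{-1} + sum_i Sigma_i^{-1}
combined_precision = -(M - 1) / prior_var + np.sum(1.0 / q_vars)
combined_var = 1.0 / combined_precision

# Combined mean: Sigma * sum_i Sigma_i^{-1} * Q_i
combined_q = combined_var * np.sum(q_means / q_vars)
print(combined_q, combined_var)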
In order to test this model you can just run:
pydial train config/Tut-bcm-Multidomain.cfg
In PyDial there is a dedicated module for policy learning under the directory policy. Policy, an abstract class that defines the interface for a single-domain policy, is in the Policy.py script. It has all the required functions to build a generic reward model.
The GP-SARSA dialogue policy is implemented in GPPolicy.py, with additional functionality in GPLib.py. It derives from the Policy class.
The hand-crafted policy is implemented in HDCPolicy.py. It can be run with all provided domains.
The Bayesian committee machine model is implemented in PolicyCommittee.py, with the class CommitteeMember providing the interface methods for single domains. The policy for a committee is handled by the PolicyCommittee class.
You can specify the detailed parameters of your policy through the configuration file. The general settings are given under the section `policy_DOMAIN`:
[policy_DOMAIN]
belieftype = baseline or focus
useconfreq = False or True # use confirm request ?
learning = False or True # policy learning?
outpolicyfile = ''
inpolicyfile = ''
policytype = hdc or gp
startwithhello = False or True
You can choose the belief tracker using belieftype. If you use the HDC policy you can choose to confirm requests using the useconfreq option. Setting learning to True will train a model, and you can specify whether to train from scratch or load a pre-existing model by providing a path to inpolicyfile. The policy can be saved to a provided path via outpolicyfile. Finally, startwithhello is a domain-dependent setting; it is overruled and set to True when using the single-domain option.
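For illustration, a single-domain GP policy trained from scratch could be configured along the following lines (the CamRestaurants domain name and the output path are only examples):

[policy_CamRestaurants]
belieftype = focus
useconfreq = False
learning = True
outpolicyfile = policies/CamRestaurantsGP
policytype = gp
startwithhello = False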
The specific option settings for the GP policy module can be given in the gppolicy section. Here is the list of all options with their possible values, with the default value given where only one is listed:
[gppolicy_DOMAIN]
kernel = polysort or gausssort
actionkerneltype = delta or hdc or distributed
abstractslots = False or True
unabstractslots = False or True
thetafile = ''
slotabstractionfile = ''
This section is relevant only if policytype under policy_DOMAIN is set to gp. kernel and actionkerneltype choose the belief and action kernels respectively. The option abstractslots is set to True if we want to use BCM. If training was performed with BCM and we now train a single domain, unabstractslots should be set to True. thetafile sets the path to a file of belief kernel hyperparameters. By default, slot abstraction uses the hard-coded mapping found in policy/slot_abstraction/, but another mapping can be provided through slotabstractionfile.
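As an illustrative sketch (the domain name is only an example), a GP policy section prepared for BCM training could look as follows; since slotabstractionfile is omitted, the default hard-coded mapping would be used:

[gppolicy_CamRestaurants]
kernel = polysort
actionkerneltype = hdc
abstractslots = True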
The detailed options for SARSA training are given under the gpsarsa_DOMAIN section:
[gpsarsa_DOMAIN]
random = False or True
scale = 3
saveasprior = False or True
numprior = 0
gamma = 1.0
sigma = 5.0
nu = 0.001
random specifies whether actions are chosen randomly. The scale parameter controls how much exploration we want to perform, as actions are scored by sampling from Gaussians with the estimated mean and standard deviation multiplied by scale. For learning, scale is usually set above 1 to encourage exploration, while for testing it is usually set to 1. saveasprior and numprior control saving the policy as a prior and the number of prior means. gamma is the discount factor and is usually set to $1$ as we perform episodic tasks. nu is the dictionary sparsification threshold and sigma is the residual noise.
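For instance, a training run that keeps some exploration could use settings along these lines (the values simply mirror the defaults listed above):

[gpsarsa_CamRestaurants]
random = False
scale = 3        # sample with std dev * 3 to encourage exploration during training
gamma = 1.0
nu = 0.001
sigma = 5.0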
The specific option settings for the BCM module can be given in the policycommittee section. Here is the list of all options with their possible values:
[policycommittee]
bcm = False or True
pctype = hdc or configset
configsetcommittee = list of domains to be in committee if using above configset
learningmethod = singleagent or multiagent
BCM is used if bcm = True. Note that in that case you also need to set abstractslots to True under [gppolicy_DOMAIN]. pctype chooses whether to use the hand-crafted policy (hdc) or the policy specified by configset; the latter option requires configsetcommittee. learningmethod specifies whether we learn an agent in only one domain. The default is single-agent learning, whereby only the domain where actions are actually being taken is learning; with multi-agent learning, all policies in the committee will learn.
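For example, a committee built from two domains with single-agent learning might be configured as below; the domain names are illustrative, and the exact list syntax should be checked against config/Tut-bcm-Multidomain.cfg:

[policycommittee]
bcm = True
pctype = configset
configsetcommittee = CamRestaurants,SFRestaurants
learningmethod = singleagent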