Possible PhD Topics in ILCC

This page lists possible PhD topics suggested by members of staff in ILCC. These topics are meant to give PhD applicants an idea of the scope of the work in the Institute.

As part of a PhD application, applicants have to submit a PhD proposal, which can be based on one of the topics on this page (but the proposal needs to be much more detailed). Of course applicants can also suggest their own topic. In both cases, they should contact the potential supervisor before submitting an application.

Note that there is no specific funding attached to any of the suggested topics; however, the School of Informatics offers a range of scholarships for PhD students.

A number of grant-funded PhD studentships are also available.


Language Processing Topics

Concurrency in (Computational) Linguistics

Improving understanding of synchronic and diachronic aspects of phonology.

Supervisor: Julian Bradfield

In several aspects of linguistic analysis, it is natural to think of some form of concurrent processing, not least because the brain is a massively concurrent system. This is particularly true in phonology and phonetics, and descriptions such as feature analyses, and especially autosegmental phonology, go some way to recognizing this. Although there has been some work on rigorous formal models of such descriptions, there has been little if any application of the extensive body of research in theoretical computer science on concurrent processes. Such a project has the potential to give better linguistic understanding of synchronic and diachronic aspects of phonology and perhaps syntax, and even to improve speech generation and recognition, by adding formal underpinning and improvement to the existing agent-based approaches.

Spectral Learning for Natural Language Processing

Supervisors: Shay Cohen, Mirella Lapata

Latent variable modeling is a common technique for improving the expressive power of natural language processing models. The values of these latent variables are missing in the data, but we are still required to predict these values and estimate the model parameters while assuming these variables exist in the model. This project seeks to improve the expressive power of NLP models at various levels (such as morphological, syntactic and semantic) using latent variable modeling, and also to identify key techniques, based on spectral algorithms, in order to learn these models. The family of latent-variable spectral learning algorithms is a recent exciting development that was seeded in the machine learning community. It presents a principled, well-motivated approach for estimating the parameters of latent variable models using tools from linear algebra.
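
As a flavour of the linear algebra involved, here is a minimal, illustrative Python sketch (the toy corpus and the choice of two latent states are assumptions, and real spectral algorithms use richer moment matrices): an SVD of an observable co-occurrence matrix recovers a subspace associated with the latent states.

    # Minimal sketch of the core spectral-learning primitive: SVD of an
    # observable co-occurrence matrix. Illustrative only; full algorithms
    # (e.g. for latent-variable PCFGs or HMMs) involve further moment
    # matrices and careful normalisation.
    import numpy as np

    corpus = ["the dog barks", "the cat meows", "a dog barks", "a cat meows"]
    vocab = sorted({w for s in corpus for w in s.split()})
    idx = {w: i for i, w in enumerate(vocab)}

    # Empirical co-occurrence matrix of adjacent word pairs: an observable
    # second moment whose low-rank structure reflects the latent states.
    C = np.zeros((len(vocab), len(vocab)))
    for s in corpus:
        ws = s.split()
        for w1, w2 in zip(ws, ws[1:]):
            C[idx[w1], idx[w2]] += 1.0
    C /= C.sum()

    k = 2  # assumed number of latent states
    U, S, Vt = np.linalg.svd(C)
    U_k = U[:, :k]  # basis for the latent-state subspace

    # Words with similar latent behaviour ('dog'/'cat', 'barks'/'meows')
    # receive similar coordinates in the recovered subspace.
    for w in vocab:
        print(w, np.round(U_k[idx[w]], 3))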

Natural Language Semantics and Question Answering

Supervisor: Shay Cohen

How can we make computers understand language? This is a question at the core of the area of semantics in natural language processing. Question answering, an NLP application in which a computer is expected to respond to natural language questions, provides a lens to look into this challenge.

Most modern search engines offer some level of functionality for factoid question answering. These systems have high precision, but their recall could be significantly improved. From a technical perspective, question answering offers a sweet spot between challenging semantics problems that are not expected to be solved in the near future and problems that will be solved in the foreseeable future. As such, it is an excellent test-bed for semantic representation theories and for other attempts at describing the meaning of text. The most recent development in question answering is the retrieval of answers from open knowledge bases such as Freebase (a factoid database of various facts without a specific domain tying them all together).

The goal of this project is to explore various methods to improve semantic representations in language, with open question answering being potentially an important application for testing them. These semantic representations can either be symbolic (enhanced with a probabilistic interpretation) or they can be projections in a continuous geometric space. Both of these ideas have been recently explored in the literature.

Analyzing lexical networks

Supervisor: Sharon Goldwater

Psycholinguists often model the phonological structure of the mental lexicon as a network or graph, with nodes representing words and edges connecting words that differ by a single phoneme. Statistics gathered from this graph, such as the number of neighbors a word has, have been shown to predict behavioral measures, such as the word's recognition time. Researchers have also suggested that lexical networks across different languages share certain important properties, such as a short average path length between any two nodes. However, many important questions remain regarding the similarities and differences between lexical networks across languages, and even between the networks of children and adults in the same language community. Answering these questions could help us understand the language-universal cognitive and linguistic pressures that determine the structure of lexicons and the path of word learning in children. PhD projects in this general area will require developing new methodologies for comparing and analyzing graph structures and applying these methodologies to lexical networks from different languages. We aim to relate properties of the lexical networks to aspects of either adult language processing or child language acquisition. Possible projects could (but need not) include behavioural experiments.
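
As a concrete starting point, the following sketch builds such a network with the networkx library (an assumed tool choice; the toy lexicon uses orthographic forms as stand-ins for phoneme strings) and computes two of the statistics mentioned above.

    # Minimal sketch of a phonological lexical network: edges connect
    # words at edit distance 1 (one substitution, insertion, or deletion),
    # following the standard neighbourhood definition.
    import networkx as nx

    def edit_distance(a, b):
        d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
             for i in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1,
                              d[i-1][j-1] + (a[i-1] != b[j-1]))
        return d[len(a)][len(b)]

    lexicon = ["cat", "bat", "hat", "cart", "cast", "dog", "dot"]
    G = nx.Graph()
    G.add_nodes_from(lexicon)
    G.add_edges_from((w1, w2) for i, w1 in enumerate(lexicon)
                     for w2 in lexicon[i+1:] if edit_distance(w1, w2) == 1)

    print(dict(G.degree()))  # neighbourhood density per word
    # average path length on the largest connected component
    giant = G.subgraph(max(nx.connected_components(G), key=len))
    print(nx.average_shortest_path_length(giant))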

Joint learning of morphology and syntax

Supervisor: Sharon Goldwater

Develop a system for joint unsupervised learning of morphology and syntax.

Morphological analysis (segmentation) is an important part of NLP and speech technology in many morphologically rich languages. Interestingly, unsupervised morphological analysis systems sometimes yield better results for downstream applications than supervised systems do. However, there is still room for improvement. This project aims to improve unsupervised morphological analysis by incorporating some form of syntactic information (e.g., POS tags or dependencies) into a joint learning model. Evaluation will be against a gold standard, but importantly also through incorporating the morphological analysis into a downstream application such as machine translation or speech recognition. One goal is to determine what properties of a morphological segmentation are useful for that application.

Uniform Information Density and multimodal interaction

Supervisor: Sharon Goldwater

Explore whether and how the Uniform Information Density hypothesis applies to gestural aspects of communication.

The Uniform Information Density  (UID) hypothesis states that speakers attempt to communicate roughly equivalent amounts of information per unit time.  UID predicts, for example, that if a word is more predictable based on context then the speaker will pronounce the word less clearly or with shorter duration, as this loss in information balances out the gain due to context.  This and other predictions of UID have been shown to hold for speech, but in face-to-face communication, multiple channels are involved.  In particular, information is communicated through gesture as well as speech, so the predictions of UID should carry over into the domain of gesture.  This project aims to clarify and test these predictions.  Key challenges will be in identifying and/or collecting an appropriate corpus, and in defining a measure of reduction for gesture (analogous to reduction in word pronunciation).
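
To make the hypothesis concrete, here is a minimal sketch of how per-word information content can be measured; the add-one-smoothed bigram model and toy corpus are illustrative assumptions, not a proposed methodology.

    # Minimal sketch: per-word surprisal under a toy bigram model.
    import math
    from collections import Counter

    corpus = [["the", "dog", "barks"], ["the", "dog", "sleeps"],
              ["a", "cat", "sleeps"], ["the", "cat", "barks"]]
    vocab = {w for s in corpus for w in s}
    unigrams = Counter(w for s in corpus for w in s)
    bigrams = Counter((w1, w2) for s in corpus for w1, w2 in zip(s, s[1:]))

    def surprisal(w1, w2):
        # -log2 P(w2 | w1) with add-one smoothing
        p = (bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(vocab))
        return -math.log2(p)

    sent = ["the", "dog", "barks"]
    print([surprisal(w1, w2) for w1, w2 in zip(sent, sent[1:])])

UID predicts comparatively low variance in these per-word values across an utterance; a measure of gestural reduction would play the role that duration and pronunciation clarity play in speech.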

Directly Learning Phrases in Machine Translation

Supervisor: Kenneth Heafield

Machine translation systems memorize phrasal translations so that they can translate chunks of text at a time.  For example, the system memorizes that the phrase "Chambre des représentants" translates as "House of Representatives" with some probability.  These phrases are currently identified using heuristics on top of word translations. However, word-level heuristics inadequately capture non-compositional phrases like "hot dog".
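
For concreteness, the following is a minimal sketch of the kind of consistency heuristic this project aims to replace: a source/target span pair is extracted as a phrase pair if no alignment link crosses its boundary. The toy alignment is hand-specified, and real extractors also handle unaligned words.

    # Minimal sketch of consistency-based phrase extraction from a word
    # alignment (the baseline heuristic, not the proposed method).
    def extract_phrases(src, tgt, alignment, max_len=4):
        pairs = []
        for i1 in range(len(src)):
            for i2 in range(i1, min(i1 + max_len, len(src))):
                # target positions linked to the source span [i1, i2]
                js = [j for (i, j) in alignment if i1 <= i <= i2]
                if not js:
                    continue
                j1, j2 = min(js), max(js)
                # consistency check: no alignment link may connect a word
                # inside the target span to a word outside the source span
                if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                    pairs.append((" ".join(src[i1:i2 + 1]),
                                  " ".join(tgt[j1:j2 + 1])))
        return pairs

    src = "Chambre des représentants".split()
    tgt = "House of Representatives".split()
    alignment = {(0, 0), (1, 1), (2, 2)}  # toy hand-specified alignment
    print(extract_phrases(src, tgt, alignment))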

This project will look at ways to replace word-level heuristics by learning phrases directly from translated text.  Prior work (DeNero et al, 2006) has sought to segment translated sentences into translated phrases, but failed to address overlapping phrases.  Potential topics include models for overlapping phrases, the large number of parameters that results from considering all pairings of phrases, discriminatively optimizing phrases directly for translation quality, and modeling compositionality.

Tera-scale Language Models

Supervisor: Kenneth Heafield

Given a copy of the web, how well can we predict the next word you will type or figure out what sentence you said?  We have built language models on trillions of words of text and seen that they do improve quality.  Is this the best model?  Can we use 10x more data?  If we need to query the models on machines with fewer terabytes of RAM, where should we approximate or put the remaining data?  This project is about both the systems aspect of dealing with large amounts of data and the modeling questions of quality and approximation.  By jointly looking at systems and modeling we hope to arrive at the best language models of all sizes, rather than limiting the data we use.
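
As a small illustration of the approximation trade-offs involved, this sketch quantises n-gram log-probabilities into an 8-bit codebook; the value range and toy entries are assumptions.

    # Minimal sketch of one approximation used at scale: quantising
    # log-probabilities so each n-gram's value fits in one byte.
    logprobs = {"the dog": -1.2, "dog barks": -2.7, "the cat": -1.5}

    lo, hi, bins = -10.0, 0.0, 256  # assumed value range, 8-bit codebook

    def quantise(lp):
        return round((lp - lo) / (hi - lo) * (bins - 1))

    def dequantise(q):
        return lo + q / (bins - 1) * (hi - lo)

    table = {ngram: quantise(lp) for ngram, lp in logprobs.items()}
    for ngram, q in table.items():
        # one byte per value, at a cost of ~0.02 max error here
        print(ngram, q, round(dequantise(q), 3))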

Optimizing Structured Objectives

Supervisor: Kenneth Heafield

Natural language systems often boil down to search problems: find high-scoring options in a structured but exponential space.  Our ability to efficiently search these spaces limits what types of features can be used and, ultimately, the quality of the system.  This project is about growing the class of efficient features by making better search algorithms.  We would like to support long-distance continuous features like neural networks and discrete features such as parsing.  While most people treat features as a black box that returns final scores, we will open the box and improve search in ways that exploit the internal structure of natural language features.  Applications include machine translation, speech recognition, and parsing.

Multi-task Neural Networks

Supervisor: Kenneth Heafield

Neural networks are advertised as a way to directly optimize an objective function instead of doing feature engineering.  However, current systems feature separately-trained word embeddings, language models, and translation models.  Part of the problem is that there are different data sets for each task.  This work will look at multi-task learning as a way to incorporate multiple data sets.  Possible uses include direct speech-to-speech translation or joint language and translation modeling.

An Integrated Model of Human Syntactic and Semantic Processing

Supervisor: Frank Keller

Develop a model of human language processing that combines incremental syntactic parsing with word embeddings using deep learning.

Over the past years, successful broad-coverage models of human syntactic processing have been proposed, e.g., Hale's Surprisal model. At the same time, human semantic intuitions have been captured successfully using semantic space models, and more recently, word embeddings. These two lines of work have been pursued largely independently, even though experimental results clearly show that humans combine syntactic and semantic information during sentence comprehension. The aim of this project is to build an integrated model of syntactic and semantic processing that combines Surprisal-based incremental parsing with word embeddings. Deep learning approaches could be used to learn joint representations of syntactic and semantic knowledge, which could then be integrated into an incremental parsing architecture such as PLTAG (psycholinguistically motivated tree-adjoining grammar). The resulting model can be evaluated against a wealth of data, including eye-tracking corpora and data from psycholinguistic experiments.
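
The following toy sketch illustrates the kind of combination the project envisages, with bigram surprisal standing in for an incremental parser and a cosine distance over toy embeddings standing in for the semantic component; the vectors, scores and linear interpolation are all illustrative assumptions rather than a proposed model.

    # Minimal sketch: combine a syntactic (surprisal) signal with a
    # semantic (embedding-distance) signal into one difficulty score.
    import numpy as np

    emb = {"the": np.array([0.1, 0.0]), "dog": np.array([0.9, 0.2]),
           "barks": np.array([0.8, 0.4])}  # toy 2-d "embeddings"
    toy_surprisal = {("the", "dog"): 1.0, ("dog", "barks"): 2.5}

    def cosine_distance(u, v):
        return 1 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

    def difficulty(sent, alpha=0.5):
        scores = []
        for i in range(1, len(sent)):
            syn = toy_surprisal[(sent[i - 1], sent[i])]  # syntactic signal
            ctx = np.mean([emb[w] for w in sent[:i]], axis=0)
            sem = cosine_distance(ctx, emb[sent[i]])     # semantic signal
            scores.append(alpha * syn + (1 - alpha) * sem)
        return scores

    print(difficulty(["the", "dog", "barks"]))  # per-word difficulty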

Weakly Supervised Learning with Eye-tracking Data

Supervisor: Frank Keller

Develop models that use eye-tracking data as a weak supervisory signal for NLP tasks such as part-of-speech tagging and parsing.

Recent work in natural language processing has exploited weakly supervised and unsupervised techniques for core NLP tasks such as part-of-speech tagging or parsing. This makes it possible to build models for languages and domains for which little annotated data exists. At the same time, the use of behavioral data has shown promise in computer vision as a form of weak supervision: when humans view images, they tend to fixate on objects (rather than background). Data from an eye-tracker (which records fixation positions and durations during image viewing) can therefore be used to build object detectors that are trained from otherwise unannotated images.

The aim of this project is to exploit this insight for NLP. Eye-tracking data for reading shows that people make shorter fixations on function words than on content words, that function words are more likely to be skipped, and that fixation durations increase at phrase boundaries. Eye-tracking data therefore contains information that is potentially valuable for unsupervised part-of-speech induction or parsing. This project will start from existing unsupervised PoS tagging or parsing models (Bayesian models or models using word embeddings and deep learning) and develop them to use eye-tracking data as a form of weak supervision. Eye-tracking corpora for several languages are available for training, and additional data could be collected easily if required.

Structured Representations of Images and Text

Supervisor: Frank Keller

Develop models for inferring structured representations for images and exploiting them for multimodal tasks such as image description and image retrieval.

The web is awash with image data: on Facebook alone, 350 million new images are uploaded every day. These images typically co-occur with textual data such as comments, captions, or tags. Well-established techniques exist for extracting structure from text, but image structure is largely an unexplored area. Prior work has shown that images can be represented as graphs expressing the relations between the objects in the image. One example is Visual Dependency Representations (VDRs), which can be aligned with textual data and used for image description.

The aim of this project is to explore the use of structured image representations such as VDRs. Ideas for topics include: (1) Developing new ways of inferring structured representations from images with various forms of prior annotation (e.g., using deep learning techniques). (2) Augmenting existing structured representations to be more expressive (e.g., by representing actions, attributes, background, context). (3) Exploring new models that use VDRs for tasks that require multimodal representations, e.g., image description, image retrieval, story illustration, or action recognition.

Maintaining Negative Polarity in Statistical Machine Translation

Supervisor: Bonnie Webber

Negative assertions, negative commands, and many negative questions all convey the opposite of their corresponding positives. Statistical machine translation (SMT), for all its other successes, cannot be trusted to get this right: it may incorrectly render a negative clause in the source language as positive in the target; render negation of an embedded source clause as negation of the target matrix clause; render negative quantification in the source text as positive quantification in the target; or render negative quantification in the source text as verbal negation in the target, thereby significantly changing the meaning conveyed.

The goal of this research is a robust, language-independent method for improving accuracy in the translation of negative sentences in SMT. To assess negation-specific improvements, a bespoke evaluation metric must also be developed, to complement the standard SMT BLEU score.

Using discourse relations to inform sentence-level Statistical MT

Supervisor: Bonnie Webber

Gold standard annotation of the Penn Discourse TreeBank has enabled the development of methods for disambiguating the intended sense of an ambiguous discourse connective such as since or while, as well as for suggesting discourse relation(s) likely to hold between adjacent sentences that are not marked with a discourse connective.

Since a discourse connective and its two arguments can be viewed in terms of constraints that hold pairwise between the connective and each argument, or between the arguments themselves, we should be able to use these constraints in Statistical MT, either in decoding or in re-ranking, preferring translations that are compatible with the constraints. One might start this work either by looking at rather abstract, high-frequency discourse relations such as contrast, which have rather weak pairwise constraints, or by looking at rather specific, low-frequency relations such as chosen alternative, which have very strong constraints between the arguments.

Entity-coherence and Statistical MT

Supervisor: Bonnie Webber 

Entity-based coherence has been used in both Natural Language Generation (Barzilay and Lapata, 2008; Elsner and Charniak, 2011) and essay scoring (Miltsakaki and Kukich, 2004), reflecting the observation that texts displaying the entity-coherence patterns of well-written texts from a given genre are judged better written than texts that do not. Recently, Guinaudeau and Strube (2013) have shown that matrix operations can be used to compute entity-based coherence very efficiently.
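
In this spirit, a minimal sketch: represent a text as a binary sentence-by-entity matrix, whose sentence projection counts entities shared between sentence pairs. The entity list and the scoring rule here are simplified assumptions, not the published method in full.

    # Minimal sketch of matrix-based entity coherence: the projection
    # M M^T links sentences by the entities they share.
    import numpy as np

    entities = ["Obama", "speech", "Congress"]
    # rows = sentences, columns = entities; 1 if the entity is mentioned
    M = np.array([[1, 1, 0],
                  [1, 0, 1],
                  [0, 0, 1]])

    W = M @ M.T               # sentences linked by shared entities
    np.fill_diagonal(W, 0)
    # one simple coherence score: average strength of links between
    # adjacent sentences (higher = more entity-coherent ordering)
    adjacent = [W[i, i + 1] for i in range(len(W) - 1)]
    print(sum(adjacent) / len(adjacent))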

This project aims to apply the same insights to Statistical MT, assessing whether sentence-level SMT can be improved by promoting translations that better match natural patterns of entity-based coherence, either when re-ranking candidate translations or when producing translations in the first place.

Improving the translation of Modality in SMT

Supervisor: Bonnie Webber   

Modal utterances are common in both argumentative (rhetorical) text and instructions. This project considers the translation of such texts (for example, TED talks as representative of argumentative text) and whether the translation of modality can be improved by considering discourse-level features of such texts. For example, there may be useful constraints between adjacent sentences or clauses, such that the appearance of a modal marker in one increases or decreases the likelihood of some kind of modal marker in the other.

Modelling Non-Cooperative Conversation

Supervisor: Alex Lascarides

Develop and implement a model of conversation that can handle cases where the agents' goals conflict.

Work on adversarial strategies in game theory and signalling theory lacks sophisticated models of linguistic meaning. Conversely, current models of natural language discourse typically lack models of human action and decision making that deal with situations where the agents' goals conflict. The aim of this project is to fill this gap and in doing so provide a model of implicature in non-cooperative contexts.

This project involves analysing a corpus of human dialogues of users playing the game Settlers of Catan: a well-known adversarial negotiating game.  This will be used to leverage extensions to an existing state of the art dynamic semantic model of dialogue content with a logically precise model of the agents' mental states and strategies.  The project will also involve implementing these ideas into a working dialogue system that extends an existing open source agent that plays Settlers, but that has no linguistic capabilities.

Interpreting Hand Gestures in Face to Face Conversation

Supervisor: Alex Lascarides

Map hand shapes and movements into a representation of their form and meaning.

The technology for mapping an acoustic signal into a sequence of words and for estimating the position of pitch accents is very well established. But estimating which hand movements are communicative and which are not, and which part of a communicative hand movement is the stroke or post-stroke hold (i.e., the part of the move that conveys meaning), is much less well understood. Furthermore, to build a semantic representation of the multimodal action, one must, for depicting gestures at least (that is, gestures whose form resembles their meaning), capture qualitative properties of the gesture's shape, position and movement (e.g., that the trajectory of the hand was a circle, or a straight line moving vertically upwards). Deictic gestures, on the other hand, must be represented using quantitative values in 4D Euclidean space. Mapping hand movement to these symbolic and quantitative representations of form is also an unsolved problem.

The aim of this project is to create and exploit a corpus to learn mappings from communicative multimodal signals to the representation of their form, as required by an existing online grammar of multimodal action, which in turn is designed to yield (underspecified) representations of the meaning of the multimodal action. We plan to use state-of-the-art models of visual processing using Kinect cameras to estimate hand positions and hand shapes, and to design Hidden Markov Models that exploit the visual signal, language models and gesture models to estimate the qualitative (and quantitative) properties of the gesture.

The Content of Multimodal Interaction

Supervisor: Alex Lascarides

To design, implement and evaluate a semantic model of conversation that takes place in a dynamic environment.

It is widely attested in descriptive linguistics that non-linguistic events dramatically affect the interpretation of linguistic moves and, conversely, that linguistic moves affect how people perceive or conceptualise their environment. For instance, suppose I look upset and so you ask me "What's wrong?" I look over my shoulder towards a scribble on the living room wall, and then utter "Charlotte's been sent to her room". An adequate interpretation of my response can be paraphrased as: Charlotte has drawn on the wall, and as a consequence she has been sent to her room. In other words, you need to conceptualise the scribble on the wall as the result of Charlotte's actions; moreover, this non-linguistic event, with this description, is a part of my response to your question. Traditional semantic models of dialogue don't allow for this type of interaction between linguistic and non-linguistic contexts. The aim of this project is to fix this, by extending and refining an existing formal model of discourse structure to support the semantic role of non-linguistic events in context in the messages that speakers convey. The project will draw on data from an existing corpus of people playing Settlers of Catan, where there are many examples of complex semantic relationships among the players' utterances and the non-linguistic moves in the board game. The project involves formally defining a model of discourse structure that supports the interpretation of these multimodal moves, and developing a discourse parser through machine learning on the Settlers corpus.

Incremental Interpretation for robust NLP using CCG and Dependency Parsing

Supervisor: Mark Steedman

Combinatory Categorial Grammar (CCG) is a computational grammar formalism that has recently been used widely in NLP applications including wide-coverage parsing, generation, and semantic parser induction.  The present project seeks to apply insights from these and other sources including dependency parsing to the problem of incremental word-by-word parsing and interpretation using statistical models.  Possible evaluation tasks include language modeling for automatic speech recognition, as well as standard parsing benchmarks.

Statistical NLP for Programming Languages

Supervisor: Charles Sutton

Find syntactic patterns in corpora of programming language text.

The goal of this project is to apply advanced statistical techniques from natural language processing to a completely different and new textual domain: programming language text. Think about how you program when you are using a new library or new environment for the first time. You "program by search engine", i.e., you search for examples of people who have used the same library, and you copy chunks of code from them. I want to systematize this process and apply it at a large scale. We have collected a corpus of 1.5 billion lines of source code from 8000 software projects, and we want to find syntactic patterns that recur across projects. These can then be presented to a programmer as she is writing code, providing an autocomplete functionality that can suggest entire function bodies. Statistical techniques involved include language modeling, data mining, and Bayesian nonparametrics. This also raises some deep and interesting questions in software engineering, e.g.: why do syntactic patterns occur in professionally written software when they could be refactored away?
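
As a tiny illustration of the underlying idea (cf. n-gram models of code "naturalness"), this sketch trains a token-level bigram model on two toy lines of code and uses it to suggest continuations; the corpus and whitespace tokenisation are illustrative assumptions.

    # Minimal sketch: a bigram model over code tokens as the basis for
    # statistical autocomplete suggestions.
    from collections import Counter, defaultdict

    code_corpus = [
        "for ( int i = 0 ; i < n ; i ++ )",
        "for ( int j = 0 ; j < m ; j ++ )",
    ]
    nexts = defaultdict(Counter)
    for line in code_corpus:
        toks = line.split()
        for t1, t2 in zip(toks, toks[1:]):
            nexts[t1][t2] += 1

    # autocomplete: most probable continuations after an opening '('
    print(nexts["("].most_common(2))  # e.g. [('int', 2)]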

Discovering hidden causes

Supervisor: Chris Lucas

In order to explain, predict, and influence events, human learners must infer the presence of causes that cannot be directly observed. For example, we understand the behavior of other people by appealing to mental states, we can infer that a new disease is spreading when we see several individuals with novel symptoms, and we can speculate about why a device or computer program is prone to crashing. The aim of this project is to better understand how humans search for and discover hidden causes, using Bayesian models and behavioural experiments.

Approximate inference in human causal learning

Supervisor: Chris Lucas

A fundamental problem in causal learning is understanding what relationships hold among a large set of variables, and in general this problem is intractable. Nonetheless, humans are able to learn efficiently about the causal structure of the world around them, often making the same inferences that one would expect of an ideal or rational learner. How we achieve this performance is not yet well understood: we rely on approximate inferences that deviate in certain systematic ways from what an ideal observer would do, but those deviations are still being catalogued and there are few detailed hypotheses about the underlying processes. This project is concerned with exploring these processes and developing models that reproduce human performance, including error patterns, in complex causal learning problems, with the aim of understanding and correcting for human errors and building systems that are of practical use.

Multilingual semantics

Supervisor: Adam Lopez

Natural language processing has been enormously successful, but NLP systems still often fail to preserve the semantics of sentences—the "who did what to whom" relationships that they express. As a result, they fail to correctly understand, translate, extract, or generate the meaning of a wide variety of language phenomena in many languages. To preserve semantics, they must model semantics. Computational linguists have developed formal, expressive mathematical models of language that exhibit high empirical coverage of semantically annotated linguistic data, correctly predict a variety of important linguistic phenomena in many languages, and can be processed with highly efficient algorithms. However, these models are not completely understood, and they are untested as the basis of statistical NLP models. The goal of projects in this area is to develop the mathematics of semantic models and to apply them to basic problems in natural language understanding and generation.

Graph grammars, automata, and transducers for statistical semantics

Supervisor: Adam Lopez

 We now have large quantities of sentences annotated with their semantics, in the form of directed graphs expressing "who did what to whom" relationships. This creates the possibility to learn large-scale statistical models of semantic parsing or text generation. From a machine learning perspective, this is a structured prediction problem: the input and outputs are strings, trees, or graphs. The formal machinery of structured prediction models for strings and trees is often based on weighted automata and transducers: extensions of classical automata and transducers with real-valued weights representing probabilities. To extend these models to graph-based representations, we require automata and transducers for graphs, but these types of automata are much less well-developed than automata on strings and trees. The goal of projects in this area is to extend the mathematics of weighted graph automata and transducers and to apply them to real data.
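
For strings, the weighted-automaton machinery the project would generalise looks roughly like the following minimal sketch: one real-valued matrix per symbol, with a string's weight given by a matrix product sandwiched between start and end vectors. The weights here are arbitrary toy values.

    # Minimal sketch of a weighted string automaton: string weight =
    # start vector, times one transition matrix per symbol, times end
    # vector. Graph automata generalise this beyond linear orders.
    import numpy as np

    start = np.array([1.0, 0.0])  # start weights over two states
    end = np.array([0.0, 1.0])    # accepting weights
    A = {"a": np.array([[0.5, 0.5],
                        [0.0, 1.0]]),
         "b": np.array([[0.9, 0.1],
                        [0.2, 0.8]])}  # one weight matrix per symbol

    def weight(string):
        v = start
        for sym in string:
            v = v @ A[sym]  # advance the automaton one symbol
        return v @ end

    print(weight("ab"), weight("ba"))  # 0.45 vs 0.55: order matters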

Massively parallel algorithms for language

Supervisor: Adam Lopez

Many classical algorithms and models in NLP assume sequential computation, an assumption broken by the trend of placing many cores on the same die. There are now multiple solutions to utilizing the transistor budget: a small number of complex cores, or a large number of simple cores. The latter approach, exemplified by graphics processing units (GPUs) with thousands of parallel threads of execution, is especially tantalizing for computation-bound NLP problems. But harnessing the power of such processors requires a complete rethinking of the fundamental algorithms in the field. Projects in this area focus on the design of new algorithms and data structures for difficult problems in NLP—such as language understanding and generation—with the goal of achieving vast improvements in speed, and leading to new applications.

Low-resource language and speech processing

Supervisor: Adam Lopez

The most effective language and speech processing systems are based on statistical models learned from many annotated examples, a classic application of machine learning on input/output pairs. But for many languages and domains we have little data. Even in cases where we do have data, it is typically government or news text; for the vast majority of languages and domains, there is hardly anything. However, in many cases there is side information that we can exploit: dictionaries or other knowledge sources, or text paired with weak signals, such as images, speech, or timestamps. How can we exploit such heterogeneous information in statistical language processing? The goal of projects in this area is to develop statistical models and inference techniques that exploit such data, and apply them to real problems.


Speech Processing Topics

Topics in unsupervised speech processing and/or modelling infant speech perception

Supervisor: Sharon Goldwater

Work in unsupervised (or 'zero-resource') speech processing (see Aren Jansen et al., ICASSP 2013) has begun to investigate methods for extracting repeated units (phones, words) from raw acoustic data as a possible way to index and summarize speech files without the need for transcription.  This could be especially useful in languages where there is little data to develop supervised speech recognition systems.  In addition, it raises the question of whether similar methods could be used to model the way that human infants begin to identify words in the speech stream of their native language.  Unsupervised speech processing is a growing area of research with many interesting open questions, so a number of projects are possible.  Projects could focus mainly on ASR technology or mainly on modeling language acquisition; specific research questions will depend on this choice.  Here are just two possibilities: (1) Unsupervised learners are more sensitive to input representation than supervised learners are, and preliminary work suggests that MFCCs are not necessarily the best option.  Investigate how to learn better input representations (e.g., using neural networks) that are robust to speaker differences but encode linguistically meaningful differences.  (2) Existing work in both speech processing and cognitive modeling suggests that trying to learn either words or phones alone may be too difficult, and that we in fact need to develop *joint learners* that simultaneously learn at both levels.  Investigate models that can be used to do this and evaluate how joint learning can improve performance.
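
As an illustration of the matching step used in such term-discovery systems, here is a minimal dynamic time warping (DTW) sketch; the one-dimensional toy sequences stand in for real sequences of MFCC frames.

    # Minimal sketch of DTW matching for zero-resource term discovery:
    # a low alignment cost between two stretches of speech suggests a
    # repeated unit (word or phrase).
    import numpy as np

    def dtw(x, y):
        D = np.full((len(x) + 1, len(y) + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, len(x) + 1):
            for j in range(1, len(y) + 1):
                cost = abs(x[i-1] - y[j-1])  # frame-level distance
                D[i, j] = cost + min(D[i-1, j], D[i, j-1], D[i-1, j-1])
        return D[len(x), len(y)]

    utt1 = [0.1, 0.9, 0.8, 0.2]        # frames of one word token
    utt2 = [0.1, 0.1, 0.9, 0.7, 0.2]   # same word, spoken more slowly
    utt3 = [0.9, 0.1, 0.1, 0.9]        # a different word
    print(dtw(utt1, utt2), dtw(utt1, utt3))  # small vs larger cost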

Deep Neural Network-based speech synthesis

Supervisors: Simon King, Steve Renals

DNNs may offer new ways to model the complex relationship between text and speech, compared to HMMs. In particular, they may enable more effective use of the highly-factorial nature of the linguistic representation derived from text, which in the HMM approach is a 'flat' sequence of phonetically- and prosodically-context-dependent labels.

We are interested in any topic concerning DNN-based speech synthesis, for example: novel text processing methods to extract new forms of linguistic representation, how to represent linguistic information at the input to the DNN, how to represent the speech signal (or vocoder parameters) at the output of the DNN, methods for control over the speaker/gender/accent/style of the output speech, and combinations of supervised, semi-supervised and unsupervised learning.

Hidden Markov model-based speech synthesis

Supervisors: Simon King, Junichi Yamagishi

The HMM, which is a statistical model that can be used to both classify and generate speech, offers an exciting alternative to concatenative methods for synthesising speech. As a consequence, most research effort around the world in speech synthesis is now focussed on HMMs because of the flexibility that they offer.

There are a number of topics we are interested in within HMM-based speech synthesis, including: speaker, language and accent adaptation; cross-language speaker adaptation; improving the signal processing and vocoding aspects of the model; unsupervised and semi-supervised learning.

Personification using affective speech synthesis

Supervisors: Simon King, Matthew Aylett

New approaches to capture, share and manipulate information in sectors such as health care and the creative industries require computers to enter the arena of human social interaction. Users readily adopt a social view of computers and previous research has shown how this can be harnessed in applications such as giving health advice, tutoring, or helping children overcome bullying.

However, whilst current speech synthesis technology is highly intelligible, it has not been able to deliver voices which aid this 'personification'. The lack of naturalness makes some synthetic voices sound robotic, while the lack of expressiveness makes others sound dull and lifeless.

In many of the above applications, it is less important to be able to render arbitrary text than it is to convey a sense of personality within a more limited domain. This project would therefore investigate two key problems: (1) merging expressive pre-recorded prompts with expressive unit selection speech synthesis; (2) dynamically altering voicing in speech to convey underlying levels of stress and excitement using source-filter decomposition techniques.

Cross-lingual acoustic models

Supervisors: Simon King, Steve Renals, Junichi Yamagishi

Adapting speech recognition acoustic models from one language to another, with a focus on limited resources and unsupervised training.

Current speech technology is based on machine learning and trainable statistical models.  These approaches are very powerful, but before a system can be developed for a new language considerable resources are required: transcribed speech recordings for acoustic model training; large amounts of text for language model training; and a pronunciation dictionary.  Such resources are available for languages such as English, French, Arabic, and Chinese, but there are many less well-resourced languages.  There is thus a need for models that can be adapted from one language to another with limited effort and resources.  To address this we are interested in two (complementary) approaches.  First, the development of lightly supervised and unsupervised training algorithms: speech recordings are much easier to obtain than transcriptions.  Second, the development of models which can factor language-dependent and language-independent aspects of the speech signal, perhaps exploiting invariances derived from speech production.  We have a particular interest in approaches (1) building on the subspace GMM framework, or (2) using deep neural networks.

Factorised acoustic models

Supervisors: Simon King, Steve Renals, Junichi Yamagishi

Acoustic models which factor specific causes of variability, thus allowing more powerful adaptation for speech recognition, and greater control for speech synthesis.

Adaptation algorithms, such as MLLR, MAP, and VTLN, have been highly effective in acoustic modelling for speech recognition.  However, current approaches only weakly factor the underlying information - for instance "speaker" adaptation will typically adapt for the acoustic environment and the task, as well as for different aspects of the speaker.  It is of great interest to investigate speech recognition models which are able to factor the different sources of variability.  PhD projects in this area will explore the development of factored models that enable specific aspects of a system to be adapted.  For example, it is of great interest - for both speech recognition and speech synthesis - to be able to model accent in a specific way.  We are interested in two modelling approaches which hold great promise for this challenge: subspace Gaussian mixture models, and deep neural networks.

Robust broadcast speech recognition

Supervisor: Steve Renals

Current speech recognition technology has shown great promise in subtitling material such as news, but is brittle when faced with the full range of broadcast genres such as sport, game shows, and drama.  Our industry partners have identified the transcription of noisy, reverberant speech, such as sports commentaries, as a particular challenge.  We are interested in developing speech recognition models that can factorise different components of the audio signal, separating the target speech from interfering acoustic sources (e.g. crowd noise) and echo. References:

P Swietojanski and S Renals (2015). Differentiable pooling for unsupervised speaker adaptation. In Proc IEEE ICASSP-2015.

P Swietojanski and S Renals (2014). Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models. In Proc IEEE SLT-2014.

Distant speech recognition and overlapping speech

Supervisor: Steve Renals

Distant speech recognition, in which speech is captured using one or more distant microphones, is a major challenge in speech recognition.  Specific problems include compensating for reverberation and dealing with multiple acoustic sources (including overlapping talkers). Research in this area will explore deep neural network and recurrent neural network acoustic models to handle reverberation and overlapping talkers, building on our recent work using convolutional neural networks and RNN encoder-decoder approaches. References:

S Renals, T Hain, and H Bourlard (2007). Recognition and interpretation of meetings: The AMI and AMIDA projects. In Proc IEEE ASRU-2007.

S Renals and P Swietojanski (2014). Neural networks for distant speech recognition. In Proc HSCMA-2014.

L Lu, X Zhang, K Cho, and S Renals (2015). A study of the recurrent neural network encoder-decoder for large vocabulary speech recognition. In Proc Interspeech-2015.

Multi-genre broadcast speech recognition

Supervisor: Steve Renals

Broadcast speech has been a target domain for speech recognition since the 1990s; however, most work has focused on specific genres such as news and weather forecasts.  Multi-genre broadcast speech recognition, covering all types of broadcast material (e.g. sport, films, and reality TV), is a significant challenge due to much greater variability in speaking style, music and other sound effects, and overlapping talkers.  In collaboration with the BBC, we have begun a programme of work in this area.  Specific research topics include fundamental work in acoustic models and language models, using broadcast speech recognition as a testbed, and rapid adaptation to changes in acoustic environment, genre/topic, and speaker by exploiting available metadata.  One topic of particular interest is recognition of broadcast speech with additive noise and reverberation (e.g. sports commentary).  Our current approaches to acoustic and language modelling include recurrent and convolutional networks. References:

P Bell, P Swietojanski, and S Renals (2013). Multi-level adaptive networks in tandem and hybrid ASR systems. In Proc IEEE ICASSP-2013.

P Bell and S Renals (2015). Complementary tasks for context-dependent deep neural network acoustic models. In Proc Interspeech-2015.

S Renals et al (2015). The MGB Challenge. Submitted to Proc IEEE ASRU-2015.


Multilingual and cross-lingual speech recognition

Supervisor: Steve Renals

We are interested in the development of new approaches to quickly and cheaply build speech recognition systems for new languages, which may be poorly resourced.  We are concerned in particular with cross-lingual techniques which are able to exploit and transfer information across languages. References:

P Swietojanski, A Ghoshal, and S Renals (2012). Unsupervised cross-lingual knowledge transfer in DNN-based LVCSR. In Proc IEEE SLT-2012.

A Ghoshal, P Swietojanski, and S Renals (2013). Multilingual training of deep neural networks. In Proc IEEE ICASSP-2013.

L Lu, A Ghoshal, and S Renals (2014). Cross-lingual subspace Gaussian mixture models for low-resource speech recognition. IEEE/ACM Transactions on Audio, Speech and Language Processing, 22(1):17–27.

P Bell, J Driesen, and S Renals (2014). Cross-lingual adaptation with multi-task adaptive networks. In Proc Interspeech-2014.

Audio scene understanding

Supervisor: Steve Renals

The problem of audio scene understanding is to annotate an acoustic scene recorded using one or more microphones.  This involves locating the acoustic sources, identifying them, and extracting semantic information from them.  This is a new area for the group and we are interested in exploring recurrent neural networks and attention-based approaches to this task.

Natural Interactive Virtual Agents

Supervisor: Hiroshi Shimodaira

Development of a lifelike animated character that is capable of establishing natural interaction with humans in terms of non-verbal signals.

Embodied conversational agents (ECAs) aim to foster natural communication between machines and humans. State-of-the-art technology in computer graphics has made it possible to create photo-realistic animation of human faces. However, the same cannot yet be said of the interaction between ECAs and humans: interactions of ECAs with humans are not as natural as those between humans. Although there are many reasons for this, the present project focuses on the non-verbal aspects of communication, such as gesture and gaze, and seeks to develop an ECA system that is capable of recognising a user's non-verbal signals and synthesising appropriate signals for the agent.

Gesture Synthesis for Lifelike Conversational Agents

Supervisor: Hiroshi Shimodaira

Development of a mechanism for controlling the gestures of photo-realistic lifelike agents when the agents are in various modes: idling, listening, speaking and singing.

Lifelike conversational agents, which behave like humans with facial animation and gesture and hold spoken conversations with humans, are one of the next-generation human interfaces. Much effort has been devoted to making such agents natural, especially in controlling mouth/lip movement and eye movement. On the other hand, controlling the non-verbal movements of the head, facial expressions, and shoulders has not been studied as much, even though those motions sometimes play a crucial role in naturalness and intelligibility. The purpose of the project is to develop a mechanism for creating such motions in photo-realistic lifelike agents when the agents are in various modes: idling, listening, speaking and singing. One of the outstanding features of the project is that it aims to give the agent a virtual personality by imitating the manner of movements/gestures of an existing person, with the help of machine learning techniques used for text-to-speech synthesis.

Evaluating the impact of expressive speech synthesis on embodied conversational agents

Supervisors: Hiroshi Shimodaira, Matthew Aylett

Evaluation of embodied conversational agents (ECAs) has tended to concentrate on usability: do users like the system, and does the system achieve its objectives? We are not aware of any studies of this type which have controlled for the speech synthesis used, where that synthesis was close to the state of the art. This project will develop evaluation based on measured interaction where expressive speech synthesis is the topic of study. It will explore carefully controlled interactive environments, and measure a subject's performance as well as the physiological effects of the experiment on the subject. It will explore how involved (emotionally or otherwise) the subject is with the ECA. The usability approach described above is also important for these experiments, but we also wish to determine how speech affects involvement over time. For example, we would expect to increase a subject's arousal by adding emotional elements to the speech, and we might expect to destroy a subject's involvement by intentionally producing an error which undermines the ECA's believability.


Action and Decision Making Topics

Adapting Behaviour to the Discovery of Unforeseen Possibilities

Supervisors: Alex Lascarides, Subramanian Ramamoorthy

To design, implement and evaluate a model of agents whose intrinsic preferences change as they learn about unforeseen states and options in decision or game problems that they are engaged in.

Most models of rational action assume that all possible states and actions are pre-defined and that preferences change only when beliefs do. But there are many decision and game problems that lack these features: games where an agent starts playing without knowing the hypothesis space, but rather discovers unforeseen states and options as he plays. For example, an agent may start by preferring meat to fish, but when he discovers saffron for the first time, likes it enormously, and finds it goes better with fish than with meat, his preferences change to preferring fish to meat as long as saffron is available. In effect, an agent may find that the language he can use to describe his decision or game problem changes as he plays it (in this example, state descriptions are refined via the introduction of a new variable, saffron). The aim of this project is to design, implement and evaluate a model of action and decision making that supports reasoning about newly discovered possibilities and options. This involves a symbolic component, which reasons about how a game changes (both beliefs and preferences) as one adds or removes random variables or their range of values, and a probabilistic component, which reasons about how these changes to the description of the game affect Bayesian calculations of optimal behaviour.
