
Possible PhD Topics in ILCC

This page lists possible PhD topics suggested by members of staff in ILCC. These topics are meant to give PhD applicants an idea of the scope of the work in the Institute.

As part of a PhD application, applicants have to submit a PhD proposal, which can be based on one of the topics on this page (but the proposal needs to be much more detailed). Of course applicants can also suggest their own topic. In both cases, they should contact the potential supervisor before submitting an application.

Note that there is no specific funding attached to any of the suggested topics; however, the School of Informatics offers a range of scholarships for PhD students.

A number of grant-funded PhD studentships are also available.


Language Processing Topics

Concurrency in (Computational) Linguistics

Improving understanding of synchronic and diachronic aspects of phonology.

Supervisor: Julian Bradfield

In several aspects of linguistic analysis, it is natural to think of some form of concurrent processing, not least because the brain is a massively concurrent system. This is particularly true in phonology and phonetics, and descriptions such as feature analyses, and especially autosegmental phonology, go some way to recognizing this. Although there has been some work on rigorous formal models of such descriptions, there has been little if any application of the extensive body of research in theoretical computer science on concurrent processes. Such a project has the potential to give better linguistic understanding of synchronic and diachronic aspects of phonology and perhaps syntax, and even to improve speech generation and recognition, by adding formal underpinning and improvement to the existing agent-based approaches.

Spectral Learning for Natural Language Processing

Supervisors: Shay Cohen, Mirella Lapata

Latent variable modeling is a common technique for improving the expressive power of natural language processing models. The values of these latent variables are missing in the data, but we are still required to predict these values and estimate the model parameters while assuming these variables exist in the model. This project seeks to improve the expressive power of NLP models at various levels (such as morphological, syntactic and semantic) using latent variable modeling, and also to identify key techniques, based on spectral algorithms, in order to learn these models. The family of latent-variable spectral learning algorithms is a recent exciting development that was seeded in the machine learning community. It presents a principled, well-motivated approach for estimating the parameters of latent variable models using tools from linear algebra.

Analyzing lexical networks

Supervisor: Sharon Goldwater

Psycholinguists often model the phonological structure of the mental lexicon as a network or graph, with nodes representing words and edges connecting words that differ by a single phoneme. Statistics gathered from this graph, such as the number of neighbors a word has, have been shown to predict behavioral measures, such as the word's recognition time. Researchers have also suggested that lexical networks across different languages share certain important properties, such as a short average path length between any two nodes. However, many important questions remain regarding the similarities and differences between lexical networks across languages, and even between the networks of children and adults in the same language community. Answering these questions could help us understand the language-universal cognitive and linguistic pressures that determine the structure of lexicons and the path of word learning in children. PhD projects in this general area will require developing new methodologies for comparing and analyzing graph structures and applying these methodologies to lexical networks from different languages. We aim to relate properties of the lexical networks to aspects of either adult language processing or child language acquisition. Possible projects could (but need not) include behavioural experiments.
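As a toy illustration of the kind of graph analysis such a project would involve, the sketch below builds a lexical network over a tiny invented lexicon (words as phoneme tuples, edges for single-phoneme substitutions; real networks also typically include insertions and deletions) and computes a word's neighbourhood density and a shortest path length:

```python
from collections import defaultdict, deque
from itertools import combinations

def build_network(lexicon):
    """Lexical network: nodes are words (phoneme tuples), edges connect
    words that differ by exactly one phoneme substitution."""
    graph = defaultdict(set)
    for w1, w2 in combinations(lexicon, 2):
        if len(w1) == len(w2) and sum(a != b for a, b in zip(w1, w2)) == 1:
            graph[w1].add(w2)
            graph[w2].add(w1)
    return graph

def path_length(graph, src, dst):
    """Shortest path length between two words, by breadth-first search."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nb in graph[node]:
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, d + 1))
    return None  # no path: the words lie in different components

# Toy lexicon in a made-up phonemic transcription (illustrative only).
lexicon = [("k", "ae", "t"), ("b", "ae", "t"), ("k", "ae", "p"),
           ("k", "ih", "t"), ("d", "aa", "g")]
g = build_network(lexicon)
print(len(g[("k", "ae", "t")]))  # neighbourhood density of "cat": 3
```

Cross-linguistic comparison would then amount to computing such statistics (density distributions, average path lengths) over full lexicons from different languages.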

Joint learning of morphology and syntax

Supervisor: Sharon Goldwater

Develop a system for joint unsupervised learning of morphology and syntax.

Morphological analysis (segmentation) is an important part of NLP and speech technology in many morphologically rich languages. Interestingly, unsupervised morphological analysis systems sometimes yield better results for downstream applications than do supervised systems.  However there is still room for improvement.  This project aims to improve unsupervised morphological analysis through incorporating some form of syntactic information (e.g., POS tags or dependencies) into a joint learning model.  Evaluation will be both against a gold standard but importantly also through incorporating the morphological analysis into a downstream application such as machine translation or speech recognition.  One goal is to determine what properties of a morphological segmentation are useful for that application.

Uniform Information Density and multimodal interaction

Supervisor: Sharon Goldwater

Explore whether and how the Uniform Information Density hypothesis applies to gestural aspects of communication.

The Uniform Information Density  (UID) hypothesis states that speakers attempt to communicate roughly equivalent amounts of information per unit time.  UID predicts, for example, that if a word is more predictable based on context then the speaker will pronounce the word less clearly or with shorter duration, as this loss in information balances out the gain due to context.  This and other predictions of UID have been shown to hold for speech, but in face-to-face communication, multiple channels are involved.  In particular, information is communicated through gesture as well as speech, so the predictions of UID should carry over into the domain of gesture.  This project aims to clarify and test these predictions.  Key challenges will be in identifying and/or collecting an appropriate corpus, and in defining a measure of reduction for gesture (analogous to reduction in word pronunciation).
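The information-theoretic quantity underlying UID is surprisal, the negative log probability of a word given its context. A minimal sketch, using an add-one-smoothed bigram model over a toy corpus (any real study would use a far stronger language model):

```python
import math
from collections import Counter

def bigram_surprisal(corpus):
    """Return a function computing -log2 P(word | prev) from bigram
    counts with add-one smoothing over the observed vocabulary."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    vocab = len(unigrams)
    def surprisal(prev, word):
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)
        return -math.log2(p)
    return surprisal

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["a", "dog", "sat"]]
s = bigram_surprisal(corpus)
# "cat" is a frequent continuation of "the" in this corpus, so its
# surprisal is lower than that of the rarer continuation "dog" after "a";
# UID predicts correspondingly more phonetic reduction for "cat".
print(s("the", "cat"), s("a", "dog"))
```

The project's analogous challenge is defining and measuring "reduction" for gesture, with surprisal estimated from the speech channel.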

An Integrated Model of Human Syntactic and Semantic Processing

Supervisor: Frank Keller

Develop a model of human language processing that combines incremental syntactic parsing with LSA-based semantic construction.

Over the past years, successful broad-coverage models of human syntactic processing have been proposed, e.g., Hale's Surprisal model. At the same time, human semantic intuitions have been captured successfully using semantic space models such as Latent Semantic Analysis. These two lines of work have been pursued largely independently, even though experimental results clearly show that humans combine syntactic and semantic information during sentence comprehension. The aim of this project is to build an integrated model of syntactic and semantic processing that combines Surprisal-based incremental parsing with LSA-based semantic construction. The key challenge is to integrate semantic knowledge into the parser, while also enabling LSA to compute sentence-level, rather than word-level semantic representations. The resulting model can be evaluated against a wealth of data, including eye-tracking corpora, garden-path experiments, and semantic priming data.

Modeling Syntactic Priming with Synchronous Grammars

Supervisor: Frank Keller

Develop a model of how two languages prime each other in bilingual speakers based on synchronous grammars trained on a bilingual corpus.

Syntactic priming refers to speakers' re-use of syntactic structures, i.e., a given syntactic rule tends to occur more often if the same rule has been used recently. This phenomenon is supported by a wealth of experimental, corpus, and modeling evidence. Recently, cross-linguistic priming effects have been attested: bilingual speakers that produce a given structure in one language tend to re-use it even when they switch to the other language.

The aim of this project is to develop a computational model of crosslinguistic priming. This will involve discovering which structures can prime each other by learning a syntactic alignment between two languages from a bilingual corpus. This will result in a synchronous grammar of the two languages, which can form the basis of a model of bilingual language production. The overall model also needs to incorporate an account of code switching, i.e., of the process of switching from one language to the other.

Exploiting Gaze Information for Natural Language Processing

Supervisor: Frank Keller

Explore whether gaze information can be utilized to improve tasks such as co-reference resolution, dialog act recognition, or summarization.

Previous work has shown that gaze (information about where people look) can be useful for NLP tasks, e.g., speech recognition or co-reference resolution. However, these results were obtained in restricted, artificial domains, and it is not clear whether they generalize to naturalistic dialogue. The aim of this project is exploit the gaze annotation in the AMI meeting corpus for NLP tasks such as co-reference resolution, dialog act recognition, or summarization. This may involve an elaboration of the existing coarse gaze annotation, and the utilization of other non-linguistic information in the AMI corpus, e.g., hand gestures and head movements. This work could be complemented by eye-tracking experiments conducted to collect gaze data specifically for NLP tasks.

Exploiting Eye-tracking and Linguistic Data for Training Visual Object Detectors

Supervisors: Frank Keller, Vittorio Ferrari

Develop algorithms that can use eye-tracking data for images and associated text as training data for object detection.

Studies of human visual processing show that humans tend to fixate objects (rather than non-object regions) when viewing an image. Human fixation data can be acquired using an eye-tracker (a device that records the x/y-coordinates of fixations on the screen). Given the object-based nature of human image viewing, it should be possible to use eye-tracking data as a supervisory signal for training automatic object detection algorithms.  Instead, current methods typically require large amounts of annotated training data, i.e., images where the labels and boundaries of the objects have been manually marked by humans, an expensive and cumbersome process.

The first aim of this project is to investigate whether eye-tracking data can be used to infer object boundaries, bypassing the need for manual annotation. Secondly, it may also be possible to infer object labels based on eye-tracking data of images with associated text, as humans are likely to fixate the relevant object when reading text describing it. The project will involve collecting relevant eye-tracking data, as well as developing schemes to make use of this data for weakly supervised learning of object detectors.

Gradability and Granularity

Supervisor: Ewan Klein

Develop a semantics for gradable predicates which systematically takes into account levels of granularity.

When we use propositions involving spatial proximity, we have to choose an appropriate level of granularity. For example, we evaluate the truth of "X is near to Edinburgh" at a granularity of miles, but "X is near my foot" at a granularity of inches. What kind of knowledge and reasoning does an agent require in order to choose the level of granularity when using gradable predicates such as "near" in different contexts?

Although there are formal theories of granularity, developed in various approaches to representing and reasoning with spatial and temporal categories, it is still an open question how these should be related to natural language semantics. From a cognitive perspective, it has been suggested that granularity is related to the 'approximate number system' (present in both human infants and animals) and operates logarithmically in accordance with Weber's law.
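As a toy formalisation of the logarithmic intuition, the sketch below picks the granularity unit whose size is closest to a reference scale on a log scale (the unit inventory and the decision rule are invented for illustration; a serious account would need context and world knowledge, not just magnitude):

```python
import math

# Hypothetical granularity levels: (unit name, size in metres).
UNITS = [("millimetres", 0.001), ("centimetres", 0.01), ("inches", 0.0254),
         ("metres", 1.0), ("kilometres", 1000.0), ("miles", 1609.34)]

def pick_granularity(reference_scale_m):
    """Choose the unit nearest the reference scale in log space -- a toy
    stand-in for a Weber's-law-style logarithmic granularity choice."""
    return min(UNITS,
               key=lambda u: abs(math.log(u[1]) - math.log(reference_scale_m)))[0]

print(pick_granularity(0.05))     # "near my foot": centimetre-scale reference
print(pick_granularity(20000.0))  # "near Edinburgh": city-scale reference
```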

Maintaining Negative Polarity in Statistical Machine Translation

Supervisor: Bonnie Webber

Negative assertions, negative commands, and many negative questions all convey the opposite of their corresponding positives. Statistical machine translation (SMT), for all its other successes, cannot be trusted to get this right: it may incorrectly render a negative clause in the source language as positive in the target; render negation of an embedded source clause as negation of the target matrix clause; render negative quantification in the source text as positive quantification in the target; or render negative quantification in the source text as verbal negation in English, thereby significantly changing the meaning conveyed.

The goal of this research is a robust, language-independent method for improving accuracy in the translation of negative sentences in SMT. To assess negation-specific improvements, a bespoke evaluation metric must also be developed, to complement the standard SMT BLEU score.
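A first, deliberately crude step toward such a metric might simply check whether source and hypothesis agree on the presence of negation cues. The cue lists below are invented and far from complete, and a real metric would need language-specific resources and proper negation-scope detection:

```python
# Hypothetical, incomplete cue lists for illustration only.
NEG_CUES = {
    "en": {"not", "no", "never", "n't", "none", "nobody", "nothing"},
    "fr": {"ne", "pas", "jamais", "aucun", "rien", "personne"},
}

def has_negation(tokens, lang):
    """True if any token is a known negation cue for the language."""
    return any(t.lower() in NEG_CUES[lang] for t in tokens)

def negation_agreement(pairs, src_lang, tgt_lang):
    """Fraction of (source, hypothesis) pairs that agree on the
    presence or absence of a negation cue -- a crude polarity check."""
    agree = sum(has_negation(s, src_lang) == has_negation(h, tgt_lang)
                for s, h in pairs)
    return agree / len(pairs)

pairs = [("il ne vient pas".split(), "he is not coming".split()),
         ("il vient".split(), "he is not coming".split())]
print(negation_agreement(pairs, "fr", "en"))  # 0.5: the second pair flips polarity
```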

Using discourse relations to inform sentence-level Statistical MT

Supervisor: Bonnie Webber

Gold standard annotation of the Penn Discourse TreeBank has enabled the development of methods for disambiguating the intended sense of an ambiguous discourse connective such as since or while, as well as for suggesting discourse relation(s) likely to hold between adjacent sentences that are not marked with a discourse connective.

Since a discourse connective and its two arguments can be viewed in terms of constraints that hold pairwise between the connective and each argument, or between the two arguments, we should be able to use these constraints in Statistical MT, either in decoding or in re-ranking, preferring translations that are compatible with the constraints.  One might start this work either by looking at rather abstract, high-frequency discourse relations such as contrast, which have rather weak pair-wise constraints, or by looking at rather specific, low-frequency relations such as chosen alternative, which have very strong constraints between the arguments.

Entity-coherence and Statistical MT

Supervisor: Bonnie Webber 

Entity-based coherence has been used in both Natural Language Generation (Barzilay and Lapata, 2008; Elsner and Charniak, 2011) and essay scoring (Miltsakaki and Kukich, 2004), reflecting the observation that texts that display the entity-coherence patterns of well-written texts from the given genre are seen as being better written than texts that don't display these patterns. Recently, Guinaudeau and Strube (2013) have shown that matrix operations can be used to compute entity-based coherence very efficiently.

This project aims to apply the same insights to Statistical MT, and to assess whether sentence-level SMT can be improved either by re-ranking candidate translations to promote those that better match natural patterns of entity-based coherence, or by using such patterns during decoding to produce better translations in the first place.
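To make the coherence signal concrete, here is a minimal sketch of an unweighted sentence projection graph, loosely after Guinaudeau and Strube (2013): sentences are nodes, an edge links any two sentences that share an entity, and coherence is the edge count normalised by document length (the full model uses weighted matrix products and syntactic roles):

```python
def entity_graph_coherence(sentence_entities):
    """Local coherence score: number of sentence pairs sharing at least
    one entity, divided by the number of sentences."""
    n = len(sentence_entities)
    shared_pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            if sentence_entities[i] & sentence_entities[j]:
                shared_pairs += 1
    return shared_pairs / n

# Each set holds the entities mentioned in one sentence (toy data).
coherent = [{"obama", "speech"}, {"obama", "senate"}, {"senate", "bill"}]
scattered = [{"obama"}, {"weather"}, {"football"}]
print(entity_graph_coherence(coherent) > entity_graph_coherence(scattered))  # True
```

A re-ranker could then prefer the candidate document translation with the higher score.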

Improving the translation of Modality in SMT

Supervisor: Bonnie Webber   

Modal utterances are common in both argumentative (rhetorical) text and instructions. This project considers the translation of such texts (for example, TED talks as representative of argumentative text) and whether the translation of modality can be improved by considering discourse-level features of such texts. For example, there may be useful constraints between adjacent sentences or clauses, such that the appearance of a modal marker in one increases/decreases the likelihood of some kind of modal marker in the other.

Automatically Generating Comments for Source Code

Supervisors: Mirella Lapata, Charles Sutton, Shay Cohen

A number of studies have shown that good comments can help programmers quickly understand what a piece of source code does, aiding program comprehension and software maintenance. Unfortunately, few software projects adequately comment their code. One way to overcome the lack of human-written summary comments, and to guard against obsolete comments, is to generate them automatically from the source code.  In addition to providing comments for existing code, generated comments encourage developers to comment newly written code.

In this project our aim is to automatically generate descriptive summary comments for source code, modeling the process after natural language generation. Specifically, we will break down the problem into content selection and text generation. Content selection involves choosing the most important code statements to be included in the summary comment. For a selected statement, text generation determines how to express the content in natural language sentences which are grammatical, coherent and non-repetitive. In developing a suitable code-to-summary model, we will borrow insights from syntax-inspired statistical machine translation as well as data-driven text-to-text generation.

Generation of Sentences and Images using Visual Abstraction

Supervisors: Mirella Lapata, Larry Zitnick

Learning the relation of language to its visual incarnation remains a challenging and fundamental problem in computer vision. Both text and image corpora offer substantial amounts of information about our physical world. Relating the information between these domains may improve existing computer vision and natural language processing (NLP) applications and lead to new applications.

In this project we will study the problem of generating descriptions for images and novel scenes for sentence descriptions. Specifically, we will use abstract scenes rather than real images, which provides us with two main advantages. First, the difficulties in automatically detecting or hand-labeling relevant information in real images can be avoided. By construction, we know the visual arrangement and attributes of the objects in the scene and thus can focus on the core problem of scene understanding. Second, one can explore subtle nuances in the interplay between visual meaning and its verbalization, since it is possible to generate different, yet semantically similar, scenes and descriptions.  Our key idea is to explore synchronous context-free grammars and related formalisms for both generating descriptions for images and synthesizing images for linguistic descriptions.
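As a toy illustration of the synchronous-grammar idea, the sketch below derives a (description, scene-command) pair in parallel: each rule rewrites a nonterminal on both sides at once. The grammar, its symbols, and the "scene command" notation are all invented for illustration:

```python
import random

# Toy synchronous grammar: each rule is a (source RHS, target RHS) pair,
# with nonterminals aligned by position on the two sides.
RULES = {
    "S":  [(["the", "NP", "VP"], ["scene:", "NP", "VP"])],
    "NP": [(["dog"], ["add(dog)"]), (["ball"], ["add(ball)"])],
    "VP": [(["jumps"], ["pose(jump)"]), (["rolls"], ["pose(roll)"])],
}

def generate(symbol, rng):
    """Expand a nonterminal synchronously, returning (source, target)."""
    src_rhs, tgt_rhs = rng.choice(RULES[symbol])
    src, tgt = [], []
    for s, t in zip(src_rhs, tgt_rhs):
        if s in RULES:          # aligned nonterminal: expand both sides together
            sub_src, sub_tgt = generate(s, rng)
            src += sub_src
            tgt += sub_tgt
        else:                   # aligned terminal pair
            src.append(s)
            tgt.append(t)
    return src, tgt

description, scene = generate("S", random.Random(0))
print(" ".join(description), "->", " ".join(scene))
```

Parsing one side of such a grammar and reading the derivation off the other side is what would let the same model both describe scenes and synthesize them.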

Improving Internet Access for Low-literacy Users with Automatic Text Simplification

Supervisor: Mirella Lapata

The ability to simplify a wide variety of documents, irrespective of size, content, or style, carries much practical import for a wide range of users. For example, it would render the internet more accessible to a broader audience, as lower-literacy readers face severe challenges in reading long and dense documents, and navigating to desired information on individual sites. We propose to formalize simplification as a synchronous grammar learning problem and argue that Wikipedia constitutes a valuable resource for obtaining such a grammar. We will develop a modeling framework that is applicable to languages other than English and will evaluate it in a realistic web page simplification setting.

Modelling Non-Cooperative Conversation

Supervisor: Alex Lascarides

Develop and implement a model of conversation that can handle cases where the agents' goals conflict.

Work on adversarial strategies from game theory and signalling theory lacks sophisticated models of linguistic meaning. Conversely, current models of natural language discourse typically lack models of human action and decision making that deal with situations where the agents' goals conflict.  The aim of this project is to fill this gap and in doing so provide a model of implicature in non-cooperative contexts.

This project involves analysing a corpus of human dialogues of users playing the game Settlers of Catan: a well-known adversarial negotiating game.  This will be used to leverage extensions to an existing state of the art dynamic semantic model of dialogue content with a logically precise model of the agents' mental states and strategies.  The project will also involve implementing these ideas into a working dialogue system that extends an existing open source agent that plays Settlers, but that has no linguistic capabilities.

Interpreting Hand Gestures in Face to Face Conversation

Supervisor: Alex Lascarides

Map hand shapes and movements into a representation of their form and meaning.

The technology for mapping an acoustic signal into a sequence of words and for estimating the position of pitch accents is very well established. But estimating which hand movements are communicative and which aren't, and which part of a communicative hand movement is the stroke or post-stroke hold (i.e., the parts of the movement that convey meaning), is much less well understood. Furthermore, to build a semantic representation of the multimodal action, one must, for depicting gestures at least (that is, gestures whose form resembles their meaning), capture qualitative properties of their shape, position and movement (e.g., that the trajectory of the hand was a circle, or a straight line moving vertically upwards).  On the other hand, deictic gestures must be represented using quantitative values in 4D Euclidean space. Mapping hand movement to these symbolic and quantitative representations of form is also an unsolved problem.

The aim of this project is to create and exploit a corpus to learn mappings from the communicative multimodal signals to the representation of their form, as required by an existing online grammar of multimodal action, which in turn is designed to yield (underspecified) representations of the meaning of the multimodal action.  We plan to use state-of-the-art models of visual processing using Kinect cameras to estimate hand positions and hand shapes, and to design Hidden Markov Models that exploit the visual signal, language models and gesture models to estimate the qualitative (and quantitative) properties of the gesture.

Incremental Interpretation for robust NLP using CCG and Dependency Parsing

Supervisor: Mark Steedman

Combinatory Categorial Grammar (CCG) is a computational grammar formalism that has recently been used widely in NLP applications including wide-coverage parsing, generation, and semantic parser induction.  The present project seeks to apply insights from these and other sources including dependency parsing to the problem of incremental word-by-word parsing and interpretation using statistical models.  Possible evaluation tasks include language modeling for automatic speech recognition, as well as standard parsing benchmarks.

Statistical NLP for Programming Languages

Supervisor: Charles Sutton

Find syntactic patterns in corpora of programming language text.

The goal of this project is to apply advanced statistical techniques from natural language processing to a completely different and new textual domain: programming language text.  Think about how you program when you are using a new library or new environment for the first time. You "program by search engine", i.e., you search for examples of people who have used the same library, and you copy chunks of code from them. I want to systematize this process, and apply it at a large scale.  We have collected a corpus of 1.5 billion lines of source code from 8000 software projects, and we want to find syntactic patterns that recur across projects. These can then be presented to a programmer as she is writing code, providing an autocomplete functionality that can suggest entire function bodies.  Statistical techniques involved include language modeling, data mining, and Bayesian nonparametrics.  This also raises some deep and interesting questions in software engineering: why do syntactic patterns occur in professionally written software when they could be refactored away?
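A minimal sketch of the language-modeling ingredient: bigram counts over code tokens, used to suggest likely continuations (real systems would use far richer models and tokenization; the corpus here is invented):

```python
from collections import Counter, defaultdict

def train_token_bigrams(corpus_lines):
    """Count which code token follows which, across a token-split corpus."""
    follows = defaultdict(Counter)
    for line in corpus_lines:
        toks = line.split()
        for a, b in zip(toks, toks[1:]):
            follows[a][b] += 1
    return follows

def suggest(follows, prev, k=2):
    """Return the k most frequent continuations of a token."""
    return [tok for tok, _ in follows[prev].most_common(k)]

corpus = ["for ( int i = 0 ;",
          "for ( int j = 0 ;",
          "while ( x > 0 )"]
model = train_token_bigrams(corpus)
print(suggest(model, "("))  # "int" is the most common continuation of "("
```

Scaling this up, and generalizing from token n-grams to recurring syntactic patterns, is where the research lies.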

Discovering hidden causes

Supervisor: Chris Lucas

In order to explain, predict, and influence events, human learners must infer the presence of causes that cannot be directly observed. For example, we understand the behavior of other people by appealing to mental states, we can infer that a new disease is spreading when we see several individuals with novel symptoms, and we can speculate about why a device or computer program is prone to crashing. The aim of this project is to better understand how humans search for and discover hidden causes, using Bayesian models and behavioural experiments.

Approximate inference in human causal learning

Supervisor: Chris Lucas
A fundamental problem in causal learning is understanding what relationships hold among a large set of variables, and in general this problem is intractable. Nonetheless, humans are able to learn efficiently about the causal structure of the world around them, often making the same inferences that one would expect of an ideal or rational learner. How we achieve this performance is not yet well understood -- we rely on approximate inferences that deviate in certain systematic ways from what an ideal observer would do, but those deviations are still being catalogued and there are few detailed hypotheses about the underlying processes. This project is concerned with exploring these processes and developing models that reproduce human performance -- including error patterns -- in complex causal learning problems, with the aim of understanding and correcting for human errors and building systems that are of practical use.

Statistical semantic models of translation

Supervisor: Adam Lopez
Statistical machine translation has been enormously successful over the last two decades, resulting in what is today a thriving industry highlighted by offerings such as Google Translate. Yet translation systems still often fail to preserve the semantics of sentences -- the "who did what to whom" relationships that they express -- because they model translation as simple substitution and permutation of words, or at best as the reordering of syntactic units, such as nouns and adjectives. To preserve semantics, they must model semantics. Computational linguists have developed formal, expressive mathematical models of language that exhibit high empirical coverage of semantically annotated linguistic data, correctly predict a variety of important linguistic phenomena in many languages, and can be processed with highly efficient algorithms. However, these models are untested as the basis of statistical translation models. The goal of projects in this area is to develop the mathematics of translation models based on these formalisms and to apply them to real translation tasks.

Graph automata for statistical semantics

Supervisor: Adam Lopez
We now have large quantities of sentences annotated with their semantics, in the form of directed graphs expressing "who did what to whom" relationships. This creates the possibility to learn large-scale statistical models of semantic parsing or text generation. From a machine learning perspective, this is a structured prediction problem: the input and outputs are strings, trees, or graphs. The formal machinery of structured prediction models  for strings and trees is often based on weighted automata and transducers: extensions of classical automata and transducers with real-valued weights representing probabilities. To extend these models to graph-based representations, we require automata and transducers for graphs, but these types of automata are much less well-developed than automata on strings and trees. The goal of projects in this area is to extend the mathematics of weighted graph automata and transducers and to apply them to real data. 

Massive parallel algorithms for language

Supervisor: Adam Lopez
Many classical algorithms and models in NLP assume sequential computation, an assumption broken by the trend of placing many cores on the same die. There are now multiple solutions to utilizing the transistor budget: a small number of complex cores, or a large number of simple cores. The latter approach, exemplified by graphics processing units (GPUs) with thousands of parallel threads of execution, is especially tantalizing for computation-bound NLP problems. But harnessing the power of such processors requires a complete rethinking of the fundamental algorithms in the field. Projects in this area focus on the design of new algorithms and data structures for difficult problems in NLP -- such as machine translation and syntactic parsing -- with the goal of achieving vast improvements in speed and enabling new applications.


Low-resource machine translation

Supervisor: Adam Lopez
Statistical machine translation has been enormously successful over the last two decades, resulting in what is today a thriving industry highlighted by offerings such as Google Translate. The most effective systems are based on statistical models learned from large numbers of translation examples, a classic application of machine learning on input/output pairs. But for many language pairs and domains we have few examples. Even if we restrict ourselves to markets with many potential users by focusing only on languages with tens of millions of speakers, there are thousands of possible language pairs. At best, we have substantial quantities of data in a few hundred of these. In most of those cases, the data is government text. For the vast majority of languages and domains, there is hardly anything. But in many cases, there is side information that we can exploit: dictionaries or other knowledge sources, or text paired with weak signals, such as images, speech, or timestamps. How can we exploit such heterogeneous information in statistical translation models? The goal of this project is to develop statistical models and inference techniques that exploit such data, and apply them to real problems in translation.


Speech Processing Topics

Topics in unsupervised speech processing and/or modelling infant speech perception

Supervisor: Sharon Goldwater

Work in unsupervised (or 'zero-resource') speech processing (see Aren Jansen et al., ICASSP 2013) has begun to investigate methods for extracting repeated units (phones, words) from raw acoustic data as a possible way to index and summarize speech files without the need for transcription.  This could be especially useful in languages where there is little data to develop supervised speech recognition systems.  In addition, it raises the possibility of whether similar methods could be used to model the way that human infants begin to identify words in the speech stream of their native language.  Unsupervised speech processing is a growing area of research with many interesting open questions, so a number of projects are possible.  Projects could focus mainly on ASR technology or mainly on modeling language acquisition; specific research questions will depend on this choice.  Here are just two possibilities: (1) Unsupervised learners are more sensitive to input representation than are supervised learners, and preliminary work suggests that MFCCs are not necessarily the best option.  Investigate how to learn better input representations (e.g., using neural networks) that are robust to speaker differences but encode linguistically meaningful differences.  (2) Existing work in both speech processing and cognitive modeling suggests that trying to learn either words or phones alone may be too difficult, and that in fact we need to develop *joint learners* that simultaneously learn at both levels.  Investigate models that can be used to do this and evaluate how joint learning can improve performance.
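At the heart of the repeated-unit discovery methods cited above is dynamic time warping (DTW), which aligns two variable-rate feature sequences. A self-contained sketch of plain DTW over toy one-dimensional "feature tracks" (segmental DTW, as used for term discovery, adds windowing and band constraints on top of this):

```python
def dtw(a, b, dist=lambda x, y: abs(x - y)):
    """Dynamic time warping distance between two sequences: minimum
    cumulative frame distance over all monotonic alignments."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = dist(a[i - 1], b[j - 1]) + min(D[i - 1][j],
                                                     D[i][j - 1],
                                                     D[i - 1][j - 1])
    return D[n][m]

# y is a time-warped copy of x (some frames repeated), so DTW finds a
# much smaller distance than with the unrelated track z.
x = [1, 2, 3, 4, 3]
y = [1, 1, 2, 3, 3, 4, 3]
z = [9, 9, 9, 9, 9]
print(dtw(x, y) < dtw(x, z))  # True
```

In real systems the sequences would be frames of acoustic features (e.g., MFCCs) and the low-distance alignments would be candidate repeated words.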

Deep Neural Network-based speech synthesis

Supervisors: Simon King, Steve Renals

DNNs may offer new ways to model the complex relationship between text and speech, compared to HMMs. In particular, they may enable more effective use of the highly-factorial nature of the linguistic representation derived from text, which in the HMM approach is a 'flat' sequence of phonetically- and prosodically-context-dependent labels.

We are interested in any topic concerning DNN-based speech synthesis, for example: novel text processing methods to extract new forms of linguistic representation; how to represent linguistic information at the input to the DNN; how to represent the speech signal (or vocoder parameters) at the output of the DNN; methods for control over the speaker/gender/accent/style of the output speech; and combinations of supervised, semi-supervised and unsupervised learning.
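As a minimal illustration of the regression view of DNN-based synthesis — binary linguistic-context features in, continuous acoustic parameters out — here is a toy one-hidden-layer network trained on synthetic data. All feature and parameter choices are hypothetical stand-ins, not an actual synthesis front-end:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: each frame is a binary linguistic-context vector (standing in
# for features such as phone identity, stress, position-in-syllable),
# mapped to a continuous vector of vocoder-like parameters.
X = rng.integers(0, 2, size=(200, 30)).astype(float)        # linguistic features
true_W = rng.normal(size=(30, 5))
Y = np.tanh(X @ true_W) + 0.01 * rng.normal(size=(200, 5))  # acoustic targets

# One-hidden-layer network trained with plain gradient descent on MSE.
W1 = rng.normal(scale=0.1, size=(30, 64)); b1 = np.zeros(64)
W2 = rng.normal(scale=0.1, size=(64, 5));  b2 = np.zeros(5)
lr = 0.05
for step in range(500):
    H = np.tanh(X @ W1 + b1)          # hidden layer
    P = H @ W2 + b2                   # predicted vocoder parameters
    err = P - Y
    loss = (err ** 2).mean()
    if step == 0:
        init_loss = loss
    # Backpropagation
    dP = 2 * err / err.size
    dW2 = H.T @ dP; db2 = dP.sum(0)
    dH = (dP @ W2.T) * (1 - H ** 2)
    dW1 = X.T @ dH; db1 = dH.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2

final_loss = ((np.tanh(X @ W1 + b1) @ W2 + b2 - Y) ** 2).mean()
```

The research questions above concern what should replace X and Y — which linguistic representations and which acoustic parameterisations — rather than the regression machinery itself.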

Hidden Markov model-based speech synthesis

Supervisors: Simon King, Junichi Yamagishi

The HMM, which is a statistical model that can be used to both classify and generate speech, offers an exciting alternative to concatenative methods for synthesising speech. As a consequence, most research effort around the world in speech synthesis is now focussed on HMMs because of the flexibility that they offer.

There are a number of topics we are interested in within HMM-based speech synthesis, including: speaker, language and accent adaptation; cross-language speaker adaptation; improving the signal processing and vocoding aspects of the model; unsupervised and semi-supervised learning.
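To illustrate the "generate" use of the HMM mentioned above, a toy sketch: sampling a parameter trajectory from a small left-to-right Gaussian HMM. The states, means and self-loop probability are made up for illustration; real systems use context-dependent models with duration modelling and trajectory smoothing:

```python
import numpy as np

rng = np.random.default_rng(2)

# A toy left-to-right HMM with one Gaussian per state, standing in for the
# context-dependent models used in HMM-based synthesis.  Each state mean is
# a (hypothetical) vocoder-parameter target; the self-loop probability
# controls state durations.
means = np.array([[0.0, 1.0], [2.0, -1.0], [1.0, 0.5]])  # 3 states, 2 dims
stds = np.full_like(means, 0.1)
self_loop = 0.8   # probability of staying in the current state

def generate(max_frames=100):
    frames, state = [], 0
    for _ in range(max_frames):
        frames.append(rng.normal(means[state], stds[state]))
        if rng.random() > self_loop:      # advance to the next state
            state += 1
            if state == len(means):
                break
    return np.array(frames)

traj = generate()
```

The same parameters could score an observed sequence (classification) or, as here, emit one (generation), which is the flexibility the paragraph above refers to.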

Personification using affective speech synthesis

Supervisors: Simon King, Matthew Aylett

New approaches to capture, share and manipulate information in sectors such as health care and the creative industries require computers to enter the arena of human social interaction. Users readily adopt a social view of computers and previous research has shown how this can be harnessed in applications such as giving health advice, tutoring, or helping children overcome bullying.

However, whilst current speech synthesis technology is highly intelligible, it has not been able to deliver voices which aid this 'personification'. The lack of naturalness makes some synthetic voices sound robotic, while the lack of expressiveness makes others sound dull and lifeless.

In many of the above applications, it is less important to be able to render arbitrary text than it is to convey a sense of personality within a more limited domain. This project would therefore investigate two key problems: (1) merging expressive pre-recorded prompts with expressive unit selection speech synthesis; (2) dynamically altering voicing in speech to convey underlying levels of stress and excitement, using source-filter decomposition techniques.

Cross-lingual acoustic models

Supervisors: Simon King, Steve Renals, Junichi Yamagishi

Adapting speech recognition acoustic models from one language to another, with a focus on limited resources and unsupervised training.

Current speech technology is based on machine learning and trainable statistical models.  These approaches are very powerful, but before a system can be developed for a new language considerable resources are required:  transcribed speech recordings for acoustic model training; large amounts of text for language model training; and a pronunciation dictionary.  Such resources are available for languages such as English, French, Arabic, and Chinese, but there are many less well-resourced languages.  There is thus a need for models that can be adapted from one language to another with limited effort and resources.  To address this we are interested in two (complementary) approaches.  First, the development of lightly supervised and unsupervised training algorithms:  speech recordings are much easier to obtain than transcriptions.  Second, the development of models which can factor language-dependent and language-independent aspects of the speech signal, perhaps exploiting invariances derived from speech production.  We have a particular interest in approaches (1) building on the subspace GMM framework, or (2) using deep neural networks.

Factorised acoustic models

Supervisors: Simon King, Steve Renals, Junichi Yamagishi

Acoustic models which factor specific causes of variability, thus allowing more powerful adaptation for speech recognition, and greater control for speech synthesis.

Adaptation algorithms, such as MLLR, MAP, and VTLN, have been highly effective in acoustic modelling for speech recognition.  However, current approaches only weakly factor the underlying information - for instance "speaker" adaptation will typically adapt for the acoustic environment and the task, as well as for different aspects of the speaker.  It is of great interest to investigate speech recognition models which are able to factor the different sources of variability.  PhD projects in this area will explore the development of factored models that enable specific aspects of a system to be adapted.  For example, it is of great interest - for both speech recognition and speech synthesis - to be able to model accent in a specific way.  We are interested in two modelling approaches which hold great promise for this challenge: subspace Gaussian mixture models, and deep neural networks.

Hidden Speech Production Models

Supervisor: Steve Renals

Building speech analysis and recognition models that respect the constraints of speech production.

This project is concerned with building speech analysis and recognition models that respect the constraints of speech production. Although speech production data may be available when training data is collected (if a specialised recording facility is used) it is not available in the general case. The aim of this project is to use observed articulatory data to construct a hidden space. When unseen acoustic data is presented this hidden space is inferred from the data, and can be used as a constraint for speech recognition. The advantage is that inference of the hidden space of articulation must respect the constraints learned from observed articulatory data.

Language Models for Multiparty Conversations

Supervisor: Steve Renals

Develop statistical models that integrate history information from multiparty conversations whilst remaining computationally feasible.

Large vocabulary speech recognition is largely based on language models in which the probability of the current word is estimated conditional on the previous words spoken (the history). However, in multiparty conversations there is not a single, linear stream of words, and it is not always obvious what the closest history is. This project aims to develop statistical models that integrate history information from both the talker in question and the other talkers, while remaining computationally feasible.  There are two approaches we are interested in pursuing for this task:  (1) language models using distributed representations, such as neural network language models and deep belief network language models;  (2) hierarchical Bayesian models, building on non-parametric approaches such as the Hierarchical Dirichlet Process Language Model and the Hierarchical Pitman-Yor Language Model.
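As a toy illustration of the modelling problem (not one of the proposed approaches), here is a count-based bigram model that interpolates two histories — the current talker's own previous word and the most recent word from any talker — with a fixed interpolation weight:

```python
from collections import defaultdict

class MultipartyBigram:
    """Toy bigram LM that interpolates two histories: the current talker's
    own previous word and the most recent word from any talker.  The
    interpolation weight `lam` is fixed here; in a real model it would be
    estimated (or the whole model replaced by a neural or hierarchical
    Bayesian one, as proposed above).
    """
    def __init__(self, lam=0.7, vocab_size=1000):
        self.lam, self.V = lam, vocab_size
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, turns):
        """turns: list of (speaker, [words]) in conversation order."""
        last_by_speaker, last_any = {}, None
        for speaker, words in turns:
            for w in words:
                own = last_by_speaker.get(speaker)
                for h in (own, last_any):
                    if h is not None:
                        self.counts[h][w] += 1
                last_by_speaker[speaker] = w
                last_any = w

    def prob(self, word, own_prev, any_prev):
        def p(h):
            c = self.counts[h]
            return (c[word] + 1) / (sum(c.values()) + self.V)  # add-one
        return self.lam * p(own_prev) + (1 - self.lam) * p(any_prev)

lm = MultipartyBigram()
lm.train([("A", ["shall", "we", "start"]),
          ("B", ["yes", "lets", "start"])])
```

The open question is precisely how these multiple histories should be weighted and represented, which the fixed `lam` here sidesteps.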

Microphone-Array Based Speech Recognition

Supervisor: Steve Renals

Investigate the use of microphone arrays attached to walls or tables, rather than microphones worn by users, as the input to speech recognition.

Rather than requiring users to wear microphones for speech recognition to operate, we are interested in using microphone arrays attached to walls or tables. Microphone array beamforming for localization and enhancement is well studied; using microphone arrays for speech recognition is less so. We are interested in approaches in which the parameters used to combine the signals from the different microphones are treated as parameters of the speech recognition acoustic model, and optimized along with the rest of the system. Interesting research issues arise when talkers are moving, and when multiple talkers overlap.
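For contrast with the proposed joint optimisation, the classical signal-level baseline is delay-and-sum beamforming. A toy numpy sketch with known integer sample delays (a real system would have to estimate fractional delays, from possibly moving talkers):

```python
import numpy as np

rng = np.random.default_rng(3)

# The same source signal arrives at each microphone with a known integer
# sample delay plus independent noise.  Aligning and averaging the channels
# raises the SNR; in the approach described above, the combination
# weights/delays would instead be optimised jointly with the recogniser's
# acoustic model.
n, delays = 4000, [0, 3, 7, 12]           # 4 mics, per-mic delays in samples
source = np.sin(2 * np.pi * 0.01 * np.arange(n))

channels = [np.roll(source, d) + 0.5 * rng.normal(size=n) for d in delays]

# Delay-and-sum: undo each known delay, then average across microphones.
beamformed = np.mean([np.roll(x, -d) for x, d in zip(channels, delays)],
                     axis=0)

def snr_db(x):
    noise = x - source
    return 10 * np.log10((source ** 2).mean() / (noise ** 2).mean())
```

Averaging four channels of independent noise cuts the noise power roughly fourfold, about a 6 dB SNR gain over any single microphone.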

Distributed linguistic representations for spoken language processing

Supervisor: Steve Renals

Development of language models utilising distributed representations for words.

Until recently, the dominant language models for speech recognition were based strongly on n-grams, in which probability models are built over a vocabulary of words, resulting in very high dimensions (frequently 1 million or more).  In recent years there has been growing interest in models which use distributed representations of words, for example latent semantic analysis language models and neural network language models.  The latter, in particular, have proven to be very attractive.  In this project we plan to explore models in which the distributed representation is automatically learned, enabling words to be embedded in what may be considered a semantic space.  We are particularly interested in investigating approaches based on deep neural networks, on hierarchical Bayes, and on ideas from factorised language model approaches such as Model M.  We are interested in applying such language models to speech recognition and other tasks such as topic identification and summarisation.
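One simple way to obtain distributed representations of the kind described, predating the neural approaches, is LSA-style factorisation of a word co-occurrence matrix. A toy sketch (the corpus, window size and embedding dimension are arbitrary choices for illustration):

```python
import numpy as np

# Build a word-word co-occurrence matrix from a tiny corpus and factorise
# it with SVD: the low-rank rows serve as word embeddings.  Neural language
# models learn related representations discriminatively instead.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "cats and dogs are animals",  # hypothetical example sentences
]
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

C = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 2), min(len(sent), i + 3)):  # +-2 window
            if i != j:
                C[idx[w], idx[sent[j]]] += 1

# Low-rank factorisation: rows of U[:, :k] * S[:k] are word embeddings.
U, S, _ = np.linalg.svd(C)
emb = U[:, :3] * S[:3]

def sim(a, b):
    va, vb = emb[idx[a]], emb[idx[b]]
    return va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-12)
```

Here "cat" and "dog" occur in identical contexts, so their embeddings coincide; words used similarly end up close in the induced semantic space.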

Detecting and linking events in multiparty conversations

Supervisor:  Steve Renals

In the AMI project we developed an approach called content linking, which uses real-time conversational speech recognition to automatically transcribe a meeting in progress, then constructs search queries from the recently detected words in order to retrieve relevant multimodal and text documents. Such a "search-without-query" approach may be viewed as a way to automatically provide context (in the form of relevant documents and media files) for an ongoing multiparty conversation, without requiring explicit search.  This baseline application opens up a number of research challenges, such as improving the core models and algorithms used to link content, developing approaches to finding the most important "events" in a conversation to which content should be linked, and finding ways of automatically constructing relations to link conversational events.

Enhancing spoken language processing with social dynamics

Supervisor: Steve Renals

Identifying social signals in multiparty spoken interaction to enable better recognition and interpretation of what is being communicated.

Spoken conversations are not just streams of words: there are numerous social cues relating to things such as agreement/disagreement, positivity/negativity, and the social roles adopted by participants in a conversation, as well as the interaction patterns between people in a multiparty conversation.  There are a number of research challenges, ranging from the extraction and identification of social cues from audio or multimodal recordings of interactions, to their use as additional variables in spoken language processing models.  For example, is it possible to use social features of a conversation to enhance language modelling or topic identification?  Can social cues be used as a kind of prior for tasks like speaker diarization?  How can social signals be used to structure a conversation in terms of things like the decisions that have been made?  This work would initially focus on the AMI corpus, a 100-hour corpus of multimodal meeting recordings with many layers of annotation.

Implicit spoken dialogue systems

Supervisor:  Steve Renals

Development of dialogue systems technology that does not require 100% attention from users and learns when to intervene in a conversation.

Current dialogue systems do not interact very naturally - for example, they often demand 100% of the users' attention and attempt to interpret and respond to every utterance.  In the long term we would like to develop spoken language systems which learn how to interact in multiparty conversations more naturally.  This project concerns monitoring the social signals of the speakers, and the social context of the conversation, as well as the content of the conversation, to decide when to interact and what to say.  Such a system could be considered an "implicit" question answering system - answering queries which come up in a conversation, without their being explicitly posed.

Natural Interactive Virtual Agents

Supervisor: Hiroshi Shimodaira

Development of a lifelike animated character that is capable of establishing natural interaction with humans in terms of non-verbal signals.

Embodied conversational agents (ECAs) aim to foster natural communication between machines and humans. State-of-the-art technology in computer graphics has made it possible to create photo-realistic animation of human faces. However, the same is not yet true of interaction: interactions between ECAs and humans are not as natural as those between humans. Although there are many reasons for this, the present project focuses on the non-verbal aspects of communication, such as gesture and gaze, and seeks to develop an ECA system that is capable of recognising a user's non-verbal signals and synthesising appropriate signals for the agent.

Gesture Synthesis for Lifelike Conversational Agents

Supervisor: Hiroshi Shimodaira

Development of a mechanism for controlling the gestures of photo-realistic lifelike agents when the agents are in various modes; idling, listening, speaking and singing.

Lifelike conversational agents, which behave like humans with facial animation and gesture and hold spoken conversations with humans, are one of the next-generation human interfaces. Much effort has so far been devoted to making such agents natural, especially to controlling mouth/lip movement and eye movement. On the other hand, controlling the non-verbal movements of the head, facial expressions, and shoulders has not been studied as much, even though these motions sometimes play a crucial role in naturalness and intelligibility. The purpose of the project is to develop a mechanism for creating such motions for photo-realistic lifelike agents when the agents are in various modes: idling, listening, speaking and singing. One of the outstanding features of the project is that it aims to give the agent a virtual personality by imitating the manner of movement/gesture of an existing person, with the help of machine learning techniques used for text-to-speech synthesis.

Evaluating the impact of expressive speech synthesis on embodied conversational agents

Supervisors: Hiroshi Shimodaira, Matthew Aylett

Evaluation of embodied conversational agents (ECAs) has tended to concentrate on usability: do users like the system, does the system achieve its objectives. We are not aware of any studies of this type which have controlled for the speech synthesis used, where that synthesis was close to the state of the art. This project will develop evaluation based on measured interaction, where expressive speech synthesis is the topic of study. It will explore carefully controlled interactive environments, and measure a subject's performance as well as the physiological effects of the experiment on the subject. It will explore how involved (emotionally or otherwise) our subject is with the ECA. The usability approach described above is also important for these experiments, but we also wish to determine how speech affects involvement over time. For example, we would expect to increase a subject's arousal by adding emotional elements to the speech, and we might expect to destroy a subject's involvement by intentionally producing an error which undermines the ECA's believability.


Action and Decision Making Topics

Preference Change

Supervisors: Alex Lascarides, Subramanian Ramamoorthy

To design, implement and evaluate a model of agents whose intrinsic preferences change as they learn about unforeseen states and options in decision or game problems that they are engaged in.

Most models of rational action assume that all possible states and actions are pre-defined and that preferences change only when beliefs do. But there are many decision and game problems that lack these features: games where an agent starts playing without knowing the hypothesis space, but rather discovers unforeseen states and options as he plays. For example, an agent may start by preferring meat to fish, but when he discovers saffron for the first time, likes it enormously, and finds that it goes better with fish than with meat, his preferences change to preferring fish to meat as long as saffron is available. In effect, an agent may find that the language he can use to describe his decision or game problem changes as he plays it (in this example, state descriptions are refined via the introduction of a new variable, saffron). The aim of this project is to design, implement and evaluate a model of action and decision making that supports reasoning about newly discovered possibilities and options. This involves a symbolic component, which reasons about how a game changes (both beliefs and preferences) as one adds or removes random variables or their range of values, and it involves a probabilistic component, which reasons about how these changes to the description of the game affect Bayesian calculations of optimal behaviour.
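The saffron example can be made concrete with a small worked calculation (the utilities and the probability that saffron is available are hypothetical numbers chosen for illustration): refining the state space with the newly discovered variable changes which option maximises expected utility.

```python
# Before discovery: utilities over the original, coarser state space.
utility_before = {"meat": 0.8, "fish": 0.6}
best_before = max(utility_before, key=utility_before.get)

# After discovering saffron, each state splits on its availability.
utility_after = {
    ("meat", "saffron"): 0.8,   # saffron adds little to meat
    ("meat", "none"): 0.8,
    ("fish", "saffron"): 1.0,   # saffron goes better with fish
    ("fish", "none"): 0.6,
}
p_saffron = 0.9  # belief that saffron is available

def expected_utility(dish):
    return (p_saffron * utility_after[(dish, "saffron")]
            + (1 - p_saffron) * utility_after[(dish, "none")])

best_after = max(["meat", "fish"], key=expected_utility)
```

The symbolic component of the proposed model would handle the refinement of the state descriptions; the probabilistic component would handle the recomputation of expected utilities, as sketched here.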
