Possible PhD Topics in ILCC

This page lists possible PhD topics suggested by members of staff in ILCC. These topics are meant to give PhD applicants an idea of the scope of the work in the Institute.

As part of a PhD application, applicants have to submit a PhD proposal, which can be based on one of the topics on this page (but the proposal needs to be much more detailed). Of course applicants can also suggest their own topic. In both cases, they should contact the potential supervisor before submitting an application.

Note that there is no specific funding attached to any of the suggested topics; however, the School of Informatics offers a range of scholarships for PhD students.

A number of grant-funded PhD studentships are also available, including 5 studentships in speech technology.

 

Language Processing Topics

Concurrency in (Computational) Linguistics

Improving understanding of synchronic and diachronic aspects of phonology.

Supervisor: Julian Bradfield

In several aspects of linguistic analysis, it is natural to think of some form of concurrent processing, not least because the brain is a massively concurrent system. This is particularly true in phonology and phonetics, and descriptions such as feature analyses, and especially autosegmental phonology, go some way to recognizing this. Although there has been some work on rigorous formal models of such descriptions, there has been little if any application of the extensive body of research in theoretical computer science on concurrent processes. Such a project has the potential to give better linguistic understanding of synchronic and diachronic aspects of phonology and perhaps syntax, and even to improve speech generation and recognition, by adding formal underpinning and improvement to the existing agent-based approaches.

Cognitive Robotics: Plan Recognition Using Bounded Parallelism

Supervisor: Christopher Geib

Plan recognition is the problem of identifying the plans, goals, and intentions of an agent based on observations of their actions. Geib (2009) describes an algorithm for plan recognition based on parsing formal plan grammars. This algorithm performs an exhaustive search of the space of possible parses of a sequence of actions to determine the objectives of the observed agent. The ubiquity of multiprocessor machine architectures raises the question of whether this and similar algorithms can benefit from bounded parallelism on current hardware.

Cognitive Robotics: Plan Recognition Using Approximate Search

Supervisor: Christopher Geib

Plan recognition is the problem of identifying the plans, goals, and intentions of an agent based on observations of their actions. Geib (2009) describes an algorithm for plan recognition based on parsing plan grammars. This algorithm performs an exhaustive search of the space of possible parses of a sequence of actions to determine the objectives of the observed agent. However, current state-of-the-art parsers from natural language processing research use incomplete and heuristic search methods and achieve significant performance gains with relatively small sacrifices in accuracy. The goal of this project is to study the possibilities for using incomplete search mechanisms for this problem. Such mechanisms could include beam search, particle filters, probabilistic A*, and other heuristic methods.
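As a rough illustration of incomplete search, the sketch below runs a beam search over explanations of an observed action sequence. The toy plan library `PLANS`, the goal-switching penalty, and the beam width are hypothetical and are not taken from Geib (2009); the point is only that keeping the best few partial hypotheses after each observation bounds the work per step, at the risk of pruning the correct explanation.

```python
import math
from heapq import nlargest

# Hypothetical toy plan library: P(action | goal). Illustrative values only.
PLANS = {
    "make_tea":    {"boil_water": 0.5, "get_cup": 0.3, "add_teabag": 0.2},
    "make_coffee": {"boil_water": 0.4, "get_cup": 0.3, "grind_beans": 0.3},
    "wash_up":     {"get_cup": 0.6, "run_tap": 0.4},
}
STAY_LOGPROB = math.log(0.9)    # assumed: the agent usually keeps the same goal
SWITCH_LOGPROB = math.log(0.1)  # assumed penalty for changing goal mid-sequence


def beam_search(observations, beam_width=2):
    """Return (log score, goal per action) for the best hypothesis in the beam."""
    beam = [(0.0, ())]  # each hypothesis: (log score, goals assigned so far)
    for action in observations:
        candidates = []
        for score, goals in beam:
            for goal, emissions in PLANS.items():
                p = emissions.get(action)
                if p is None:
                    continue  # this goal cannot explain the action
                step = math.log(p)
                if goals:
                    step += STAY_LOGPROB if goals[-1] == goal else SWITCH_LOGPROB
                candidates.append((score + step, goals + (goal,)))
        if not candidates:
            return None  # no goal in the library explains this action
        beam = nlargest(beam_width, candidates)  # prune to the best hypotheses
    return max(beam)


best = beam_search(["boil_water", "get_cup", "add_teabag"])
print(best[1])
```

An exhaustive search would keep every candidate at each step; here the number of hypotheses carried forward never exceeds `beam_width`.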

Cognitive Robotics: Plan Recognition In Partially Observable Domains

Supervisor: Christopher Geib

The problem of recognizing the goals, plans and intentions of agents in partially observable domains is very difficult. In general it requires significantly expanding the search space of possible explanations for an observed set of actions. Given the already large search space of possible hypotheses, this can force plan recognition algorithms into a part of the performance space that is unacceptable for real-world applications such as computer network security and assistive systems for the elderly. This project seeks efficient, possibly approximate algorithms and methods to address this challenge.

Cognitive Robotics: Planning based on Lexicalized Representations

Supervisor: Christopher Geib

There are well-known relationships between the expressiveness of various classes of formal grammars and planning representations (e.g. Context Free Grammars and Hierarchical Task Network Planning). The identification of formal grammars with expressiveness between that of context-free grammars and context-sensitive grammars and with polynomial parsing bounds (Tree Adjoining Grammars, Combinatory Categorial Grammars, and others) raises the question of whether these representations have a natural analog that would result in efficient and expressive planning algorithms. This project seeks to answer this question and to explore the relationship between current state-of-the-art grammatical formalisms used in natural language processing and planning algorithms.

Joint learning of morphology and syntax

Supervisor: Sharon Goldwater

Develop a system for joint unsupervised learning of morphology and syntax.

Morphological analysis (segmentation) is an important part of NLP and speech technology in many morphologically rich languages. Interestingly, unsupervised morphological analysis systems sometimes yield better results for downstream applications than do supervised systems. However, there is still room for improvement. This project aims to improve unsupervised morphological analysis by incorporating some form of syntactic information (e.g., POS tags or dependencies) into a joint learning model. Evaluation will be both against a gold standard and, importantly, through incorporating the morphological analysis into a downstream application such as machine translation or speech recognition. One goal is to determine what properties of a morphological segmentation are useful for that application.

Uniform Information Density and multimodal interaction

Supervisor: Sharon Goldwater

Explore whether and how the Uniform Information Density hypothesis applies to gestural aspects of communication.

The Uniform Information Density  (UID) hypothesis states that speakers attempt to communicate roughly equivalent amounts of information per unit time.  UID predicts, for example, that if a word is more predictable based on context then the speaker will pronounce the word less clearly or with shorter duration, as this loss in information balances out the gain due to context.  This and other predictions of UID have been shown to hold for speech, but in face-to-face communication, multiple channels are involved.  In particular, information is communicated through gesture as well as speech, so the predictions of UID should carry over into the domain of gesture.  This project aims to clarify and test these predictions.  Key challenges will be in identifying and/or collecting an appropriate corpus, and in defining a measure of reduction for gesture (analogous to reduction in word pronunciation).
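The amount of information a word carries in context is conventionally measured as its surprisal, -log2 P(word | context), in bits. The following minimal sketch estimates surprisal with an add-one-smoothed bigram model; the toy corpus and the function names are hypothetical and serve only to make the quantity concrete.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Collect context counts, bigram counts and the vocabulary."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])        # every token that can be predicted
        contexts.update(tokens[:-1])    # every token that acts as a context
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocab


def surprisal(prev, word, model):
    """Surprisal in bits, -log2 P(word | prev), with add-one smoothing."""
    contexts, bigrams, vocab = model
    p = (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))
    return -math.log2(p)


# Hypothetical toy corpus: "dog" follows "the" more often than "cat" does.
model = train_bigram(["the dog barked", "the dog slept", "the cat slept"])
dog_bits = surprisal("the", "dog", model)
cat_bits = surprisal("the", "cat", model)
print(f"dog: {dog_bits:.2f} bits, cat: {cat_bits:.2f} bits")
```

Under UID, the lower-surprisal word ("dog" in this toy corpus) is the one predicted to be reduced in pronunciation; a measure of gestural reduction would play the same role on the gesture side.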

Unsupervised extraction of words and phrases from speech data

Supervisor: Sharon Goldwater

Investigate methods of extracting linguistically meaningful snippets from raw acoustic data, either as a model of early word learning in humans, or as a way to spot keywords in audio without transcription.

Recent work in speech recognition (e.g., Park and Glass, 2006) has begun to investigate methods for extracting repeated words from raw acoustic data as a possible way to index and summarize speech files without the need for transcription. This could be especially useful in languages where there is little data to develop supervised speech recognition systems. In addition, it raises the question of whether similar methods could be used to model the way that human infants begin to identify words in the speech stream of their native language. This project takes inspiration from existing work, but aims to develop more robust and/or more cognitively plausible methods.

Possible areas of investigation include: what kinds of features should be used in the input (MFCC vectors? acoustic-prosodic cues?), and is there some intermediate level of representation (e.g., syllables) that would be helpful?  How scalable are the existing methods, and how do their results differ when run on different kinds of input data (e.g., lectures vs. child-directed speech, English vs. other languages)? The project could focus mainly on ASR technology or mainly on modeling language acquisition; specific research questions will depend on this choice.

Incremental processing and memory effects in Bayesian models of word segmentation

Supervisor: Sharon Goldwater

Investigate different incremental processing algorithms for Bayesian models of infant word segmentation.

Word segmentation (identifying individual words from continuous speech) is one of the earliest language acquisition problems that infants face.  Bayesian models have proven to be successful in modeling certain aspects of the learning task, but current implementations have difficulty handling memory effects and incremental processing behavior.  This project aims to investigate different incremental processing algorithms for Bayesian models of word segmentation, and compare their results against human behavioral data in which memory limitations appear to play a role.  We hope to gain insight into the role of memory and processing limitations in human language acquisition.

Improving multilingual text recognition for NLP

Supervisor: Sharon Goldwater 

Use language modeling techniques to improve the output of text or handwriting recognition systems in less well-resourced languages in order to improve results of downstream NLP systems.

Optical character recognition, handwriting recognition, and document recognition are already quite successful in English and many European languages, but performance in many other languages (e.g., Thai, Bengali, Arabic, Hebrew) is much worse, and can degrade the performance of any automatic processing applied to the output of character recognition (e.g., machine translation, summarization). Most work on improving character recognition focuses on the visual aspect of the task (developing better machine learning methods for identifying individual characters), but there is little work on improving results by using the linguistic context to disambiguate difficult-to-recognize characters and words. In this project, we plan to apply better language modeling techniques for improving character recognition in various languages. We will evaluate results in part by examining the effects on downstream NLP tasks. The focus will be on less well-resourced languages, so will likely involve unsupervised and/or semi-supervised learning.
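One common way to bring linguistic context into recognition is noisy-channel rescoring: each candidate string from the recogniser is scored by its recogniser log-probability plus a weighted language model log-probability, and the best combined score wins. The sketch below assumes a character bigram model with add-one smoothing; the training text, candidate scores, and function names are hypothetical.

```python
import math
from collections import Counter


def train_char_bigrams(text):
    """Character bigram counts, with '^' marking the start of each word."""
    contexts, bigrams, alphabet = Counter(), Counter(), set()
    for word in text.split():
        chars = "^" + word
        alphabet.update(word)
        contexts.update(chars[:-1])
        bigrams.update(zip(chars, chars[1:]))
    return contexts, bigrams, alphabet


def lm_logprob(word, model):
    """Add-one-smoothed log-probability of a word under the bigram model."""
    contexts, bigrams, alphabet = model
    size = len(alphabet) + 1  # one extra slot reserves mass for unseen characters
    chars = "^" + word
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + size))
        for a, b in zip(chars, chars[1:])
    )


def rescore(candidates, model, lm_weight=1.0):
    """Pick the candidate maximising recogniser score + weighted LM score."""
    return max(
        candidates,
        key=lambda cand: cand[1] + lm_weight * lm_logprob(cand[0], model),
    )[0]


# Hypothetical data: the recogniser slightly prefers the misreading "tbe".
model = train_char_bigrams("the then there them this that")
candidates = [("tbe", math.log(0.55)), ("the", math.log(0.45))]
print(rescore(candidates, model))               # language model included
print(rescore(candidates, model, lm_weight=0))  # recogniser score alone
```

For less well-resourced languages the language model itself would have to be learned from little or noisy text, which is where the unsupervised and semi-supervised methods mentioned above come in.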

Modeling Syntactic Priming with Synchronous Grammars

Supervisor: Frank Keller

Develop a model of how two languages prime each other in bilingual speakers based on synchronous grammars trained on a bilingual corpus.

Syntactic priming refers to speakers' re-use of syntactic structures, i.e., a given syntactic rule tends to occur more often if the same rule has been used recently. This phenomenon is supported by a wealth of experimental, corpus, and modeling evidence. Recently, cross-linguistic priming effects have been attested: bilingual speakers that produce a given structure in one language tend to re-use it even when they switch to the other language.

The aim of this project is to develop a computational model of crosslinguistic priming. This will involve discovering which structures can prime each other by learning a syntactic alignment between two languages from a bilingual corpus. This will result in a synchronous grammar of the two languages, which can form the basis of a model of bilingual language production. The overall model also needs to incorporate an account of code switching, i.e., of the process of switching from one language to the other.

Exploiting Gaze Information for Natural Language Processing

Supervisor: Frank Keller

Explore whether gaze information can be utilized to improve tasks such as co-reference resolution, dialog act recognition, or summarization.

Previous work has shown that gaze (information about where people look) can be useful for NLP tasks, e.g., speech recognition or co-reference resolution. However, these results were obtained in restricted, artificial domains, and it is not clear whether they generalize to naturalistic dialogue. The aim of this project is to exploit the gaze annotation in the AMI meeting corpus for NLP tasks such as co-reference resolution, dialog act recognition, or summarization. This may involve an elaboration of the existing coarse gaze annotation, and the utilization of other non-linguistic information in the AMI corpus, e.g., hand gestures and head movements. This work could be complemented by eye-tracking experiments conducted to collect gaze data specifically for NLP tasks.

Exploiting Eye-tracking and Linguistic Data for Training Visual Object Detectors

Supervisors: Frank Keller, Vittorio Ferrari

Develop algorithms that can use eye-tracking data for images and associated text as training data for object detection.

Studies of human visual processing show that humans tend to fixate objects (rather than non-object regions) when viewing an image. Human fixation data can be acquired using an eye-tracker (a device that records the x/y-coordinates of fixations on the screen). Given the object-based nature of human image viewing, it should be possible to use eye-tracking data as a supervisory signal for training automatic object detection algorithms.  Instead, current methods typically require large amounts of annotated training data, i.e., images where the labels and boundaries of the objects have been manually marked by humans, an expensive and cumbersome process.

The first aim of this project is to investigate whether eye-tracking data can be used to infer object boundaries, bypassing the need for manual annotation. Secondly, it may also be possible to infer object labels based on eye-tracking data of images with associated text, as humans are likely to fixate the relevant object when reading text describing it. The project will involve collecting relevant eye-tracking data, as well as developing schemes to make use of this data for weakly supervised learning of object detectors.

An Integrated Model of Human Syntactic and Semantic Processing

Supervisors: Frank Keller, Mirella Lapata

Build a computational model of human language processing that brings together probabilistic accounts of syntax, semantics, and possibly discourse.

Over the past years, successful broad-coverage models of human syntactic processing have been proposed, e.g., Hale's Surprisal model. At the same time, human semantic intuitions have been captured successfully using semantic space models such as Latent Semantic Analysis. These two lines of work have been pursued largely independently, even though experimental results clearly show that humans combine syntactic and semantic information during sentence comprehension. The aim of this project is to build an integrated model of syntactic and semantic processing that combines Surprisal-based incremental parsing with LSA-based semantic construction.

The key challenge is to integrate semantic knowledge into the parser, while also enabling LSA to compute sentence-level, rather than word-level semantic representations. The resulting model can be evaluated against a wealth of data, including eye-tracking corpora, garden-path experiments, and semantic priming data.

Gradability and Granularity

Supervisor: Ewan Klein

Develop a semantics for gradable predicates which systematically takes into account levels of granularity.

When we use propositions involving spatial proximity, we have to choose an appropriate level of granularity. For example, we evaluate the truth of "X is near to Edinburgh" at a granularity of miles, but "X is near my foot" at a granularity of inches. What kind of knowledge and reasoning does an agent require in order to choose the level of granularity when using gradable predicates such as "near" in different contexts?

Although there are formal theories of granularity, developed in various approaches to representing and reasoning with spatial and temporal categories, it is still an open question how these should be related to natural language semantics. From a cognitive perspective, it has been suggested that granularity is related to the 'approximate number system' (present in both human infants and animals) and operates logarithmically in accordance with Weber's law.

Computer-Aided Translation

Supervisor: Philipp Koehn

Machine translation systems are increasingly used by human translators to aid their work. This raises a number of interesting research problems in making better use of statistical machine translation models, such as interactive machine translation, adapting machine translation models to specific domains and users, confidence measures, mining the statistics of the models for novel types of assistance, etc. Work on this topic ranges from developing novel algorithms for technically well-defined problems to user studies. A student is expected to take on a specific topic in this area, depending on her background.

Syntactic and Semantic Machine Translation Models

Supervisor: Philipp Koehn

The mainstream statistical machine translation models make use of only a restricted set of linguistic annotation. To better model the transformations that occur in translation, more advanced models should make use of morphological analysis, syntactic structure (phrase structure grammar and dependency grammar), semantics and discourse. The goal of a PhD project is to develop a novel linguistically motivated statistical translation model, train such a model, and implement an efficient search algorithm.

Machine Learning Methods for Machine Translation

Supervisor: Philipp Koehn

Statistical machine translation models have a very large set of parameters that have to be optimised for translation performance for specific language pairs, domains, and usages. Learning these parameters on large data sets efficiently with advanced machine learning methods such as max-margin methods is an open challenge.

Automatic Text Illustration

Supervisor: Mirella Lapata

Automating searching and selection of accompanying illustrations for text using techniques from image processing, information retrieval, and natural language processing.

Stories are often accompanied by pictures. For example, children's books, news articles, magazines, and web sites are illustrated with pictures. "Story picturing" is concerned with finding one or more images to complement a document. The task is routinely performed by news writers.

However, choosing a few representative images from a collection of candidate pictures is a challenging task, since it is not possible to search the image database exhaustively. A broad range of computer vision methods have been used to search collections of images based on features computed from the entire image or from image regions, but without taking natural language into account.

In this project we aim to automate the story picturing task using techniques from image processing, information retrieval, and natural language processing. Specifically, we will analyze the documents and the images in order to extract common features that will allow us to rank the images according to the document's content.

Unsupervised Semantic Role Induction

Supervisor: Mirella Lapata

Semantic role labeling (SRL) is the task of automatically classifying the arguments of a predicate with roles such as Agent, Patient or Location. These labels capture aspects of the semantics of sentences and can potentially improve applications such as question answering or summarization. Unfortunately, the reliance on manually annotated datasets, which are both difficult and highly expensive to produce, presents a major obstacle to the widespread application of SRL.  In this project we aim to develop unsupervised methods that induce the semantic roles of verbal arguments directly from unannotated text.  We propose to formalize role induction as a graph partitioning problem and to use Open Information Extraction as a testbed for assessing system performance.

Improving Internet Access for Low-literacy Users with Automatic Text Simplification

Supervisor: Mirella Lapata

The ability to simplify a wide variety of documents, irrespective of size, content, or style, carries much practical import for a wide range of users. For example, it would render the internet more accessible to a broader audience, as lower-literacy readers face severe challenges in reading long and dense documents and in navigating to desired information on individual sites. We propose to formalize simplification as a synchronous grammar learning problem and argue that Wikipedia constitutes a valuable resource for obtaining such a grammar. We will develop a modeling framework that is applicable to languages other than English and will evaluate it in a realistic web page simplification setting.

Automatic Summarization of Online Product Reviews

Supervisor: Mirella Lapata

Bloggers, professional reviewers, and consumers continuously create opinion-rich web reviews about products and services, with the result that textual reviews are now abundant on the web and often convey a useful overall rating. However, an overall rating cannot express the multiple or conflicting opinions that might be contained in the text, and screening the content of a large number of reviews could be a daunting task. For example, a restaurant might receive a great evaluation overall, while the service might be rated below average due to slow and discourteous wait staff. Pinpointing opinions in documents, and the entities being referenced, would provide a finer-grained sentiment analysis and better summarize users' opinions. In addition, selecting salient sentences from the reviews to textually summarize opinions would give consumers useful details that are not expressed by numeric ratings. In this project we aim to create a system that summarizes product reviews using natural language processing techniques.

Modelling Non-Cooperative Conversation

Supervisor: Alex Lascarides

Develop and implement a model of conversation that can handle cases where the agents' goals conflict.

Work on adversarial strategies from game theory and signalling theory lacks sophisticated models of linguistic meaning. Conversely, current models of natural language discourse typically lack models of human action and decision making that deal with situations where the agents' goals conflict. The aim of this project is to fill this gap and in so doing provide a model of implicature in non-cooperative contexts.

This project involves creating and analysing a corpus of human dialogues of users playing the game Settlers of Catan, a well-known adversarial negotiating game. This will be used to drive extensions of an existing state-of-the-art dynamic semantic model of dialogue content with a logically precise model of the agents' mental states and strategies. The project will also involve implementing these ideas in a working dialogue system that extends an existing open source agent that plays Settlers, but that has no linguistic capabilities.

Interpreting Hand Gestures in Face to Face Conversation

Supervisor: Alex Lascarides

Map hand shapes and movements into a representation of their form and meaning.

The technology for mapping an acoustic signal into a sequence of words and for estimating the position of pitch accents is very well established. But estimating which hand movements are communicative and which aren't, and estimating which part of a communicative hand movement is the stroke or post-stroke hold (i.e., those parts of the movement that convey meaning), is much less well understood. Furthermore, to build a semantic representation of the multimodal action, one must, for depicting gestures at least (that is, gestures whose form resembles their meaning), capture qualitative properties of its shape, position and movement. On the other hand, deictic gestures must be represented using quantitative values in 4D Euclidean space. Mapping hand movement to these symbolic and quantitative representations of form is also an unsolved problem.

The aim of this project is to create and exploit a corpus to learn mappings from the communicative multimodal signals to the representation of their form, as required by an existing online grammar of multimodal action, which in turn is designed to yield (underspecified) representations of the meaning of the multimodal action. Unlike the corpora developed so far, which require models for interpreting the video to compute symbolic representations of gestural form, the agents in our task will wear gloves that contain state of the art sensors. These sensors yield detailed mathematical information about the orientation and position of the various parts of the hand, and this information can be exploited as informative features for mapping hand movements to gestural form.

Real-time News Detection in Twitter

Supervisor:  Miles Osborne

Twitter is a great way for people to share information about events happening in the world -- earthquakes, riots, cats in trees, etc. Quickly finding interesting events within this stream could save lives, or perhaps make us rich. Achieving this goal poses hard computational and modelling problems: how can we do this quickly, even though we see more than 750,000 messages a day (most of which are garbage)? And what happens if a story breaks in Arabic? Or the story is actually false? How can we reduce the latency between first seeing a post and announcing it as an actual story?

Finding Hidden Personal Information in Social Media

Supervisor:  Miles Osborne

People reveal a lot of information about themselves online -- for example where they live, their age, their gender and so on. This information is often either implicit (people reveal it) or else found in other sources of data (information about gender may be found in blog posts and this information can be linked to corresponding Twitter identities). This project will look at the general task of automatically constructing profiles for users.

Bayesian Grammar Induction for Statistical Machine Translation

Supervisor:  Miles Osborne

Machine Translation has made great progress over the years (e.g., Google Translate), but there is still room for improvement. This project will look at better ways to learn translation models from parallel data. In particular, Bayesian methods (allowing for interesting priors) will be investigated. Such priors allow us to directly encode information about good translation (for example, that all things being equal, words have few possible translations). Can we construct priors that are even more informative?

Incremental Interpretation for robust NLP using CCG and Dependency Parsing

Supervisor: Mark Steedman

Combinatory Categorial Grammar (CCG) is a computational grammar formalism that has recently been used widely in NLP applications including wide-coverage parsing, generation, and semantic parser induction.  The present project seeks to apply insights from these and other sources including dependency parsing to the problem of incremental word-by-word parsing and interpretation using statistical models.  Possible evaluation tasks include language modeling for automatic speech recognition, as well as standard parsing benchmarks.

Statistical NLP for Programming Languages

Supervisor: Charles Sutton

Find syntactic patterns in corpora of programming language text.

The goal of this project is to apply advanced statistical techniques from natural language processing to a completely different and new textual domain: programming language text. Think about how you program when you are using a new library or new environment for the first time. You "program by search engine", i.e., you search for examples of people who have used the same library, and you copy chunks of code from them. I want to systematize this process, and apply it at a large scale. We have collected a corpus of 1.5 billion lines of source code from 8000 software projects, and we want to find syntactic patterns that recur across projects. These can then be presented to a programmer as she is writing code, providing an autocomplete functionality that can suggest entire function bodies. Statistical techniques involved include language modeling, data mining, and Bayesian nonparametrics. This also raises some deep and interesting questions in software engineering, e.g., why do syntactic patterns occur in professionally written software when they could be refactored away?
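To make the language modeling idea concrete, here is a minimal sketch of an n-gram model over code tokens that suggests likely next tokens. It is far simpler than a system that proposes whole function bodies, and the toy token streams, the trigram order, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def train(token_streams, n=3):
    """Map each (n-1)-token context to a Counter of the tokens that follow it."""
    model = defaultdict(Counter)
    for tokens in token_streams:
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            model[context][tokens[i + n - 1]] += 1
    return model


def suggest(model, context, k=3):
    """Return up to k of the most frequent continuations of the context."""
    followers = model.get(tuple(context), Counter())
    return [token for token, _ in followers.most_common(k)]


# Hypothetical, pre-tokenised snippets standing in for a large code corpus.
corpus = [
    "for i in range ( n ) :".split(),
    "for j in range ( m ) :".split(),
    "for x in items :".split(),
]
model = train(corpus)
print(suggest(model, ["in", "range"]))
```

The context passed to `suggest` must contain exactly n-1 tokens; a real system would back off to shorter contexts and work over syntax trees rather than flat token sequences.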

Discourse in Statistical Machine Translation (SMT): Ellipsis

Supervisor: Bonnie Webber

Target-appropriate translation of verb phrase ellipsis, a common syntactic construction in English that does not occur in other languages.

Statistical Machine Translation should aim to translate entire texts, not simply sentences in isolation. To do this, SMT should be able to project the structuring and cohesion- and coherence-making devices in the source language into similarly appropriate devices in the target language. Unfortunately, the properties of these devices are rarely the same in a pair of languages -- in particular, the locality of the information needed to make a correct translation decision.

For example, consider the cohesion-making English construction called "verb phrase ellipsis". No one such construction occurs in other languages, and tokens of verb phrase ellipsis are translated into a given target language in different ways in different places. What does this variation arise from? Unlike coreferring pronouns, the form of an ellipsed verb phrase in English is independent of properties of its antecedent such as number and gender. Nevertheless, in some cases, this antecedent may be made explicit in the target language translation, and in other cases, some reduced form may be used. What is not known is whether these forms are in free variation, with all of them correct in all cases, or whether only a subset may be appropriate translations in a given case. How to deal with this?

One could carry out this project using either or both of (1) an English corpus already annotated for verb phrase ellipses (VPEs) by Bos and Spenader, along with its close translation into Czech (the PCEDT 2.0); (2) a parsed version of the English EuroParl corpus (created by Callison-Burch) annotated with VPEs using the same techniques as Bos and Spenader, along with the translation of this corpus into another language in the EuroParl corpus. One would use the English text annotation to identify the range of target language forms into which VPEs have been translated.

One would then take the most common forms and decide whether and/or when they are in free variation. The forms in free variation could be used to create additional valuable reference translations, as Kauchak and Barzilay (2006) did with lexical variants. For forms not in free variation, one could identify and extract features in the source and/or target language that could be used to bias translations to the right form.  Other directions are possible for this project as well, depending on student interest.

Discourse in Statistical Machine Translation (SMT): Coherence relations

Supervisor: Bonnie Webber

Target-appropriate translation of explicitly and implicitly indicated coherence relations, both between and within sentences.

Statistical Machine Translation should aim to translate entire texts, not simply sentences in isolation. To do this, SMT should be able to project the structuring and cohesion- and coherence-making devices in the source language into similarly appropriate devices in the target language. Unfortunately, the properties of these devices are rarely the same in a pair of languages -- in particular, the locality of the information needed to make a correct translation decision.

In this regard, explicitly and implicitly indicated coherence relations pose different problems for SMT. Coherence relations flagged by explicit discourse connectives pose a problem because connectives do not have the same ambiguities in all languages. Meyer (2011) is trying to solve this by disambiguating connectives in the source language and then learning a translation model from the disambiguated connectives. The availability of only a single reference translation makes improvements hard to evaluate.

Coherence relations that are not explicitly indicated pose at least two more types of problems for SMT: (1) Alignment problems when a relation indicated explicitly in the source text is rendered implicit in the target, and vice versa; and (2) problems in choosing a correct translation when a relation realised implicitly in the source is rendered explicit in the target.

This project can exploit data on explicit and implicit discourse connectives in the Penn Discourse TreeBank, along with the Prague Czech-English Dependency TreeBank and the EuroParl corpus (including Callison-Burch's parsed version of the English sub-corpus).

Speech Processing Topics

Dynamic Bayesian Networks for Speech Recognition

Supervisor: Simon King

Hidden Markov models (HMMs) are the current model of choice for automatic speech recognition (ASR). Although they are seemingly very simple models, in fact they require a complex system of context-dependent models, parameter sharing and adaptation algorithms to achieve the best performance. HMMs are a member of a wider family of models - dynamic Bayesian networks (DBNs). There is an infinite variety of other DBNs waiting to be tried for ASR. DBNs can be formulated to reflect our understanding of the speech signal; one example of this would be multi-streamed DBNs (such as the factorial HMM) in which the factors have explicit linguistic interpretations - the factors might represent aspects of the speech production process. Dependencies can be introduced between these hidden factors, creating ever richer model structures at the cost of increased computational complexity. The goal is to find model structures that improve recognition accuracy whilst remaining computationally feasible. We have already started exploring various forms of DBN, but there is still a lot of scope for exciting and original research in this area. The tools and compute power are now available to work with models that were intractable until recently. It may be that some of the techniques developed for HMMs can be transferred to DBNs, or we could build things like parameter tying, pronunciation variation, language modelling and adaptation into the model structure itself.
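
As a point of reference for the simplest member of the DBN family, the following is a minimal sketch of the forward algorithm for a discrete HMM. The function name, the toy two-state model and all probabilities are illustrative assumptions, not part of any system described here; richer DBNs replace the single hidden state with several interacting factors.

```python
def forward_likelihood(init, trans, emit, observations):
    """Return P(observations) under a discrete HMM via the forward algorithm."""
    n_states = len(init)
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = [init[s] * emit[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            emit[s][obs] * sum(alpha[r] * trans[r][s] for r in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)


# Toy two-state model with two observation symbols (all numbers are made up).
init = [0.6, 0.4]                   # P(first state)
trans = [[0.7, 0.3], [0.4, 0.6]]    # trans[r][s] = P(next state s | state r)
emit = [[0.9, 0.1], [0.2, 0.8]]     # emit[s][o] = P(symbol o | state s)

print(forward_likelihood(init, trans, emit, [0, 1, 0]))
```

The cost of each time step grows with the square of the number of joint hidden states, which is one way to see why adding dependent hidden factors increases computational complexity.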

Hidden Markov model-based speech synthesis

Supervisor: Simon King

The Trajectory HMM, which is a statistical model that can be used to generate speech, offers an exciting alternative to concatenative methods for synthesising speech.

There are a number of topics we are interested in within HMM-based speech synthesis, including: speaker, language and accent adaptation; cross-language speaker adaptation; improving the signal processing and vocoding aspects of the model; unsupervised and semi-supervised learning.

Personification using affective speech synthesis

Supervisors: Simon King, Matthew Aylett

New approaches to capture, share and manipulate information in sectors such as health care and the creative industries require computers to enter the arena of human social interaction. Users readily adopt a social view of computers and previous research has shown how this can be harnessed in applications such as: health advice, tutoring, or helping children overcome bullying.

However, current speech synthesis technology has not been able to deliver voices which aid personification. The lack of naturalness makes some synthetic voices sound robotic, while the lack of expressiveness makes others sound dull and lifeless.

Rather than aim at 100% dynamic content at the expense of making the voice have less character, we would prefer to aim at 60% dynamic content and retain the features that make a synthetic voice reinforce the sense of embodiment in the agent rather than undermine it. This project investigates two key problems: (1) merging expressive pre-recorded prompts with expressive unit selection speech synthesis; (2) dynamically altering voicing in speech to convey underlying levels of stress and excitement using source filter decomposition techniques.

Cross-lingual acoustic models

Supervisors: Simon King, Steve Renals, Junichi Yamagishi

Adapting speech recognition acoustic models from one language to another, with a focus on limited resources and unsupervised training.

Current speech technology is based on machine learning and trainable statistical models.  These approaches are very powerful, but before a system can be developed for a new language considerable resources are required:  transcribed speech recordings for acoustic model training; large amounts of text for language model training; and a pronunciation dictionary.  Such resources are available for languages such as English, French, Arabic, and Chinese, but there are many less well-resourced languages.  There is thus a need for models that can be adapted from one language to another with limited effort and resources.  To address this we are interested in two (complementary) approaches.  First, the development of lightly supervised and unsupervised training algorithms:  speech recordings are much easier to obtain than transcriptions.  Second, the development of models which can factor language-dependent and language-independent aspects of the speech signal, perhaps exploiting invariances derived from speech production.  We have a particular interest in approaches (1) building on the subspace GMM framework, or (2) using deep neural networks.

Factorised acoustic models

Supervisors: Simon King, Steve Renals, Junichi Yamagishi

Acoustic models which factor specific causes of variability, thus allowing more powerful adaptation for speech recognition, and greater control for speech synthesis.

Adaptation algorithms, such as MLLR, MAP, and VTLN, have been highly effective in acoustic modelling for speech recognition.  However, current approaches only weakly factor the underlying information - for instance "speaker" adaptation will typically adapt for the acoustic environment and the task, as well as for different aspects of the speaker.  It is of great interest to investigate speech recognition models which are able to factor the different sources of variability.  PhD projects in this area will explore the development of factored models that enable specific aspects of a system to be adapted.  For example, it is of great interest - for both speech recognition and speech synthesis - to be able to model accent in a specific way.  We are interested in two modelling approaches which hold great promise for this challenge: subspace Gaussian mixture models, and deep neural networks.

Hidden Speech Production Models

Supervisor: Steve Renals

Building speech analysis and recognition models that respect the constraints of speech production.

This project is concerned with building speech analysis and recognition models that respect the constraints of speech production. Although speech production data may be available when training data is collected (if a specialised recording facility is used) it is not available in the general case. The aim of this project is to use observed articulatory data to construct a hidden space. When unseen acoustic data is presented this hidden space is inferred from the data, and can be used as a constraint for speech recognition. The advantage is that inference of the hidden space of articulation must respect the constraints learned from observed articulatory data.

Language Models for Multiparty Conversations

Supervisor: Steve Renals

Develop statistical models that integrate history information from multiparty conversations whilst remaining computationally feasible.

Large vocabulary speech recognition is largely based on language models in which the probability of the current word is estimated conditional on the previous words spoken (the history). However, in multiparty conversations there is not a single, linear stream of words, and it is not always obvious what the closest history is. This project aims to develop statistical models that integrate history information from both the talker in question and other talkers, while remaining computationally feasible.  There are two approaches we are interested in pursuing for this task:  (1) language models using distributed representations, such as neural network language models and deep belief network language models;  (2) hierarchical Bayesian models, building on non-parametric approaches such as the Hierarchical Dirichlet Process Language Model and the Hierarchical Pitman-Yor Language Model.
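
To make the notion of a non-linear history concrete, here is a minimal sketch, far simpler than either approach listed above: a bigram model that interpolates a prediction from the talker's own previous word with one from the most recent word spoken by any other talker. The class name, the add-one smoothing and the interpolation weight `lam` are illustrative assumptions.

```python
from collections import Counter, defaultdict

START = "<s>"  # placeholder history when no previous word exists


def contexts(turns):
    """Yield (word, own_prev, other_prev) for turns given as (speaker, words)."""
    history = []            # every (speaker, word) seen so far, in order
    last_by_speaker = {}    # each speaker's own most recent word
    for speaker, words in turns:
        for word in words:
            own_prev = last_by_speaker.get(speaker, START)
            # Most recent word spoken by anyone other than the current speaker.
            other_prev = next(
                (w for s, w in reversed(history) if s != speaker), START
            )
            yield word, own_prev, other_prev
            history.append((speaker, word))
            last_by_speaker[speaker] = word


class MultiPartyBigram:
    def __init__(self, lam=0.5):
        self.lam = lam                      # weight on the talker's own history
        self.own = defaultdict(Counter)
        self.other = defaultdict(Counter)
        self.vocab = set()

    def fit(self, turns):
        for word, own_prev, other_prev in contexts(turns):
            self.own[own_prev][word] += 1
            self.other[other_prev][word] += 1
            self.vocab.add(word)
        return self

    def _smoothed(self, table, prev, word):
        counts = table.get(prev, Counter())
        # Add-one smoothing over the training vocabulary.
        return (counts[word] + 1) / (sum(counts.values()) + len(self.vocab))

    def prob(self, word, own_prev, other_prev):
        return (self.lam * self._smoothed(self.own, own_prev, word)
                + (1 - self.lam) * self._smoothed(self.other, other_prev, word))


turns = [
    ("A", ["shall", "we", "start"]),
    ("B", ["yes", "we", "shall"]),
    ("A", ["good"]),
]
model = MultiPartyBigram(lam=0.5).fit(turns)
print(model.prob("shall", own_prev="we", other_prev="start"))
```

Here `turns` is an invented three-turn exchange; a real system would also need to decide how to handle overlapping speech, which this sketch ignores by assuming turns arrive in a strict order.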

Microphone-Array Based Speech Recognition

Supervisor: Steve Renals

Investigate the use of microphone arrays attached to walls or tables, rather than users wearing microphones, for speech recognition to operate.

Rather than requiring users to wear microphones for speech recognition to operate, we are interested in using microphone arrays attached to walls or tables. Microphone array beamforming for localization and enhancement is well studied; using microphone arrays for speech recognition is less so. We are interested in approaches in which the parameters used to combine the signals from the different microphones are treated as parameters of the speech recognition acoustic model, and optimized along with the rest of the system. Interesting research issues arise when talkers are moving, and when multiple talkers overlap.
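
For readers unfamiliar with how array signals are combined, the following is a minimal sketch of delay-and-sum beamforming with integer sample delays. The function name, the toy signal and the fixed delays and weights are illustrative assumptions; in the approaches described above, such combination parameters would be learned jointly with the acoustic model rather than set by hand.

```python
def delay_and_sum(channels, delays, weights=None):
    """Align each channel by its integer sample delay, then take a weighted sum."""
    if weights is None:
        weights = [1.0 / len(channels)] * len(channels)
    # Keep only the samples for which every delayed channel has data.
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(w * ch[t + d] for ch, d, w in zip(channels, delays, weights))
        for t in range(length)
    ]


# Toy example: the same source reaches the second microphone 2 samples later.
source = [0, 1, 0, -1, 0, 1, 0, -1]
mic0 = source
mic1 = [0, 0] + source

enhanced = delay_and_sum([mic0, mic1], delays=[0, 2])
print(enhanced)
```

With the correct delays the two copies of the source add coherently; with a moving talker the delays change over time, which is one of the research issues noted above.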

Distributed linguistic representations for spoken language processing

Supervisor: Steve Renals

Development of language models utilising distributed representations for words.

Until recently, the dominant language models for speech recognition were based strongly on n-grams, in which probability models are built over a vocabulary of words, resulting in very high dimensions (frequently 1 million or more).  In recent years there has been growing interest in models which use distributed representations of words, for example latent semantic analysis language models and neural network language models.  The latter, in particular, have proven to be very attractive.  In this project we plan to explore models in which the distributed representation is automatically learned, enabling words to be embedded in what may be considered a semantic space.  We are particularly interested in investigating approaches based on deep neural networks, hierarchical Bayes, and on ideas from factorised language model approaches such as Model M.  We are interested in applying such language models to speech recognition and other tasks such as topic identification and summarisation.

Detecting and linking events in multiparty conversations

Supervisor:  Steve Renals

In the AMI project we developed an approach called content linking, which uses realtime conversational speech recognition to automatically transcribe a meeting in progress, then constructs search queries from the recently detected words in order to retrieve relevant multimodal and text documents. Such a "search-without-query" approach may be viewed as a way to automatically provide context (in the form of relevant documents and media files) to an ongoing multiparty conversation, without requiring explicit search.  This baseline application opens a number of research challenges such as improving the core models and algorithms to link content, approaches to find the most important "events" in a conversation, to which content should be linked, and ways of automatically constructing relations to link conversational events.
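
The "search-without-query" idea can be illustrated with a minimal sketch: build a query from the most recently recognised words, then rank documents by term overlap. This is not the AMI content linking implementation; the stopword list, function names and toy documents are all illustrative assumptions.

```python
from collections import Counter

# A tiny, hand-picked stopword list (assumption; real systems use larger lists).
STOPWORDS = {"the", "a", "of", "to", "and", "we", "is", "in", "for", "on"}


def build_query(recent_words, max_terms=5):
    """Pick the most frequent non-stopword terms from the recent transcript window."""
    counts = Counter(
        w.lower() for w in recent_words if w.lower() not in STOPWORDS
    )
    return [term for term, _ in counts.most_common(max_terms)]


def rank_documents(query, documents):
    """Rank document ids by how many query terms each document contains."""
    scores = {
        doc_id: sum(term in text.lower().split() for term in query)
        for doc_id, text in documents.items()
    }
    # Drop documents with no matching terms, best matches first.
    return sorted((d for d in scores if scores[d] > 0), key=lambda d: -scores[d])


recent = "we need to review the budget for the remote control design".split()
documents = {
    "minutes": "budget review minutes from last week",
    "spec": "remote control design spec",
    "menu": "lunch menu",
}

query = build_query(recent)
print(query)
print(rank_documents(query, documents))
```

A practical system would replace the overlap count with a proper retrieval model and would have to cope with speech recognition errors in the recent-word window.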

Enhancing spoken language processing with social dynamics

Supervisor: Steve Renals

Identifying social signals in multiparty spoken interaction to enable better recognition and interpretation of what is being communicated.

Spoken conversations are not just streams of words - there are numerous social cues relating to things such as agreement/disagreement, positivity/negativity, and social roles adopted by participants in a conversation, as well as the interaction patterns between people in a multiparty conversation.  There are a number of research challenges ranging from the extraction and identification of social cues from audio or multimodal recordings of interactions, to their use as additional variables in spoken language processing models.  For example, is it possible to use social features of a conversation to enhance language modelling or topic identification?  Can social cues be used as a kind of prior for tasks like speaker diarization?  How can social signals be used to structure a conversation in terms of things like decisions that have been made?  This work would initially focus on the AMI corpus, a 100-hour corpus of multimodal meeting recordings, with many layers of annotation.

Implicit spoken dialogue systems

Supervisor:  Steve Renals

Development of dialogue systems technology that does not require 100% attention from users and learns when to intervene in a conversation.

Current dialogue systems do not interact very naturally - for example, they often demand 100% of the users' attention and attempt to interpret and respond to every utterance.  In the long term we would like to develop spoken language systems which learn how to interact in multiparty conversations more naturally.  This project concerns monitoring the social signals of the speakers, and the social context of the conversation, as well as the content of the conversation, to decide when to interact and what to say.  Such a system could be considered as an "implicit" question answering system - answering queries which come up in a conversation, without being explicitly posed.

Head Motion Synthesis for Lifelike Conversational Agents

Supervisor: Hiroshi Shimodaira

Develop a mechanism for controlling the head movement of photo-realistic lifelike agents when the agents are in various modes: idling, listening, speaking and singing.

Lifelike conversational agents, which behave like humans with facial animation and gesture and hold spoken conversations with humans, are one kind of next-generation human interface. Much effort has been made so far to make the agents natural, especially in controlling mouth/lip movement and eye movement. On the other hand, controlling the movement of the agent's head, including facial expression, has not been studied as much, even though head movement sometimes plays a more important role in naturalness and intelligibility than the mouth and eyes. The purpose of the project is to develop a mechanism for controlling the head movement of photo-realistic lifelike agents when the agents are in various modes: idling, listening, speaking and singing. One of the outstanding features of the project is that it aims to imitate the manner of head motion of an existing person to give the agents virtual personalities with the help of machine learning techniques and methods used for text-to-speech synthesis.

Evaluating the impact of expressive speech synthesis on embodied conversational agents

Supervisors: Hiroshi Shimodaira, Matthew Aylett

Evaluation of embodied conversational agents (ECAs) has tended to concentrate on usability - do users like the system, does the system achieve its objectives? We are not aware of any studies of this type which have controlled for the speech synthesis used in the project where the speech synthesis used was close to the state of the art. This project will develop evaluation based on measured interaction where expressive speech synthesis is the topic of study. It will explore carefully controlled interactive environments, and measure a subject's performance as well as measuring the physiological effects of the experiment on the subject. It will explore how involved (emotionally or otherwise) our subject is with the ECA. The usability approach described above is also important for these experiments but we also wish to determine how speech is affecting involvement over time. For example, we would expect to increase a subject's arousal by adding emotional elements to the speech, and we might expect to destroy a subject's involvement by intentionally producing an error which undermines the ECA's believability.
