Sign Language Recognition: Generalising to More Complex Corpora
The aim of this thesis is to find new approaches to Sign Language Recognition (SLR) which are suited to working with the limited corpora currently available. Data available for SLR is of limited quality; low resolution and low frame rates make the task of recognition even more complex. The content is rarely natural, concentrating on isolated signs filmed under laboratory conditions. In addition, the amount of accurately labelled data is minimal. To this end, several contributions are made. Tracking the hands is eschewed in favour of detection-based techniques that are more robust to noise. Classifiers for both signs and linguistically motivated sign sub-units are investigated, to make best use of limited data sets. Finally, an algorithm is proposed to learn signs from the inset signers on TV, with the aid of the accompanying subtitles, thus increasing the corpus of data available.
Tracking fast-moving hands under laboratory conditions is a complex task; move this to real-world data and the challenge is even greater. When tracked data is used as a base for SLR, the errors in the tracking are compounded at the classification stage. Proposed instead is a novel sign detection method, which views space-time as a 3D volume and the sign within it as an object to be located. Features are combined into strong classifiers using a novel boosting implementation designed to create optimal classifiers over sparse data sets. Using boosted volumetric features on a robust frame-differenced input, average classification rates reach 71% on seen signers and 66% on a mixture of seen and unseen signers, with individual sign classification rates reaching 95%.
A classifier-per-sign approach to SLR means that data sets need to contain numerous examples of the signs to be learnt. Instead, this thesis proposes learnt classifiers to detect the common sub-units of sign. The responses of these classifiers can then be combined for recognition at the sign level. This approach requires fewer examples per sign to be learnt, since the sub-unit detectors are trained on data from multiple signs. It is also faster at detection time since there are fewer classifiers to consult, their number being limited by the linguistics of sign rather than by the number of signs being detected. For this method, appearance-based boosted classifiers are introduced to distinguish the sub-units of sign. Results show that, when combined with temporal models, these novel sub-unit classifiers can outperform similar classifiers learnt on tracked results. As an added benefit, since the sub-units are linguistically derived, they can be used independently to help linguistic annotators.
Since sign language data sets are costly to collect and annotate, few are publicly available. Those that are tend to be constrained in content and are often recorded under laboratory conditions. However, in the UK, the British Broadcasting Corporation (BBC) regularly produces programmes with an inset signer and corresponding subtitles. This provides a natural signer, covering a wide range of topics, in real-world conditions. While this data has no ground truth, it is proposed that the translated subtitles can provide weak labels for learning signs. The final contributions of this thesis lead to an innovative approach to learning signs from these co-occurring streams of data. Using a unique, temporally constrained version of the Apriori mining algorithm, similar sections of video are identified as possible sign locations. These estimates are improved upon by introducing the concept of contextual negatives, removing contextually similar noise. Combined with an iterative honing process to enhance the localisation of the target sign, 23 word/sign combinations are learnt from a 30-minute news broadcast, providing a novel method for automatic data set creation.
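For readers unfamiliar with Apriori mining, the following is a minimal sketch of the standard, unconstrained algorithm only. It is not the temporally constrained variant developed in the thesis, and it omits contextual negatives and the honing step. It assumes, purely for illustration, that each video window whose subtitles contain the target word has already been reduced to a set of discrete feature labels; the names `windows`, `f1` to `f4` and `min_support` are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset that occurs
    in at least `min_support` transactions (plain Apriori)."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count how many transactions contain each single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join frequent (k-1)-itemsets into size-k candidates and prune any
        # candidate that has an infrequent (k-1)-subset (the Apriori property).
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in frequent for sub in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Count the support of each surviving candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical feature sets for four subtitle-aligned video windows.
windows = [
    {"f1", "f2", "f3"},
    {"f1", "f2"},
    {"f1", "f3"},
    {"f2", "f4"},
]
found = apriori(windows, min_support=2)
```

In this toy setting, feature combinations that recur across windows sharing a subtitle word are the candidates for the sign's location; the temporal constraint described above would further restrict which features may be grouped together.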