Computing Degree Show 2018

Automatic Date-Labelling of Parliamentary Proceedings using Natural Language Processing

Information extraction is only as effective as the data it draws on. As the volume of data online and on other electronic media continues to grow, so does the need for tools that can validate or identify when a piece of data was created. This investigation centred on the rates at which linguistic features occur in texts; those rates provided contextual evidence of when a text was written and ultimately supported the goal of labelling texts with their creation dates.

The domain of focus is the Hansard Archive, the digitised record of the United Kingdom’s parliamentary proceedings. Written in Python with the scikit-learn, imbalanced-learn, pandas and Natural Language Toolkit (NLTK) packages, this project sought to provide a method of classifying structured textual data from a period spanning more than 200 years.
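As a rough illustration of how date-labelling can be framed as text classification, the sketch below trains a small scikit-learn pipeline to assign a period label to a passage based on word frequencies. It is not the project's actual implementation: the example sentences, the period labels ("1800s", "2000s"), and the choice of TF-IDF features with a naive Bayes classifier are all assumptions made for illustration, and the sketch leaves out imbalanced-learn, pandas and NLTK.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data: invented sentences, NOT real Hansard text.
# Each passage is paired with a hypothetical period label.
train_texts = [
    "the honourable gentleman spoke of the corn laws and the railway",
    "steam locomotives and the telegraph were debated at length",
    "the minister discussed broadband internet and mobile phones",
    "members raised concerns about email security and the internet",
]
train_labels = ["1800s", "1800s", "2000s", "2000s"]

# TF-IDF turns each passage into weighted word-occurrence features;
# multinomial naive Bayes then learns which words are typical of each period.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Unseen passages (also invented) to label with a period.
new_texts = [
    "a question on the telegraph and the railway",
    "funding for internet and mobile networks",
]
predicted = model.predict(new_texts)

for text, label in zip(new_texts, predicted):
    print(f"{label}: {text}")
```

On a real archive the classes would likely be unevenly sized, since some periods have far more recorded debate than others. That is the kind of problem a resampling library such as imbalanced-learn is designed to address.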

Daniel Brereton