Developing a new subject from the ground up is often demanding. Developing a new program is even more of a challenge. However, developing a new program for a discipline that is itself still emerging is a whole new ball game.
Curriculum development in the emerging fields of data science and machine learning requires the coordinated efforts of a range of specialist teachers in disciplines that span mathematics, statistics, computer science, information system architecture, design and visualization. But the rapid pace of curriculum development in this domain has meant that curricula across universities have developed in line with the internal disciplinary strengths of each institution rather than in response to the actual needs of industry. The result is a set of curricula that are biased, usually unintentionally, towards existing disciplinary capability.
To uncover sources of bias in curricula across universities we propose an approach that uses novel text mining techniques. We apply the analysis to undergraduate and postgraduate data science degrees to quantify patterns across the range of curricula on offer.
What is text mining?
Text mining is an algorithmic process of extracting meaningful information from text in an effort to uncover linkages between documents. Analyses of large corpora are increasingly being used to enhance ontology development in pharmaceuticals, medicine and intelligent tutoring systems.
We collected the curricula from 320 university-level data science programs through web crawling and constructed a database of terms (known as a corpus) from which to derive linkages. We then used an iterative term refinement process to separate meaningful terminology from general narrative. The text mining techniques used in our analysis range from simple assessments of word frequency and associations between given terms to more complex analyses that include term frequency-inverse document frequency (TF-IDF) assessments and hierarchical clustering. Inductive analyses of the resulting data uncovered concepts and relationships that are not easily discerned through manual review.
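As a rough illustration of the term frequency-inverse document frequency step, the following minimal sketch weights terms across a handful of curriculum summaries and compares the summaries by cosine similarity. The summaries, the function names and the particular weighting variant (raw term frequency multiplied by the natural log of the inverse document frequency) are assumptions made for illustration; they are not taken from the study's corpus or tooling.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = ln(number of documents / number of documents containing the term)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical curriculum summaries, invented for illustration only.
summaries = [
    "data analysis with regression and statistics",
    "data visualization and design with dashboards",
    "data regression statistics and inference",
]
weights = tf_idf(summaries)
```

In this toy corpus the word "data" appears in every summary, so its weight falls to zero, which is how the weighting separates generic vocabulary from distinguishing terminology. Pairwise similarities computed with cosine_similarity could then serve as input to a hierarchical clustering step.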
Our study found that the fundamental structure of analysis across curricula is common while the tools and platforms used to conduct the analysis are unique to each curriculum. The clusters differentiate between the general reference to ‘data’ and the application of analysis to data.
While bivariate datasets are relatively easy to represent in cluster formats, more complex datasets whose objects are related by more than two variables have significant inter-object dissimilarities. To circumvent this limitation we used a cluster plot display to represent objects as bivariate points. Each cluster is drawn as an ellipse whose dimensions are based on the mean and covariance matrix of the cluster and on its size.
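To make the ellipse construction concrete, the sketch below derives an ellipse from the mean and sample covariance matrix of a set of bivariate points. It is one plausible reading of the description above, not the plotting routine used in the study: the function name, the sample points and the scale parameter (a hypothetical multiplier on the axis lengths) are all assumptions.

```python
import math


def cluster_ellipse(points, scale=2.0):
    """Summarise a cluster of (x, y) points as an ellipse.

    The centre is the cluster mean; the axes follow the eigenvectors of the
    sample covariance matrix, with lengths proportional to the square roots
    of its eigenvalues. `scale` is a hypothetical multiplier.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to estimate covariance")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (sxx + syy) / 2.0
    spread = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    lam_major = half_trace + spread
    lam_minor = max(half_trace - spread, 0.0)  # guard against rounding below zero
    # Angle of the major axis from the x-axis, in radians.
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return {
        "centre": (mean_x, mean_y),
        "semi_major": scale * math.sqrt(lam_major),
        "semi_minor": scale * math.sqrt(lam_minor),
        "angle": angle,
    }


# Hypothetical bivariate points for one cluster, invented for illustration.
cluster = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2), (5.0, 4.8)]
ellipse = cluster_ellipse(cluster)
```

A tightly correlated cluster such as this one yields a long, thin ellipse tilted along the direction of greatest variance, while a diffuse cluster yields a rounder one.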
The clusters reveal that certain elements of curricula can be arranged into a meaningful taxonomy that groups homogeneous elements of the curriculum summaries. This shows that some curriculum elements (such as analytical techniques) are relatively homogeneous across universities and that curricula among institutions appear to largely mimic each other. Whether this is intentional or accidental is unknown, but similarity in approaches across universities for an emerging field indicates that institutions may merely be copying each other rather than focusing on the actual needs of students and employers.
While natural language processing offers an objective perspective on curriculum breadth and depth, its limitations mean that it is unlikely to fully replace the engagement of experts to interpret and consolidate their thoughts on curriculum design. In this context, current text mining capabilities would be better used to validate the human interpretative approach than to adjudicate curriculum quality on their own. Text mining is a useful tool, not a replacement engine.
This approach is not limited to the domain of data science but can be applied to any curriculum area, particularly those with an interdisciplinary flavour. A basic web-based keyword analysis tool using the above approach has been made available at www.pluridisciplinary.com.au for course developers to use.
We acknowledge the support of the Australian Office of Learning and Teaching through OLT grant FS15-0252.