flowchart LR
    A[Ethics] --> B[Curation]
    B --> C[Analysis]
    C --> D[Presentation]
Data as a Science
Preface
Data has become the most important language of our era, informing everything from intelligence in automated machines to predictive analytics in medical diagnostics. The plunging cost and easy accessibility of the raw requirements for such systems – data, software, distributed computing, and sensors – are driving the adoption and growth of data-driven decision-making.
As it becomes ever easier to collect data about individuals, a diverse range of professionals who have never been trained for such requirements grapple with inadequate analytic and data management skills, as well as the ethical risks arising from personal data possession and opaque algorithmic tools.
The key to unlocking data reuse, and new economic and social development opportunities from these data, lies in both data producers and data users having the technical insight necessary to manage those who work with data, and a conscious and motivated understanding of the new algorithmic tools available to us.
A data scientist is a researcher who answers a research question using data, and can support the development of a research process. They may design the methods to acquire primary or secondary sources of data that inform the research process, monitor and ensure ethical responsibilities, curate the research data and results, or communicate the process and results to stakeholders.
This course is aimed at the stakeholders of these data scientists: the people who are expected to rely on these outputs to implement some response. It is their responsibility to interrogate these data, and know how to do so in a constructive and material way. This can be challenging. Outputs from a data-driven or algorithmic research process tend to have the illusion of truthiness: well-formatted data tables, complex visualisations, or generative analysis.
There are two objectives for this syllabus:
- Ensure students have a comprehensive grasp of a data-driven research process. Data as a Science guides learners to confidence in the ethics, curation, analysis, and presentation of data, integrating each of these topics into each lesson.
- Support the growing desire of universities around the world, but especially in emerging-market countries, to offer Data Science degree courses, by providing a free, openly-licenced core curriculum for adoption and adaptation by their degree programs.
Pedagogy
This course is inspired by the Sloyd model of technical training. Each lesson is a discrete mini-research exercise, integrating all the functional techniques of research, building on the techniques of the previous lesson, and providing a functional and holistic understanding of the scientific method as it applies to data.
We move from strength to strength. Each lesson should feel only slightly more challenging than the one before, permitting you to gain in experience and skill without needing to absorb everything at once. Education is not about learning an algorithm and applying it to abstract, arbitrary data. Algorithms or statistical techniques will not be taught separately from the research process.
Each lesson starts with a research question, and progresses by teaching a complete and practical set of skills, allowing you to learn at a pace which suits your current understanding. Case-studies and tutorials are drawn from public health, economics and social issues, and the course is accessible to anyone with an interest in data.
Research is not a set of discrete topics. Issues of analysis and presentation will affect choices made for curation. Just when you think your data are neat and tidy, you’ll discover an error that you need to correct, necessitating redoing your analysis. Ethical dilemmas will appear during analysis that require ad hoc solutions.
On a small project, one person may have to grapple with all these issues alone. Where a project spans multiple countries, or tackles a vast research question, teams will need to communicate and collaborate to ensure there is a complete custodial chain validating all decisions.
Lesson structure and approach
Each lesson is guided by the following four topics:
- Ethics: The social and behavioural challenges a research question, and data collection process, pose to answering that question.
- Curation: The requirements for data collection and management arising from a research question.
- Analysis: The techniques required to explore and analyse data collected to answer a research question.
- Presentation: The methods and approaches to present the results of analysis so as to answer a question and promote a response.
This course can be taught with or without a requirement for software development skills. An introductory course in Python is available.
Case-studies: review and replicate
Science is a set of defined methods that stands up to scrutiny, supports replication, and is supported by ethical measurement data acquired during the study process. The way to gain confidence in these methods is to review the work of others.
Each lesson will guide you through review of published scholarly work in the following ways:
- Review: apply learned techniques to open access published research, and review and reflect on the methodology, analysis and results presented.
- Replication: using source or synthetic data, reproduce the methodology used in open access published research to test whether the claimed analysis and results are replicable.
Lessons on synthetic data will include dependent randomisation, as well as agent-based modelling.
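As a rough illustration of what dependent randomisation can look like, the sketch below generates synthetic records in which one field is drawn at random conditional on another. This is not taken from the course lessons: it assumes "dependent randomisation" means sampling one variable with probabilities that depend on a previously sampled variable. The field names, the `synthetic_records` function, and the rates in `SMOKER_RATE_BY_AGE_GROUP` are all hypothetical and invented for the example.

```python
import random

# Hypothetical conditional probabilities, invented for illustration only.
# They are not drawn from any real study.
SMOKER_RATE_BY_AGE_GROUP = {"18-34": 0.25, "35-54": 0.20, "55+": 0.10}


def synthetic_records(n, seed=0):
    """Generate n synthetic records where 'smoker' depends on 'age_group'."""
    # A seeded generator means anyone replicating the work gets the same data.
    rng = random.Random(seed)
    age_groups = list(SMOKER_RATE_BY_AGE_GROUP)
    records = []
    for _ in range(n):
        # First draw: choose an age group uniformly at random.
        age_group = rng.choice(age_groups)
        # Second draw: the probability of True depends on the first draw.
        smoker = rng.random() < SMOKER_RATE_BY_AGE_GROUP[age_group]
        records.append({"age_group": age_group, "smoker": smoker})
    return records


records = synthetic_records(1000)
```

Because the second draw depends on the first, the relationship between the two fields is preserved in the synthetic data, which is what makes it usable for testing whether a published method can be reproduced. Fixing the seed is a design choice that favours replicability over variety.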
On completion of each lesson, students gain useful and meaningful skills, and are not left stranded. This means that even partial completion of the material permits students to be productive members of a research team.
Whois
My name is Gavin Chait, and I am an independent data scientist specialising in economic development and data curation. I spent more than a decade in economic and development initiatives in South Africa. I have led the implementation of open data technical and research projects around the world.
I have extensive experience in leading research projects, implementing open source software initiatives, and developing and leading seminars and workshops. I have taught for 30 years, including for undergraduates, adult education, and technical and analytical teaching at all levels.
This pedagogy and syllabus structure was developed with support from the Gates Foundation and WHO. Initial research into the need for education capacity building arose as a result of research supported by the Hewlett Foundation, Wellcome Trust and Public Health Research Data Forum.
Citation
Chait, Gavin (2020): Data as a Science. Whythawk. https://doi.org/10.5281/zenodo.4194973
And as a BibTeX entry:
@book{chait_data_2020,
title = {Data as a {Science}},
copyright = {Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International and the GNU Affero General Public License},
publisher = {Whythawk},
author = {Chait, Gavin},
year = {2020},
doi = {10.5281/zenodo.4194973},
url = {https://doi.org/10.5281/zenodo.4194973}
}
Licensing and release
Course content, materials and approach are copyright Gavin Chait, and released under both the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International and the GNU Affero General Public License licences.
The objective is to ensure reuse, and to ensure that any modifications or adaptations of the source material are released under an equivalent licence.