flowchart LR
  A[Ethics] --> B[Curation]
  B --> C[Analysis]
  C --> D[Presentation]
1 Answering questions with data
Data has become the most important language of our era, informing everything from intelligence in automated machines, to predictive analytics in medical diagnostics. The plunging cost and easy accessibility of the raw requirements for such systems – data, software, distributed computing, and sensors – are driving the adoption and growth of data-driven decision-making.
The key to unlocking data reuse, and to the new economic and social development opportunities these data offer, relies on data producers and data users having the technical insight necessary to manage those who work with data, and a conscious, motivated understanding of the new algorithmic tools available to us.
When researchers look to answer questions, they begin an investigative process:
Ordinarily, when teaching data science, everyone - from teachers to students - prefers to focus on the last three points because analysis and presentation are more fun and require less frustration with messy data or ethical dilemmas. For those who work with source data, or information collected under often uncertain circumstances, the bulk of their time is taken up with the first five points.
The objective of this syllabus is to ensure you have a comprehensive grasp of a data-driven research process. Data as a Science guides learners to confidence in the ethics, curation, analysis, and presentation of data, integrating each of these topics into each lesson:
- Ethics: The social and behavioural challenges a research question, and data collection process, pose to answering that question.
- Curation: The requirements for data collection and management arising from a research question.
- Analysis: The techniques required to explore and analyse data collected to answer a research question.
- Presentation: The methods and approaches to present the results of analysis so as to answer a question and promote a response.
Our first steps are to prepare our work surface and lay out the tools - both philosophical and technical - that we will use.
We start with our research question for this lesson:
The war in Yemen has caused tremendous hardship and suffering to that region’s civilian population. Aid- and civil-rights organisations are concerned that cholera is spreading. International health response organisations want to assess the scale of the problem. Using information published by a credible international research organisation, map reported incidents of cholera over time and assess whether there is a growing regional cholera epidemic.
1.1 Ethics: a foundation for reasoning
Identify concepts in ethical reasoning which may influence our analysis and results from data.
Polygal was a gel made from beet and apple pectin. Administered to a severely wounded patient, it was supposed to reduce bleeding. To test this hypothesis, Sigmund Rascher administered a tablet to human subjects who were then shot or - without anaesthesia - had their limbs amputated.
During the Second World War, and under the direction of senior Nazi officers, medical experiments of quite unusual violence were conducted on prisoners of war and civilians regarded by the Nazi regime as sub-human. After the war, twenty medical doctors were tried for war crimes and crimes against humanity at the Doctors’ Trial held in Nuremberg from 1946 to 1947.
Throughout the trial, expert witnesses, prosecutors and judges proposed key tests as part of a universal code which could be used to assess whether a medical experiment could be regarded as ethical. This became known as the Nuremberg Code. Key to that was the very first test:
The voluntary consent of the human subject is absolutely essential.
This means that the person involved should have legal capacity to give consent; should be so situated as to be able to exercise free power of choice, without the intervention of any element of force, fraud, deceit, duress, over-reaching, or other ulterior form of constraint or coercion; and should have sufficient knowledge and comprehension of the elements of the subject matter involved, as to enable him to make an understanding and enlightened decision. This latter element requires that, before the acceptance of an affirmative decision by the experimental subject, there should be made known to him the nature, duration, and purpose of the experiment; the method and means by which it is to be conducted; all inconveniences and hazards reasonably to be expected; and the effects upon his health or person, which may possibly come from his participation in the experiment.
The duty and responsibility for ascertaining the quality of the consent rests upon each individual who initiates, directs or engages in the experiment. It is a personal duty and responsibility which may not be delegated to another with impunity.
It would be decades before such ethical principles would be more generally adopted into research practice, and only after great human cost.
The Tuskegee syphilis experiment is regarded as “arguably the most infamous biomedical research study in US history” and led to significantly greater fear by minority Americans of participating in medical research (Katz et al. 2006).
Between 1932 and 1972, the US Public Health Service conducted a clinical study on 600 poor, semi-literate African-American farm-workers from Alabama. The experiment was not designed to cure, but to observe the full progression of untreated syphilis. And, while the experiment began before the development of antibiotics, even after penicillin became the standard treatment in 1947, subjects were deliberately denied treatment. Researchers even went out of their way to prevent those subjects drafted to serve in the military in the Second World War from receiving treatment.
It wasn’t until 1978 that the US Belmont Report was issued which offered specific ethical principles when conducting medical experiments, and only in 1991 was a universal set of rules adopted across the US.
Despite these hard-won lessons, there remain numerous investors and researchers who see ethical considerations as placing a burden on their objectives. In 2017, for example, citing frustration with the US Food and Drug Administration Institutional Review Board, Rational Vaccines began human-subject testing of their controversial herpes vaccine on the Caribbean island of St Kitts, deliberately placing themselves outside of US jurisdiction.
Far from being settled, the philosophical concerns and controversies raised by ethics are all around us (Baggini and Fosl 2007).
1.1.1 The context of right and wrong
Computers cannot make decisions. Their output is an absolute function of the data provided as input, and the algorithms applied to analyse that input. The aid of computers in decision-making does not override human responsibility and accountability.
Both data and algorithms should be expected to stand up to scrutiny, so as to justify any and all decisions made as a result of their output. “Computer says no” is not an unquestionable statement.
In any community, people interact with their environment and each other to produce their livelihoods. The process by which they acquire and accumulate things, and how their interactions affect each other and their environment, lead to asymmetrical changes (Bowles, Carlin, and Stevens 2017). Over time, and over multiple generations, these changes accumulate in different ways to different people leading to highly diverse and complex societies.
Economists study these changes and attempt to understand the causes and effects which underpin them, as well as the mechanisms by which economic change may be realised. The question of what sorts of change are good or bad - especially where they give rise to incredible disparities in wealth and opportunity - is one which data and analysis can do little to answer.
Facts can tell you what has happened, not what we should do about it. Yet we collect data and perform analysis with the aim of furthering a particular objective.
That objective may have unforeseen consequences.
For example, combustion engines which burn fossil fuels have revolutionised transport and manufacturing, and produced incredible wealth, yet a consequence has been a catastrophic change to our environment through immediate effects, like pollution, and long-term effects, like climate change.
The process by which we examine and explain why we consider particular human conduct to be right or wrong belongs to the study of ethics (Folse 2005).
1.1.2 The weight of intent and what we choose to know
Our actions - as data scientists - are intended to persuade people to act or think other than the way they currently do based on nothing more than the strength of our analysis, and informed by data.
Whereas everything else we do - data curation, analysis and presentation - can describe human behaviour as it is, ethics provides a theoretical framework to describe the world as it should, or should not be. It gives us full range to describe an ideal outcome, and to consider all that we know and do not know which may impede or confound our desired result:
|  |  |
|---|---|
| unknown knowns | known knowns |
| unknown unknowns | known unknowns |
Each of these aspects can be dangerous, but what we insist on not knowing - the unknown-knowns; that which we do not like to know or do not want to know - constitute our inherent bias and can compromise any analysis irretrievably.
When we consider ethical outcomes, we use the terms good or bad to describe judgements about people or things, and we use right or wrong to refer to the outcome of specific actions. Understand, though, that - while right and wrong may sometimes be obvious - we are often stuck in ethical dilemmas.
How we consider whether an action is right or wrong comes down to the tension between what was intended by an action and what its consequences were. Are only intentions important? Or should we only consider outcomes? And how absolutely do you want to judge this chain: a good motivation, leading to a good intention, performing a good action, resulting in only the right consequences? How do we evaluate this against what it may be impossible to know at the time, even if that information will become available after a decision is made?
We also need to consider competing interests in right and wrong outcomes. A positive outcome for the individual making the decision may have a negative consequence for numerous others. Conversely, an altruistic person may act only for the benefit of others even to their own detriment.
Ethical problems do not always require a call to facts to justify a particular decision, but they do have a number of characteristics:
- Public: the process by which we arrive at an ethical choice is known to all participants.
- Informal: the process cannot always be codified into law like a legal system.
- Rational: despite the informality, the logic used must be accessible and defensible.
- Impartial: any decision must not favour any group or person.
Rather than imposing a specific set of rules to be obeyed, ethics provides a framework in which we may consider whether what we are setting out to achieve conforms to our values, and whether the process by which we arrive at our decision can be validated and inspected by others.
No matter how sophisticated our automated machines become, unless our intention is to construct a society “of machines, for machines”, people will always be needed to decide on what ethical considerations must be taken into account.
1.2 Curation: keeping track of what we know
Understand the process of data curation, and the custodial duty of data science. Learn some essential tools used by data scientists.
Ensuring we know what we know would seem to be the simplest of our ethical responsibilities.
As a project grows in complexity, and as the number of people involved gets larger, sharing what we know in a format accessible to all gets more complex. Even if everyone starts out the same, the diversity of requirements across any project will lead individuals to specialise. Unique grammars may be used to describe the data each collects, and these grammars will become opaque to others.
Before we even begin, we need a grammar to describe and organise the data we collect, and we need a mechanism to ensure that both the grammar and data are accessible to anyone who requires them.
1.2.1 Data about data
A grammar which describes our source data is known as metadata. A specific, organisational, grammar is known as a schema.
Consider this online textbook. It has a title, author and URL. But - hidden amidst all your other resources - how will you know what this book is about? That’s where a solid metadata system comes in; it improves your ability to understand the aboutness of an object.
There are numerous metadata systems, known as schemas, available, from the general, to the specialised.
As an introduction, let’s consider citations.
If you’ve spent any time in research, you’ll know the importance of citing your sources. You’ll also know the wide range of ways in which you could present your references. BibTeX is a standardised structure to format lists of references. It separates out the specific bibliographic information according to a defined grammar:
@book{DataScience,
title = "Data as a Science",
author = "Gavin Chait",
year = "2018",
publisher = "Whythawk.com",
institution = "Whythawk",
url = "https://data-as-a-science.com/"
}

The term @book is an entry type which describes the type of information that the entry contains. You can have a range of such types: article, book, manual, techreport, and so on.
Each entry contains a number of fields. Some are compulsory, depending on the type, and some are optional. A book must contain an author, title, publisher and year. An article requires an author, title, journal, year and volume.
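For comparison, a minimal article entry might look like this (the reference itself is invented purely to illustrate the required fields):

```bibtex
@article{ExampleArticle,
  title   = "An invented article title, for illustration only",
  author  = "A. N. Author",
  journal = "Journal of Examples",
  year    = "2018",
  volume  = "12"
}
```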
The term DataScience must be unique and references this book object. You can store a list of such objects in a standard text file with the extension .bib instead of .txt. Quarto, the software used to write this textbook, automatically looks up citations and places them in the References chapter.
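To trigger a citation, you write the key into your text using Pandoc’s citation syntax - for example, [@DataScience] - and Quarto replaces it with a formatted citation when the document is rendered.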
One of the more ubiquitous general metadata systems used to describe web resources is Dublin Core.
Some of the terms will be immediately obvious (title, subject, publisher, language, description, date, creator) while others will be unclear (isFormatOf, isReferencedBy, isReplacedBy). The purpose of the schema is not only to describe the referenced object, but also to connect it to other datasets and describe each object’s relationship to other objects. This is the complete Dublin Core reference.
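As a minimal sketch, here is how a Dublin Core record for this textbook might look, expressed as a simple Python dictionary (the term names are Dublin Core’s; the values are illustrative assumptions):

```python
# A Dublin Core-style description of this textbook.
# Term names come from the Dublin Core schema; values are illustrative.
dublin_core = {
    "title": "Data as a Science",
    "creator": "Gavin Chait",
    "publisher": "Whythawk.com",
    "date": "2018",
    "language": "en",
    "identifier": "https://data-as-a-science.com/",
    "description": "An online textbook introducing a data-driven research process.",
}
```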
While it does happen that data-series end, it is unusual for a data-series to be ended because it has been replaced by a visualisation of that data. An image does not permit us to explore and investigate for ourselves. While these data meet the standards we require for research sources, the lack of continuity is certainly problematic.
This research question and case study was originally prepared while the data was still being updated. Given that there is no way to know when you will be studying this lesson, for the sake of our question, let us assume that you are performing this analysis in February 2018, prior to the ending of this series.
1.2.2 Tools for data curation, analysis and presentation
One thing we haven’t done yet is look at the data itself. Ordinarily, this is where a data science course would dive into programming. This is, however, a course for non-data scientists. Your job is to read, reflect and synthesise recommendations.
This course may use code, or present data, but your job is to focus on your core task.
1.3 Analysis: exploring data
Investigate and manipulate data to learn its metadata, shape and robustness.
There are a number of tools used by data scientists to understand and analyse data. We’ll get to those, but one of the fundamentals is simply exploring a new dataset.
1.3.1 Initial exploration
Usually, in data courses, you’re presented with a nice, clean dataset, run some algorithms on it, and get some answers. That isn’t helpful to you. Except for data you collect yourself, you’re unlikely to know the shape and contents of a dataset you import from others, no matter how good their research.
Especially for large datasets, it can be difficult to know how many unique terms you may be working with and how they relate to each other - something we will check concretely below.
In Section 1.2.1 on data curation, we reviewed the Yemen: Cholera Outbreak Epidemiology Update Data. We could view this in a spreadsheet, like Excel, but we’re going to use Pandas, a Python library that makes working with large structured data simple. You can open these code blocks if you want, but we are focused on the outputs.
Code
# Comments to code are not executed and are flagged with this '#' symbol.
# First we'll import the pandas library.
# We use 'as' so that we can reference it as 'pd', which is shorter to type.
import pandas as pd
# In Python, we can declare our variables by simply naming them, like below
# 'data_url' is a variable name and the url is the text reference we're assigning to it
data_url = "data/lesson-1/Yemen Cholera Outbreak Epidemiology Data - Data_Governorate_Level.csv"
# We import our data as a 'dataframe' using this simple instruction.
# How did I know it was a CSV (Comma Separated Value) file? If you look at the end of the
# file path (above), you'll see it ends in '.csv'. A variable in Python can be anything.
# Here our variable is a Pandas dataframe type.
data = pd.read_csv(data_url)
# Let's see what that looks like (I limit the number of rows printed by using 'head(10)',
# and Python is '0' indexed, meaning the first item starts at '0'):
data.head(10)

| | Date | Governorate | Cases | Deaths | CFR (%) | Attack Rate (per 1000) | COD Gov English | COD Gov Arabic | COD Gov Pcode |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-02-18 | Amran | 103965 | 176 | 0.17 | 89.582 | Amran | عمران | 29.0 |
| 1 | 2018-02-18 | Al Mahwit | 62887 | 151 | 0.24 | 86.122 | Al Mahwit | المحويت | 27.0 |
| 2 | 2018-02-18 | Al Dhale'e | 47136 | 81 | 0.17 | 64.438 | Al Dhale'e | الضالع | 30.0 |
| 3 | 2018-02-18 | Hajjah | 121287 | 422 | 0.35 | 52.060 | Hajjah | حجة | 17.0 |
| 4 | 2018-02-18 | Sana'a | 76250 | 123 | 0.16 | 51.859 | Sana'a | صنعاء | 23.0 |
| 5 | 2018-02-18 | Dhamar | 103214 | 161 | 0.16 | 51.292 | Dhamar | ذمار | 20.0 |
| 6 | 2018-02-18 | Abyan | 28243 | 35 | 0.12 | 49.477 | Abyan | أبين | 12.0 |
| 7 | 2018-02-18 | Al Hudaydah | 155908 | 282 | 0.18 | 48.147 | Al Hudaydah | الحديدة | 18.0 |
| 8 | 2018-02-18 | Al Bayda | 30568 | 36 | 0.12 | 40.253 | Al Bayda | البيضاء | 14.0 |
| 9 | 2018-02-18 | Amanat Al Asimah | 103184 | 71 | 0.07 | 36.489 | Amanat Al Asimah | أمانة العاصمة | 13.0 |
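Before we go on, recall the difficulty of knowing how many unique terms we are working with. A quick check - a minimal sketch using the dataframe we have just loaded; exact counts will depend on the version of the data you download - lists the distinct governorate names:

```python
# Count and list the distinct governorate names in the imported dataframe.
# Inconsistent spellings (e.g. "Al Hudaydah" vs "Al-Hudaydah") each count
# separately - one reason data curation matters.
print(data["Governorate"].nunique())
print(sorted(data["Governorate"].unique()))
```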
Definitions describing the overall dataset are called descriptive metadata. Now we need information about the data within each dataset. That is called structural metadata: a grammar describing the structure and definitions of the data in a table.
Sometimes the data you’re working with has no further information and you need to experiment with similar data to assess what the terms mean, or what units are being used, or to gap-fill missing data. Sometimes there’s someone to ask. Sometimes you get a structural metadata definition to work with. And sometimes you are the recipient of someone else’s best guess. Always be sure to find out what definitions were used.
In this case, the publisher has helpfully provided another table containing the definitions for the structural metadata.
Code
# First we set the url for the metadata table
metadata_url = "data/lesson-1/Yemen Cholera Outbreak Epidemiology Data - Metadata.csv"
# Import it from CSV
metadata = pd.read_csv(metadata_url)
# Show the metadata:
metadata.style.set_properties(subset=["Description"], **{"width": "400px", "text-align": "left"})

| | Column | Description |
|---|---|---|
| 0 | Date | Date when the figures were reported. |
| 1 | Governorate | The Governorate name as reported in the WHO epidemiology bulletin. |
| 2 | Cases | Number of cases recorded in the governorate since 27 April 2017. |
| 3 | Deaths | Number of deaths recorded in the governorate since 27 April 2017. |
| 4 | CFR (%) | The case fatality rate in governorate since 27 April 2017. |
| 5 | Attack Rate (per 1000) | The attack rate per 1,000 of the population in the governorate since 27 April 2017. |
| 6 | COD Gov English | The English name for the governorate according to the Inter Agency Standing Committee (IASC) Common Operation Datasets (CODs) for Yemen. |
| 7 | COD Gov Arabic | The Arabic name for the governorate according to the Inter Agency Standing Committee (IASC) Common Operation Datasets (CODs) for Yemen. |
| 8 | COD Gov Pcode | The PCODE name for the governorate according to the Inter Agency Standing Committee (IASC) Common Operation Datasets (CODs) for Yemen. |
| 9 | Bulletin Type | The type of bulletin from which the data was extracted. Bulletin types include Epidemiology bulletin, Weekly epidemiology bulletin, Daily epidemiology update. |
| 10 | Bulletin URL | The URL of the bulletin from which the data was extracted. |
Unless you work in epidemiology, “attack rate” may still be unfamiliar. The US Centers for Disease Control and Prevention has a self-study course (Dicker et al. 2012) which covers the principles of epidemiology and contains this definition: “In the outbreak setting, the term attack rate is often used as a synonym for risk. It is the risk of getting the disease during a specified period, such as the duration of an outbreak.”
An “Attack rate (per 1000)” implies the rate of new infections per 1,000 people in a particular population.
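As a back-of-the-envelope check, the Amran row in the table above implies a population at risk of roughly 1.16 million. Note that this population figure is back-derived from the table purely for illustration; it is not taken from an official source:

```python
# Attack rate per 1,000 = cases / population at risk * 1000
cases = 103965           # cumulative cases reported for Amran (from the table above)
population = 1_160_560   # assumed population at risk, back-derived for illustration
attack_rate = cases / population * 1000
print(round(attack_rate, 3))  # 89.582, matching the Amran row
```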
These data were sent to your team data scientist who fixed up what she could and sent back a chart.
Code
import pandas as pd
# Matplotlib for additional customization
from matplotlib import pyplot as plt
%matplotlib inline
def fix_governorates(data, fix_govs):
"""
Rename terms in a dataframe based on a key-value dictionary.
This is our function _fix_governorates_; note that we must pass it
two variables, known as arguments.
Args:
data - the dataframe we want to fix;
fix_govs - a dictionary of the governorates we need to correct.
The function will perform the following task:
For a given dataframe, date list, and dictionary of Governorates
loop through the keys in the dictionary and combine the list
of associated governorates into a new dataframe.
Return a new, corrected, dataframe.
Returns:
Pandas dataframe.
"""
# Create an empty list for each of the new dataframes we'll create
new_frames = []
# And an empty list for all the governorates we'll need to remove later
remove = []
# Create our list of dates
date_list = data["Date"].unique()
# Loop through each of the governorates we need to fix
for key in fix_govs.keys():
# Create a filtered dataframe containing only the governorates to fix
ds = data.loc[data.Governorate.isin(fix_govs[key])]
# New entries for the new dataframe
new_rows = {"Date": [],
"Cases": [],
"Deaths": [],
"CFR (%)": [],
"Attack Rate (per 1000)": []
}
# Divisor for averages (i.e. there could be more than 2 govs to fix)
num = len(fix_govs[key])
# Add the governorate values to the remove list
remove.extend(fix_govs[key])
# For each date, generate new values
for d in date_list:
# Data in the dataframe is stored as a Timestamp value
r = ds[ds["Date"] == pd.Timestamp(d)]
new_rows["Date"].append(pd.Timestamp(d))
new_rows["Cases"].append(r.Cases.sum())
new_rows["Deaths"].append(r.Deaths.sum())
new_rows["CFR (%)"].append(r["CFR (%)"].sum()/num)
new_rows["Attack Rate (per 1000)"].append(r["Attack Rate (per 1000)"].sum()/num)
# Create a new dataframe from the combined data
new_rows = pd.DataFrame(new_rows)
# And assign the values to the key governorate
new_rows["Governorate"] = key
# Add the new dataframe to our list of new frames
new_frames.append(new_rows)
# Get an inverse filtered dataframe from what we had before
ds = data.loc[~data.Governorate.isin(remove)]
new_frames.append(ds)
# Return a new concatenated dataframe with all our corrected data
return pd.concat(new_frames, ignore_index=True)
data_url = "data/lesson-1/Yemen Cholera Outbreak Epidemiology Data - Data_Governorate_Level.csv"
data = pd.read_csv(data_url)
# Removing commas for an entire column and converting to integers
data["Cases"] = [int(x.replace(",","")) for x in data["Cases"]]
# And converting to date is even simpler
data["Date"] = pd.to_datetime(data["Date"])
# A dictionary is a set of key: value pairs - the 'key' is a term used to index a specific value;
# the 'value' can be any Python object. Here it is a list of terms we want to search for:
fix = {"Hadramaut": ["Moklla","Say'on"],
"Al Hudaydah": ["Al Hudaydah", "Al-Hudaydah"],
"Al Jawf": ["Al Jawf", "Al_Jawf"],
"Al Maharah": ["Al Maharah", "AL Mahrah"],
"Marib": ["Marib", "Ma'areb"]
}
# First, we limit our original data only to the columns we will use,
# and we sort the table according to the attack rate:
data_slice = data[["Date", "Governorate", "Cases", "Deaths", "CFR (%)", "Attack Rate (per 1000)"]
].sort_values("Attack Rate (per 1000)", ascending=False)
data_slice = fix_governorates(data_slice, fix).sort_values("Attack Rate (per 1000)", ascending=False)
# First we create a pivot table of the data we wish to plot. Here only the "Cases", although you
# should experiment with the other columns as well.
drawing = pd.pivot_table(data_slice, values="Cases", index=["Date"], columns=["Governorate"])
# Then we set a plot figure size and draw
drawing.plot(figsize=(14,8), grid=False)
These are not glamorous charts or tables. This last is what I call a spaghetti chart because of the tangle of lines that make it difficult to track what is happening.
However, they are useful methods for investigating what the data tell us and contextualising it against the events behind the data.
Perhaps, given where we are, you feel some confidence that you could begin to piece together a story of what is happening in the Yemen cholera epidemic?
1.3.2 Data and the trouble with accuracy
Sitting at your computer in comfortable surroundings - whether in a quiet office, or the clatter and warmth of your favourite coffee shop - it is tempting to place confidence in a neat table of numbers and descriptions. You may have a sense that data are, in some reassuring way, truthy.
They are not.
All data are a reflection of the time when they were collected, the methodology that produced them, and the care with which that methodology was implemented. Data are a sample of a moment in time, and they are inherently imperfect.
Medical data, produced by interviewing patient volunteers, is reliant on self-reported experiences and people - even when they’re trying to be honest and reporting on something uncontroversial - have imperfect memories. Blood or tissue samples depend on the consistency with which those samples were acquired, and the chain which stretches from patient, to clinic, to courier, to laboratory, to data analyst. Anything can go wrong, from spillage to spoilage to contamination to overheating or freezing.
Even data generated autonomously via sensors or computational sampling is based on what a human thought was important to measure, and implemented by people who had to interpret instructions on what to collect and apply them to the tools at hand. Sensors can be in the wrong place, pointing in the wrong direction, miscalibrated, or based on faulty assumptions from the start.
Data carry the bias of the people who constructed the research and the hopes of those who wish to learn from it.
In future lessons we’ll consider methods of assessing the uncertainty in our data and how much confidence we can have. For this lesson, we’ll develop a theoretical understanding of the uncertainty and which data we can use to tell a story about events happening in Yemen.
In the space of six months (from May to November 2017), Yemen went from 35,000 cholera cases to almost 1 million. Deaths exceeded 2,000 people per month and the attack rate per 1,000 went from an average of 1 to 30. This reads like an out-of-control disaster.
At the same time, however, the fatality rate dropped from 1% to 0.2%.
Grounds for optimism, then? Somehow medical staff are getting on top of the illness even as infection spreads?
Consider how these data are collected. Consider the environment in which it is being collected.
- Yemen: Coalition Blockade Imperils Civilians - Human Rights Watch, 7 December 2017
- What is happening in Yemen and how are Saudi Arabia’s airstrikes affecting civilians - Paul Torpey, Pablo Gutiérrez, Glenn Swann and Cath Levett, The Guardian, 16 September 2016
- Saudi ‘should be blacklisted’ over Yemen hospital attacks - BBC, 20 April 2017
- Process Lessons Learned in Yemen’s National Dialogue - Erica Gaston, USIP, February 2014
According to UNICEF, as of November 2017, “More than 20 million people, including over 11 million children, are in need of urgent humanitarian assistance. At least 14.8 million are without basic healthcare and an outbreak of cholera has resulted in more than 900,000 suspected cases.”
Cholera incidence data are being collected in an active war zone where genocide and human rights violations are committed daily. Hospital staff are stretched thin, and many have been killed. Hospitals themselves are being deliberately targeted in aerial bombardment. Islamic religious law requires a body to be buried as soon as possible, and this is even more important in a conflict zone to limit further spread of disease.
The likelihood is that medical staff are overwhelmed, and that the living and ill must take precedence over the dead. They see as many people as they can, and it is a testament to their dedication and professionalism that these data continue to reach the WHO and UNICEF.
There are human beings behind these data. They have suffered greatly to bring it to you.
In other words, all we can be certain of is that the Cases and Deaths are the minimum likely, and that attack- and death rates are probably extremely inaccurate. The undercount in deaths may lead to a false sense that the death rate is falling relative to infection, but one shouldn’t count on this.
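A toy calculation - with invented death counts chosen only to be consistent with the rates quoted above - shows how an undercount of deaths alone can make the case fatality rate appear to fall:

```python
# Hypothetical figures, for illustration only. CFR (%) = deaths / cases * 100
early_cases, early_deaths = 35_000, 350    # implies a CFR of 1.0%
late_cases, late_deaths = 950_000, 1_900   # recorded deaths imply a CFR of 0.2%
print(early_deaths / early_cases * 100)    # 1.0
print(late_deaths / late_cases * 100)      # 0.2
# If half of the later deaths went unrecorded, the true CFR would be double:
print(late_deaths * 2 / late_cases * 100)  # 0.4
```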
Despite these caveats, humanitarian organisations must use these data to prepare their relief response. Food, medication and aid workers must be readied for the moment when fighting drops sufficiently to get to Yemen. Journalists hope to stir public opinion in donor nations (and those outside nations active in the conflict), using these data to explain what is happening.
The story we are working on must accept that the infection rate is only a reasonable approximation of what is happening, and that these data must be read in the context of the events behind them.
1.4 Presentation: simplicity and letting data tell a story
Identify an appropriate chart and present data to illustrate its core characteristics.
Data scientists, and those who are informed by them, require confidence across a broad range of expertise in a rapidly-changing environment where the tools and methods used to pursue our profession are in continual flux. Most of what we do is safely hidden from view.
The one area where what we do rises to the awareness of the lay public is in the presentation of our results. It is also an area with continual development of new visualisation tools and techniques.
This is to highlight that the presentation part of this course may date the fastest; take from it principles and approaches to presentation, not necessarily the specific implementations presented.
Presentation is everything from writing up academic findings for publication in a journal, to writing a financial and market report for a business, to producing journalism on a complex and fast-moving topic, to persuading donors and humanitarian agencies to take a particular health or environmental threat seriously.
It is, first and foremost, about organising your thoughts to tell a consistent and compelling story.
1.4.1 A language and approach to data-driven story-telling
There are “lies, damned lies, and statistics”, as Mark Twain used to say. Be very careful that you tell the story that is there, rather than one which reflects your own biases (cf. Section 1.1).
According to Edward Tufte (Tufte 2006), professor of statistics at Yale, graphical displays should:
- Show the data,
- Induce the viewer to think about the substance, rather than about the methodology, graphic design, the technology of graphic production, or something else,
- Avoid distorting what the data have to say,
- Present many numbers in a small space,
- Make large datasets coherent,
- Encourage the eye to compare different pieces of data,
- Reveal the data at several levels of detail, from a broad overview to the fine structure,
- Serve a reasonably clear purpose: description, exploration, tabulation, or decoration,
- Be closely integrated with the statistical and verbal descriptions of a dataset.
There are a lot of people with a great many opinions about what constitutes good visual practice. Manuel Lima, in his Visual Complexity blog, has even come up with an Information Visualisation Manifesto.
Any story has a beginning, a middle, and a conclusion. The story-telling form can vary but the best and most memorable stories have compelling narratives easily retold.
Throwing data at a bunch of charts in the hopes that something will stick does not promote engagement any more than randomly plunking at an instrument produces music.
Storytelling does not just happen.
Sun Tzu said, “There are not more than five musical notes, yet the combinations of these five give rise to more melodies than can ever be heard.”
These are the fundamental chart-types that form the data scientist’s toolkit:
- Line chart
- Bar chart
- Stacked / area variations of bar and line
- Bubble-charts
- Text charts
- Choropleth maps
- Tree maps
Plus we can use small-multiple versions of any of the above to enhance comparisons. Small multiples are simple charts placed alongside each other in a way that encourages analysis while still telling an engaging story. The axes are the same throughout and extraneous chart guides (like dividers between the charts and the vertical axes) have been removed. The simple line-chart becomes both modern and information-dense when presented in this way.
There are numerous special types of charts (such as Chernoff Faces) but you’re unlikely to have these implemented in your charting software.
Here is a simple methodology for developing a visual story:
- Write a flow-chart of the narrative encapsulating each of the components in a module
- Each module will encapsulate a single data-driven thought and the type of chart will be imposed by the data:
- Time-series can be presented in line charts, or by small multiples of other plots
- Geospatial data invites choropleths
- Complex multivariate data can be presented in tree maps
- In all matters, be led by the data and by good sense
- Arrange those modules in a series of illustrations
- Revise and edit according to the rules in the previous points
Writing a narrative dashboard with multiple charts can be guided by an adaptation of George Orwell’s rules from Politics and the English Language:
- Never use a pie chart; use a table instead.
- Never use a complicated chart where a simple one will do.
- Never clutter your data with unnecessary grids, ticks, labels or detail.
- If it is possible to remove a chart without taking away from your story, always remove it.
- Never mislead your reader through confusing or ambiguous axes or visualisations.
- Break any of these rules sooner than draw anything outright barbarous.
1.4.2 Telling the story of an epidemic in Yemen
We have covered a great deal in this first lesson and now we come to the final section. Before we go further, we need two new libraries. GeoPandas is almost identical to Pandas, but permits us to work with geospatial data (of which, more in a moment). Seaborn is similar to Matplotlib (and is a simplified wrapper around Matplotlib) but looks better, is designed for statistical data, and is simpler to use.
Our first step is to improve our line chart Figure 1.3 from our initial exploration. I mentioned the notion of small multiples earlier, and here is our first opportunity to draw one. We send our guidance to our data scientist and get this in response:
Code
# Seaborn for plotting and styling
import seaborn as sns
sns.set_style("white")
# Everything you need to know about Seaborn FacetGrid
# https://seaborn.pydata.org/generated/seaborn.FacetGrid.html#seaborn.FacetGrid
sm = sns.FacetGrid(data_slice, col="Governorate", col_wrap=4, height=2, aspect=2, margin_titles=True)
sm = sm.map(plt.plot, "Date", "Cases")
# And now format the plots with appropriate titles and font sizes
# The backslash '\' permits us to split a long line into two for better legibility
sm.set_titles("{col_name}", size=12).set_ylabels(size=10).set_yticklabels(size=8)\
.set_xlabels(size=10).set_xticklabels(size=8, rotation=40)
Notice how, even with the condensed format, it is still straightforward to understand what is happening and the overall display makes for a compelling and engaging visual.
Unfortunately, unless you know Yemen well, this story is incomplete. It is difficult to see where these changes are taking place, or how each governorate is related to the others in physical space. For that we need to plot our data onto a map.
There are a number of limitations to publishing data on maps:
- A choropleth map is really a type of bar chart where the height of the bars is represented by a colour gradient in 2D space.
- The boundaries that make up regions, districts or governorates are of wildly different sizes and can mislead the viewer into prioritising area over colour scale.
Despite these limitations, map-based charts are useful for grounding data in a physical place. When used in combination with other charts (such as the line-charts above) one can build a complete narrative.
To draw a map we need a shapefile. This is a collection of several files developed according to a standard created by Esri: shapes defined by geographic points, polylines or polygons in the main .shp file, together with companion files (such as the .shx index and .dbf attribute table) that carry metadata or attributes.
HDX has exactly what we need: Yemen - Administrative Boundaries. We wake up our data scientist and invite them to get to work.
Code
# Import our GeoPandas library
import geopandas as gpd
# The 'contextily' library allows us to draw on a map
# import contextily as ctx
# Open the shapefile called "yem_admbnda_adm1_govyem_cso_20191002.shp" and note that - if you're
# doing this on your home computer, you'll need to load the file from where-ever you saved it.
# While you only need to load this file here, the other file data needed are cross-referenced,
# so make sure to unzip all the files.
shape_data = gpd.GeoDataFrame.from_file("data/lesson-1/yem_admbnda_adm1_govyem_cso_20191002.shp")
# We can also draw this boundary on a map so you can see what we have
# The 'geometry' column above has the coordinate data we need
# We need to change the map projection so that it will fit the same
# coordinate system - this is beyond the requirements of this class,
# but read up on projection systems if you're interested.
shape_data.to_crs(epsg=3857, inplace=True)
# and create the chart axes
# ax = shape_data["geometry"].plot(color="darkred", alpha=0.25, figsize=(12,12))
# ctx.add_basemap(ax, source=ctx.providers.CartoDB.Positron)
# ax.set_axis_off()
fix = {"Ad Dali'": ["Al Dhale'e"],
"Al Hodeidah": ["Al Hudaydah"],
"Hadramawt": ["Hadramaut"],
"Ma'rib": ["Marib"],
"Sa'dah": ["Sa'ada"],
"Sana'a City": ["Amanat Al Asimah"],
"Ta'iz": ["Taizz"]
}
data_slice = fix_governorates(data_slice, fix).sort_values("Attack Rate (per 1000)", ascending=False)
# We have no data for Socotra island, so we can drop this row
# Its name is listed in the ADM1_EN column
shape_data = shape_data.loc[~shape_data.ADM1_EN.isin(["Socotra"])]
# And now we can merge our existing data_slice to produce our map data
map_data = pd.merge(shape_data, data_slice, how="outer", left_on="ADM1_EN", right_on="Governorate", indicator=False)
# Let's draw a map
# First, define a figure, axis and plot size
fig, ax = plt.subplots(figsize=(12,8))
ax.set_axis_off()
# We'll look at one specific date, the last entry in the series
md = map_data.loc[map_data.Date == "2018-02-18"]
# And plot - note that 'cmap' is simply a colour gradient used in the visualisation - we're using OrRd, orange-red
md.plot(ax=ax, column='Cases', cmap='OrRd')
Again, a lot had to be fixed under the hood as data had to be aligned between what was in the shapefile and what was in our original source. Decisions had to be made by a tired data scientist working under pressure. And, while this helps, here we hit a fundamental limit of a map … it would be nice to show a time-series of how events progressed.
Well, remember the small multiple? … So, to end this first lesson, here’s what a small multiple map looks like.
Code
# This is a bit more complex than you may expect ... but think of it like this:
# We're going to create a figure and then iterate over the time-series to progressively
# add in new subplots. Since there are 125 dates - and that's rather a lot - we'll
# deliberately limit this to the last reported date in each month, which includes
# the final date in the series.
# Create a datetime format data series
date_list = pd.Series([pd.Timestamp(d) for d in map_data["Date"].unique()])
# Sort dates in place
date_list.sort_values(inplace = True)
dl = {}
for d in date_list:
# A mechanism to get the last day of each year-month
k = "{}-{}".format(d.year, d.month)
dl[k] = d
# Recover and sort the unique list of dates
date_list = list(dl.values())
date_list.sort()
# Create our figure
fig = plt.figure(figsize=(16, 8))
# Set two trackers, first_date and sub_count
first_date = 0
sub_count = 1
# Loop through the dates, using "enumerate" to count the number of times we loop
for i, d in enumerate(date_list):
# Convert the Numpy time format to a simpler Python format
check_date = pd.Timestamp(d).to_pydatetime()
# Check if we've seen this month before
if check_date.month == first_date:
# If we have, check if it's the last value in the loop
if i < len(date_list)-1:
# And skip the rest of this loop
continue
# Store the month we've just reached
first_date = check_date.month
# Get a dataframe for the subplot at this date
subplot = map_data.loc[map_data.Date == pd.Timestamp(d)]
# Add the appropriate subplot in a grid of 3 rows by 4 columns
# (there are 10 time periods, so a 3x4 grid provides enough cells)
ax = fig.add_subplot(3, 4, sub_count)
# Increment the count
sub_count+=1
# Do some visual fixes to ensure we don't distort the maps, and provide titles
ax.set_aspect('equal')
ax.set_axis_off()
ax.title.set_text(pd.Timestamp(d).date())
# And plot
subplot.plot(ax=ax, column='Cases', cmap='OrRd')
The two small multiples - the line charts and the choropleth maps - permit us to answer our research question: there is an increasing incidence and prevalence of cholera in Yemen, with the increase occurring both in scale and in geographic range. The challenge for any aid response will not only be in addressing the scale of the problem, but also in distributing care over an increasingly large physical area.
The visualisations themselves, along with the source data, are all part of our answer to our question.