1  Answering questions with data

Data has become the most important language of our era, informing everything from intelligence in automated machines to predictive analytics in medical diagnostics. The plunging cost and easy accessibility of the raw requirements for such systems - data, software, distributed computing, and sensors - are driving the adoption and growth of data-driven decision-making.

The key to unlocking data reuse, and the new economic and social development opportunities these data offer, relies on data producers and data users having the technical insight necessary to manage those who work with data, and a conscious, motivated understanding of the new algorithmic tools available to us.

When researchers look to answer questions, they begin an investigative process:

  1. Question: Identify a research question or problem
  2. Ethics: Consider the ethical framework to inform the research process
  3. Pre-publish: Develop, and publish, a research protocol
  4. Collect: Collect appropriate data
  5. Curate: Restructure data into a useable format
  6. Analyse: Analyse that data
  7. Prepare: Formulate recommendations and conclusions
  8. Present: Present and publish the complete study, from data to results

Ordinarily, when data science is taught, everyone - teachers and students alike - prefers to focus on the last three points: analysis and presentation are more fun, and involve less frustration with messy data or ethical dilemmas. For those who work with source data - information often collected under uncertain circumstances - the bulk of their time is taken up with the first five points.

The objective of this syllabus is to ensure you have a comprehensive grasp of a data-driven research process. Data as a Science guides learners to confidence in the ethics, curation, analysis, and presentation of data, integrating each of these topics into each lesson:

flowchart LR
  A[Ethics] --> B[Curation]
  B --> C[Analysis]
  C --> D[Presentation]

Our first steps are to prepare our work surface and lay out the tools - both philosophical and technical - that we will use.

We start with our research question for this lesson:

Research Question

The war in Yemen has caused tremendous hardship and suffering to that region’s civilian population. Aid- and civil-rights organisations are concerned that cholera is spreading. International health response organisations want to assess the scale of the problem. Using information published by a credible international research organisation, map reported incidents of cholera over time and assess whether there is a growing regional cholera epidemic.

1.1 Ethics: a foundation for reasoning

Identify concepts in ethical reasoning which may influence our analysis and results from data.

Polygal was a gel made from beet and apple pectin. Administered to a severely wounded patient, it was supposed to reduce bleeding. To test this hypothesis, Sigmund Rascher administered a tablet to human subjects who were then shot or - without anaesthesia - had their limbs amputated.

During the Second World War, and under the direction of senior Nazi officers, medical experiments of quite unusual violence were conducted on prisoners of war and civilians regarded by the Nazi regime as sub-human. After the war, twenty medical doctors were tried for war crimes and crimes against humanity at the Doctors' Trial held in Nuremberg from 1946 to 1947.

Throughout the trial, expert witnesses, prosecutors and judges proposed key tests as part of a universal code which could be used to assess whether a medical experiment could be regarded as ethical. This became known as the Nuremberg Code. Key to that was the very first test:

The voluntary consent of the human subject is absolutely essential.

This means that the person involved should have legal capacity to give consent; should be so situated as to be able to exercise free power of choice, without the intervention of any element of force, fraud, deceit, duress, over-reaching, or other ulterior form of constraint or coercion; and should have sufficient knowledge and comprehension of the elements of the subject matter involved, as to enable him to make an understanding and enlightened decision. This latter element requires that, before the acceptance of an affirmative decision by the experimental subject, there should be made known to him the nature, duration, and purpose of the experiment; the method and means by which it is to be conducted; all inconveniences and hazards reasonably to be expected; and the effects upon his health or person, which may possibly come from his participation in the experiment.

The duty and responsibility for ascertaining the quality of the consent rests upon each individual who initiates, directs or engages in the experiment. It is a personal duty and responsibility which may not be delegated to another with impunity.

It would be decades before such ethical principles would be more generally adopted into research practice, and only after great human cost.

The Tuskegee syphilis experiment is regarded as “arguably the most infamous biomedical research study in US history” and led to significantly greater fear by minority Americans of participating in medical research (Katz et al. 2006).

Between 1932 and 1972, the US Public Health Service conducted a clinical study on 600 poor, semi-literate African-American farm-workers from Alabama. The experiment was not designed to cure, but to observe the full progression of untreated syphilis. And, while the experiment began before the development of antibiotics, even after penicillin became the standard treatment in 1947, subjects were deliberately denied treatment. Researchers even went out of their way to prevent those subjects drafted to serve in the military in the Second World War from receiving treatment.

It wasn’t until 1978 that the US Belmont Report was issued which offered specific ethical principles when conducting medical experiments, and only in 1991 was a universal set of rules adopted across the US.

Despite these hard-won lessons, there remain numerous investors and researchers who see ethical considerations as placing a burden on their objectives. In 2017, for example, citing frustration with the US Food and Drug Administration Institutional Review Board, Rational Vaccines began human-subject testing of their controversial herpes vaccine on the Caribbean island of St Kitts, deliberately placing themselves outside of US jurisdiction.

Far from being settled, the philosophical concerns and controversies raised by ethics are all around us (Baggini and Fosl 2007).

1.1.1 The context of right and wrong

Computers cannot make decisions. Their output is an absolute function of the data provided as input, and the algorithms applied to analyse that input. The aid of computers in decision-making does not override human responsibility and accountability.

Both data and algorithms should stand up to scrutiny so as to justify any and all decisions made as a result of their output. "Computer says no," is not an unquestionable statement.

In any community, people interact with their environment and each other to produce their livelihoods. The process by which they acquire and accumulate things, and how their interactions affect each other and their environment, lead to asymmetrical changes (Bowles, Carlin, and Stevens 2017). Over time, and over multiple generations, these changes accumulate in different ways to different people leading to highly diverse and complex societies.

Economists study these changes and attempt to understand the causes and effects which underpin them, as well as the mechanisms by which economic change may be realised. The question of what sorts of change are good or bad - especially where they give rise to incredible disparities in wealth and opportunity - is one which data and analysis can do little to answer.

Facts can tell you what has happened, not what we should do about it. Yet we collect data and perform analysis with the aim of furthering a particular objective.

That objective may have unforeseen consequences.

For example, combustion engines which burn fossil fuels have revolutionised transport and manufacturing, and produced incredible wealth, yet a consequence has been a catastrophic change to our environment through immediate effects, like pollution, and long-term effects, like climate change.

The process by which we examine and explain why what we consider to be right or wrong is considered right or wrong in matters of human conduct belongs to the study of ethics (Folse 2005).

Case-study

In 1947 Kurt Blome - Deputy Reich Health Leader, a high-ranking Nazi scientist - was acquitted of war crimes at the Nuremberg trials on the strength of intervention by the United States. Within two months, he was being debriefed by the US military who wished to learn everything he knew about biological warfare.

Do you feel the US was “right” or “wrong” to offer Blome immunity from prosecution in exchange for what he knew?

There were numerous experiments conducted by the Nazis that raise ethical dilemmas, including: immersing prisoners in freezing water to observe the result and test hypothermia revival techniques; high altitude pressure and decompression experiments; sulfanilamide tests for treating gangrene and other bacterial infections.

Do you feel it would be “right” or “wrong” to use these data in your experiments?

Further reading.

1.1.2 The weight of intent and what we choose to know

Our actions - as data scientists - are intended to persuade people to act or think other than the way they currently do, based on nothing more than the strength of our analysis, informed by data.

Whereas everything else we do - data curation, analysis and presentation - can describe human behaviour as it is, ethics provides a theoretical framework to describe the world as it should, or should not be. It gives us full range to describe an ideal outcome, and to consider all that we know and do not know which may impede or confound our desired result:

  • known knowns
  • known unknowns
  • unknown knowns
  • unknown unknowns

Each of these aspects can be dangerous, but what we insist on not knowing - the unknown-knowns; that which we do not like to know or do not want to know - constitute our inherent bias and can compromise any analysis irretrievably.

Case-study

A Nigerian man travels to a conference in the United States. After one of the sessions, he goes to the bathroom to wash his hands. The electronic automated soap dispenser does not recognise his hands beneath the sensor. A white American sees his confusion and places his hands beneath the device. Soap is dispensed. The Nigerian man tries again. It still does not recognise him.

How would something like this happen? Is it an ethical concern? Where would you place it in terms of the known vs unknown matrix above?

Further reading.

When we consider ethical outcomes, we use the terms good or bad to describe judgements about people or things, and we use right or wrong to refer to the outcome of specific actions. Understand, though, that - while right and wrong may sometimes be obvious - we are often stuck in ethical dilemmas.

How we consider whether an action is right or wrong comes down to the tension between what was intended by an action, and what the consequences of that action were. Are only intentions important? Or should we only consider outcomes? And how absolutely do you want to judge this chain: a good motivation, leading to a good intention, performing a good action, resulting in only the right consequences? How do we evaluate this against what it may be impossible to know at the time, even if that information will become available after a decision is made?

We also need to consider competing interests in right and wrong outcomes. A positive outcome for the individual making the decision may have a negative consequence for numerous others. Conversely, an altruistic person may act only for the benefit of others even to their own detriment.

Ethical problems do not always require a call to facts to justify a particular decision, but they do have a number of characteristics:

  • Public: the process by which we arrive at an ethical choice is known to all participants.
  • Informal: the process cannot always be codified into law like a legal system.
  • Rational: despite the informality, the logic used must be accessible and defensible.
  • Impartial: any decision must not favour any group or person.

Rather than imposing a specific set of rules to be obeyed, ethics provides a framework in which we may consider whether what we are setting out to achieve conforms to our values, and whether the process by which we arrive at our decision can be validated and inspected by others.

No matter how sophisticated our automated machines become, unless our intention is to construct a society “of machines, for machines”, people will always be needed to decide on what ethical considerations must be taken into account.

1.2 Curation: keeping track of what we know

Understand the process of data curation, and the custodial duty of data science. Learn some essential tools used by data scientists.

Ensuring we know what we know would seem to be the simplest of our ethical responsibilities.

As a project grows in complexity, and as the number of people involved gets larger, sharing what we know in a format accessible to all gets more complex. Even if everyone starts out the same, the diversity of requirements across any project will lead individuals to specialise. Unique grammars may be used to describe the data each collects, and these grammars will become opaque to others.

Before we even begin, we need a grammar to describe and organise the data we collect, and we need a mechanism to ensure that both the grammar and data are accessible to anyone who requires them.

1.2.1 Data about data

A grammar which describes our source data is known as metadata. A specific organisational grammar is known as a schema.

Consider this online textbook. It has a title, author and url. But - hidden amidst all your other resources - how will you know what this book is about? That's where a solid metadata system comes in; it improves your ability to understand the aboutness of an object.

Numerous metadata systems, known as schemas, are available, from the general to the specialised.

As an introduction, let’s consider citations.

If you’ve spent any time in research, you’ll know the importance of citing your sources. You’ll also know the wide range of ways in which you could present your references. BibTeX is a standardised structure to format lists of references. It separates out the specific bibliographic information according to a defined grammar:

@book{DataScience,
    title = "Data as a Science",
    author = "Gavin Chait",
    year = "2018",
    publisher = "Whythawk.com",
    institution = "Whythawk",
    url = "https://data-as-a-science.com/"
}

The term @book is an entry type which describes the type of information that the entry contains. You can have a range of such types: article, book, manual, techreport, and so on.

Each entry contains a number of fields. Some are compulsory, depending on the type, and some are optional. A book must contain an author, title, publisher and year. An article requires an author, title, journal, year and volume.

The term DataScience must be unique and references this book object. You can store a list of such objects in a standard text file with the extension .bib instead of .txt. Quarto, the software used to write this textbook, automatically looks up citations and places them in the References chapter.
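The grammar is simple enough to generate programmatically. As a sketch, the helper below - a hypothetical function, not part of any BibTeX library - renders a Python dictionary of fields as an entry like the one above:

```python
# A hypothetical helper - not part of any BibTeX library - that renders
# a dictionary of bibliographic fields as a BibTeX entry.
def format_bibtex(entry_type, key, fields):
    lines = [f"@{entry_type}{{{key},"]
    lines += [f'    {name} = "{value}",' for name, value in fields.items()]
    # Drop the trailing comma on the final field, then close the entry
    lines[-1] = lines[-1].rstrip(",")
    lines.append("}")
    return "\n".join(lines)

entry = format_bibtex("book", "DataScience", {
    "title": "Data as a Science",
    "author": "Gavin Chait",
    "year": "2018",
    "publisher": "Whythawk.com",
})
print(entry)
```

Real tools build on exactly this kind of structure: because the grammar is defined, software can parse, sort and restyle references without ever touching the source text.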

Exercise

Read through the Wikipedia BibTeX entry, then pick any book or article from your library and create a BibTeX object for it.

One of the more ubiquitous general metadata systems used to describe web resources is Dublin Core.

Some of the terms will be immediately obvious (title, subject, publisher, language, description, date, creator) while others will be unclear (isFormatOf, isReferencedBy, isReplacedBy). The purpose of the schema is not only to describe the referenced object, but also to connect it to other datasets and describe each object’s relationship to other objects. This is the complete Dublin Core reference.
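To make this concrete, here is an illustrative Dublin Core record for this textbook, stored as a plain Python dictionary. The element names are standard Dublin Core terms; the rendering below is informal, not one of the official serialisations:

```python
# An illustrative Dublin Core record for this textbook. The element names
# are standard Dublin Core terms; the rendering is informal.
dublin_core = {
    "title": "Data as a Science",
    "creator": "Gavin Chait",
    "publisher": "Whythawk",
    "date": "2018",
    "language": "en",
    "description": "An open textbook integrating the ethics, curation, "
                   "analysis and presentation of data.",
    "identifier": "https://data-as-a-science.com/",
}

# Render each element with the conventional 'dc:' prefix
for element, value in dublin_core.items():
    print(f"dc:{element}: {value}")
```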

Case-study

Responding to serious public health issues requires constant monitoring. The Humanitarian Data Exchange is provided by UNOCHA as their central repository for public data sharing. You can explore the site and discover datasets.

Our research question for this lesson requires us to assess a respectable public data source referencing Yemen’s cholera incident rate. Let’s look at Yemen: Cholera Outbreak Epidemiology Update Data. The dataset page contains a number of features useful when providing a centralised platform to share data:

  • Metadata: including Source, Contributor, Date of Dataset, Expected Update Frequency, Location, Visibility, License, Methodology, Comments, Tags
  • Data and resources: both for download and to explore online
  • Contact: importantly, if the data are entirely unclear, there’s a way to contact the data provider

Consider the metadata in full:

Source: World Health Organisation (WHO)
Contributor: HDX
Date of Dataset: 8 November 2017
Expected Update Frequency: Never (last update, 18 February 2018)
Location: Yemen
Visibility: Public
License: Creative Commons Attribution for Intergovernmental Organisations
Methodology: Registry
Caveats / Comments: The data contain figures from epi bulletins, weekly epi bulletins and daily bulletins. Starting 19 June 2017, the unit used for the attack rate changed from per 10,000 to per 1,000; previous data were adjusted to the new unit by dividing by 10. Starting 2 July 2017, the data included figures for Moklla, and from 6 July figures for Say'on; neither was mapped into any of the governorates in the Yemen CODs. Governorate-level updates were discontinued in this dataset; the last was on 18 February 2018, since when governorate-level data has been published only as an image.
Tags: ATTACK RATE, CASE FATALITY RATE, CASE FATALITY RATIO, CASES, CFR, CHOLERA, CONFLICT, DEATHS, HEALTH, WAR

On the search results page you can see a funnel icon next to the search bar. When you click on it, you can filter your results by these metadata:

Figure: HDX search filters

While it does happen that data-series end, it is unusual for a data-series to be ended because it has been replaced by a visualisation of that data. An image does not permit us to explore and investigate for ourselves. While these data meet the standards we require for research sources, the lack of continuity is certainly problematic.

This research question and case study were originally prepared while the data were still being updated. Since there is no way to know when you will be studying this lesson, for the sake of our question, let us assume that you are performing this analysis in February 2018, prior to the end of this series.

1.2.2 Tools for data curation, analysis and presentation

One thing we haven’t done yet is look at the data itself. Ordinarily, this is where a data science course would dive into programming. This is, however, a course for non-data scientists. Your job is to read, reflect and synthesise recommendations.

This course may use code, or present data, but your job is to focus on your core task.

1.3 Analysis: exploring data

Investigate and manipulate data to learn its metadata, shape and robustness.

There are a number of tools used by data scientists to understand and analyse data. We’ll get to those, but one of the fundamentals is simply exploring a new dataset.

1.3.1 Initial exploration

Usually, in data courses, you’re presented with a nice, clean dataset and run some algorithms on it and get some answers. That isn’t helpful to you. Except for data you collect, you’re unlikely to know the shape and contents of a dataset you import from others, no matter how good their research.

Especially for large datasets, it can be difficult to know how many unique terms you may be working with and how they relate to each other.
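As a sketch of how this initial exploration might look, the example below uses pandas on a small invented dataframe (the values are illustrative, not the Yemen data) to count and list unique terms:

```python
import pandas as pd

# A toy dataframe standing in for a much larger dataset; the values
# are invented for illustration.
df = pd.DataFrame({
    "Governorate": ["Amran", "Hajjah", "Amran", "Sana'a", "Hajjah", "Amran"],
    "Cases": [10, 25, 12, 7, 30, 15],
})

# How many distinct terms are we working with, and how often does each appear?
print(df["Governorate"].nunique())       # → 3
print(df["Governorate"].unique())        # the distinct terms themselves
print(df["Governorate"].value_counts())  # frequency of each term
```

On a dataset with thousands of rows, the same three calls reveal misspelled or duplicated terms long before any analysis begins.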

In Section 1.2.1 on data curation, we reviewed the Yemen: Cholera Outbreak Epidemiology Update Data. We could view this in a spreadsheet, like Excel, but we’re going to use Pandas, a Python library that makes working with large structured data simple. You can open these code blocks if you want, but we are focused on the outputs.

Code
# Comments to code are not executed and are flagged with this '#' symbol.
# First we'll import the pandas library.
# We use 'as' so that we can reference it as 'pd', which is shorter to type.
import pandas as pd

# In Python, we can declare our variables by simply naming them, like below
# 'data_url' is a variable name and the url is the text reference we're assigning to it
data_url = "data/lesson-1/Yemen Cholera Outbreak Epidemiology Data - Data_Governorate_Level.csv"

# We import our data as a 'dataframe' using this simple instruction.
# How did I know it was a CSV (Comma Separated Value) file? If you look at the end
# of the file path (above), you'll see '.csv'. A variable in Python can be anything.
# Here our variable is a Pandas dataframe type.
data = pd.read_csv(data_url)

# Let's see what that looks like. 'head(10)' limits the output to the first
# ten rows (and Python is '0' indexed, meaning the first item starts at '0'):
data.head(10)
Date Governorate Cases Deaths CFR (%) Attack Rate (per 1000) COD Gov English COD Gov Arabic COD Gov Pcode
0 2018-02-18 Amran 103965 176 0.17 89.582 Amran عمران 29.0
1 2018-02-18 Al Mahwit 62887 151 0.24 86.122 Al Mahwit المحويت 27.0
2 2018-02-18 Al Dhale'e 47136 81 0.17 64.438 Al Dhale'e الضالع 30.0
3 2018-02-18 Hajjah 121287 422 0.35 52.060 Hajjah حجة 17.0
4 2018-02-18 Sana'a 76250 123 0.16 51.859 Sana'a صنعاء 23.0
5 2018-02-18 Dhamar 103214 161 0.16 51.292 Dhamar ذمار 20.0
6 2018-02-18 Abyan 28243 35 0.12 49.477 Abyan أبين 12.0
7 2018-02-18 Al Hudaydah 155908 282 0.18 48.147 Al Hudaydah الحديدة 18.0
8 2018-02-18 Al Bayda 30568 36 0.12 40.253 Al Bayda البيضاء 14.0
9 2018-02-18 Amanat Al Asimah 103184 71 0.07 36.489 Amanat Al Asimah أمانة العاصمة 13.0
Figure 1.1: Table of cholera epidemiological data at the governorate level

Information describing the overall dataset is called descriptive metadata. Now we need information about the data within each dataset. That is called structural metadata: a grammar describing the structure and definitions of the data in a table.

Sometimes the data you’re working with has no further information and you need to experiment with similar data to assess what the terms mean, or what units are being used, or to gap-fill missing data. Sometimes there’s someone to ask. Sometimes you get a structural metadata definition to work with. And sometimes you are the recipient of someone else’s best guess. Always be sure to find out what definitions were used.
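One simple discipline is to keep those definitions alongside the data and check that every column is documented. A minimal sketch, using an invented subset of columns and descriptions rather than the full table:

```python
import pandas as pd

# A sketch of checking a table against its structural metadata: every column
# in the data should carry a documented definition. The columns and
# descriptions here are a small illustrative subset, not the full table.
definitions = {
    "Date": "Date when the figures were reported.",
    "Governorate": "The governorate name as reported in the WHO bulletin.",
    "Cases": "Number of cases recorded since 27 April 2017.",
}

df = pd.DataFrame({
    "Date": ["2018-02-18"],
    "Governorate": ["Amran"],
    "Cases": [103965],
})

# Any columns in the data that lack a definition?
undocumented = [column for column in df.columns if column not in definitions]
print(undocumented)  # → []
```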

In this case, the publisher has helpfully provided another table containing the definitions for the structural metadata.

Code
# First we set the url for the metadata table
metadata_url = "data/lesson-1/Yemen Cholera Outbreak Epidemiology Data - Metadata.csv"
# Import it from CSV
metadata = pd.read_csv(metadata_url)
# Show the metadata:
metadata.style.set_properties(subset=["Description"], **{"width": "400px", "text-align": "left"})
  Column Description
0 Date Date when the figures were reported.
1 Governorate The Governorate name as reported in the WHO epidemiology bulletin.
2 Cases Number of cases recorded in the governorate since 27 April 2017.
3 Deaths Number of deaths recorded in the governorate since 27 April 2017.
4 CFR (%) The case fatality rate in governorate since 27 April 2017.
5 Attack Rate (per 1000) The attack rate per 1,000 of the population in the governorate since 27 April 2017.
6 COD Gov English The English name for the governorate according to the Inter Agency Standing Committee (IASC) Common Operation Datasets (CODs) for Yemen.
7 COD Gov Arabic The Arabic name for the governorate according to the Inter Agency Standing Committee (IASC) Common Operation Datasets (CODs) for Yemen.
8 COD Gov Pcode The PCODE name for the governorate according to the Inter Agency Standing Committee (IASC) Common Operation Datasets (CODs) for Yemen.
9 Bulletin Type The type of bulletin from which the data was extracted. Bulletin types include Epidemiology bulletin, Weekly epidemiology bulletin, Daily epidemiology update.
10 Bulletin URL The URL of the bulletin from which the data was extracted.
Figure 1.2: Metadata definitions for the Yemen cholera epidemiological data

Unless you work in epidemiology, “attack rate” may still be unfamiliar. The US Centers for Disease Control and Prevention has a self-study course (Dicker et al. 2012) which covers the principles of epidemiology and contains this definition: “In the outbreak setting, the term attack rate is often used as a synonym for risk. It is the risk of getting the disease during a specified period, such as the duration of an outbreak.”

An “Attack rate (per 1000)” implies the rate of new infections per 1,000 people in a particular population.
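The arithmetic is straightforward. A sketch with a hypothetical population estimate (the population figure below is invented for illustration, not taken from the dataset):

```python
# Attack rate per 1,000: cases recorded over the period, divided by the
# population at risk, scaled to 1,000. The population estimate below is
# hypothetical, chosen only to illustrate the calculation.
cases = 103965
population = 1_160_000

attack_rate = cases / population * 1000
print(round(attack_rate, 3))  # cases per 1,000 of the population
```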

These data were sent to your team data scientist who fixed up what she could and sent back a chart.

Code
import pandas as pd
# Matplotlib for additional customization
from matplotlib import pyplot as plt
%matplotlib inline

def fix_governorates(data, fix_govs):
    """
    Rename terms in a dataframe based on a key-value dictionary.
    
    This is our function _fix_governorates_; note that we must pass it
    two variables, known as arguments.
    
    Args:
        data - the dataframe we want to fix;
        fix_govs - a dictionary of the governorates we need to correct.
    
    The function will perform the following task:
        For a given dataframe and dictionary of governorates,
        loop through the keys in the dictionary and combine the list
        of associated governorates into a new dataframe.
        Return a new, corrected, dataframe.
        
    Returns:
        Pandas dataframe.
    """
    # Create an empty list for each of the new dataframes we'll create
    new_frames = []
    # And an empty list for all the governorates we'll need to remove later
    remove = []
    # Create our list of dates
    date_list = data["Date"].unique()
    # Loop through each of the governorates we need to fix
    for key in fix_govs.keys():
        # Create a filtered dataframe containing only the governorates to fix
        ds = data.loc[data.Governorate.isin(fix_govs[key])]
        # New entries for the new dataframe
        new_rows = {"Date": [],
                    "Cases": [],
                    "Deaths": [],
                    "CFR (%)": [],
                    "Attack Rate (per 1000)": []
                   }
        # Divisor for averages (i.e. there could be more than 2 govs to fix)
        num = len(fix_govs[key])
        # Add the governorate values to the remove list
        remove.extend(fix_govs[key])
        # For each date, generate new values
        for d in date_list:
            # Data in the dataframe is stored as a Timestamp value
            r = ds[ds["Date"] == pd.Timestamp(d)]
            new_rows["Date"].append(pd.Timestamp(d))
            new_rows["Cases"].append(r.Cases.sum())
            new_rows["Deaths"].append(r.Deaths.sum())
            new_rows["CFR (%)"].append(r["CFR (%)"].sum()/num)
            new_rows["Attack Rate (per 1000)"].append(r["Attack Rate (per 1000)"].sum()/num)
        # Create a new dataframe from the combined data
        new_rows = pd.DataFrame(new_rows)
        # And assign the values to the key governorate
        new_rows["Governorate"] = key
        # Add the new dataframe to our list of new frames
        new_frames.append(new_rows)
    # Get an inverse filtered dataframe from what we had before
    ds = data.loc[~data.Governorate.isin(remove)]
    new_frames.append(ds)
    # Return a new concatenated dataframe with all our corrected data
    return pd.concat(new_frames, ignore_index=True)

data_url = "data/lesson-1/Yemen Cholera Outbreak Epidemiology Data - Data_Governorate_Level.csv"
data = pd.read_csv(data_url)
# Removing commas for an entire column and converting to integers
data["Cases"] = [int(x.replace(",","")) for x in data["Cases"]]
# And converting to date is even simpler
data["Date"] = pd.to_datetime(data["Date"])
# A dictionary is a set of key: value pairs - the 'key' is a term used to index a specific value;
# the 'value' can be any Python object. Here it is a list of terms we want to search for:
fix = {"Hadramaut": ["Moklla","Say'on"],
       "Al Hudaydah": ["Al Hudaydah", "Al-Hudaydah"], 
       "Al Jawf": ["Al Jawf", "Al_Jawf"], 
       "Al Maharah": ["Al Maharah", "AL Mahrah"], 
       "Marib": ["Marib", "Ma'areb"]
      }
# First, we limit our original data only to the columns we will use,
# and we sort the table according to the attack rate:
data_slice = data[["Date", "Governorate", "Cases", "Deaths", "CFR (%)", "Attack Rate (per 1000)"]
                 ].sort_values("Attack Rate (per 1000)", ascending=False)
data_slice = fix_governorates(data_slice, fix).sort_values("Attack Rate (per 1000)", ascending=False)

# First we create a pivot table of the data we wish to plot. Here only the "Cases", although you
# should experiment with the other columns as well.
drawing = pd.pivot_table(data_slice, values="Cases", index=["Date"], columns=["Governorate"])
# Then we set a plot figure size and draw
drawing.plot(figsize=(14,8), grid=False)
Figure 1.3: Yemen time-series of cholera incidents in governorates
Exercise

There’s a lot that was done to get to this point. Assumptions were made. You can read the code to get a sense of what these may have been. Think about this chart. How much can you interpret and how much is wrong with it?

These are not glamorous charts or tables. This last is what I call a spaghetti chart because of the tangle of lines that make it difficult to track what is happening.

However, they are useful methods for investigating what the data tell us and contextualising them against the events behind the data.

Perhaps, given where we are, you feel some confidence that you could begin to piece together a story of what is happening in the Yemen cholera epidemic?

1.3.2 Data and the trouble with accuracy

Sitting at your computer in comfortable surroundings - whether in a quiet office, or the clatter and warmth of your favourite coffee shop - it is tempting to place confidence in a neat table of numbers and descriptions. You may have a sense that data are, in some reassuring way, truthy.

They are not.

All data are a reflection of the time when they were collected, the methodology that produced them, and the care with which that methodology was implemented. They are a sample of a moment in time, and they are inherently imperfect.

Medical data, produced by interviewing patient volunteers, are reliant on self-reported experiences, and people - even when they're trying to be honest and reporting on something uncontroversial - have imperfect memories. Blood or tissue samples depend on the consistency with which those samples were acquired, and on the chain which stretches from patient, to clinic, to courier, to laboratory, to data analyst. Anything can go wrong, from spillage to spoilage to contamination to overheating or freezing.

Even data generated autonomously via sensors or computational sampling are based on what a human thought was important to measure, and implemented by people who had to interpret instructions on what to collect and apply them to the tools at hand. Sensors can be in the wrong place, pointing in the wrong direction, miscalibrated, or based on faulty assumptions from the start.

Data carry the bias of the people who constructed the research and the hopes of those who wish to learn from it.

Data are inherently uncertain and any analysis must be absolutely cognizant of this. It is the reason we start with ethics. We must, from the outset, be truthful to ourselves.

In future lessons we’ll consider methods of assessing the uncertainty in our data and how much confidence we can have. For this lesson, we’ll develop a theoretical understanding of the uncertainty and which data we can use to tell a story about events happening in Yemen.

In the space of six months (from May to November 2017), Yemen went from 35,000 cholera cases to almost 1 million. Deaths exceeded 2,000 people per month and the attack rate per 1,000 went from an average of 1, to 30. This reads like an out-of-control disaster.

At the same time, however, the fatality rate dropped from 1% to 0.2%.

Grounds for optimism, then? Somehow medical staff are getting on top of the illness even as infection spreads?

Consider how these data are collected. Consider the environment in which they are being collected.

Background reading on what is happening in Yemen (December 2017)

According to UNICEF, as of November 2017, “More than 20 million people, including over 11 million children, are in need of urgent humanitarian assistance. At least 14.8 million are without basic healthcare and an outbreak of cholera has resulted in more than 900,000 suspected cases.”

Cholera incidence data are being collected in an active war zone where genocide and human rights violations are committed daily. Hospital staff are stretched thin, and many have been killed. Hospitals themselves are being deliberately targeted in aerial bombardment. Islamic religious law requires a body to be buried as soon as possible, and this is even more important in a conflict zone to limit further spread of disease.

The likelihood is that medical staff are overwhelmed, and that the living and ill must take precedence over the dead. They see as many people as they can, and it is a testament to their dedication and professionalism that these data continue to reach the WHO and UNICEF.

There are human beings behind these data. They have suffered greatly to bring it to you.

In other words, all we can be certain of is that the Cases and Deaths are likely minimums, and that the attack and death rates are probably highly inaccurate. The undercount in deaths may create a false sense that the death rate is falling relative to infection, but one shouldn't count on this.
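The arithmetic behind that false sense of optimism is easy to sketch. Here is a minimal example, using illustrative numbers (not official counts), showing how an undercount in deaths deflates the apparent case fatality rate:

```python
# Illustrative numbers only - not official counts
reported_cases = 950_000
reported_deaths = 2_200

# Case fatality rate (CFR) as computed from the reported figures
cfr_reported = reported_deaths / reported_cases

# Suppose, hypothetically, that only half of cholera deaths ever
# reach the surveillance system; the true CFR would be double
undercount_factor = 2
cfr_adjusted = (reported_deaths * undercount_factor) / reported_cases

print(f"Reported CFR: {cfr_reported:.2%}")
print(f"Adjusted CFR: {cfr_adjusted:.2%}")
```

Cases are probably undercounted as well, which pushes the rate the other way. The point is that the reported rate is the ratio of two uncertain counts, not a measurement.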

Despite these caveats, humanitarian organisations must use these data to prepare their relief response. Food, medication and aid workers must be readied for the moment when fighting subsides sufficiently to get into Yemen. Journalists hope to stir public opinion in donor nations (and in those nations active in the conflict), using these data to explain what is happening.

The story we are working on must accept that the infection rate gives only a reasonable approximation of what is happening, and that these data must be interpreted against the events behind them.

1.4 Presentation: simplicity and letting data tell a story

Identify an appropriate chart and present data to illustrate its core characteristics

Data scientists, and those who are informed by them, require confidence across a broad range of expertise, in a rapidly changing environment in which the tools and methods of our profession are in continual flux. Most of what we do is safely hidden from view.

The one area where what we do rises to the awareness of the lay public is in the presentation of our results. It is also an area with continual development of new visualisation tools and techniques.

All of which is to say that the presentation part of this course may date the fastest. Take from it principles and approaches to presentation, not necessarily the specific implementation presented.

Presentation is everything from writing up academic findings for publication in a journal, to writing a financial and market report for a business, to producing journalism on a complex and fast-moving topic, to persuading donors and humanitarian agencies to take a particular health or environmental threat seriously.

It is, first and foremost, about organising your thoughts to tell a consistent and compelling story.

1.4.1 A language and approach to data-driven story-telling

There are “lies, damned lies, and statistics”, as Mark Twain famously put it (attributing the phrase to Benjamin Disraeli). Be very careful that you tell the story that is there, rather than one which reflects your own biases (cf. Section 1.1).

According to Edward Tufte (Tufte 2006), professor of statistics at Yale, graphical displays should:

  • Show the data,
  • Induce the viewer to think about the substance, rather than about the methodology, graphic design, the technology of graphic production, or something else,
  • Avoid distorting what the data have to say,
  • Present many numbers in a small space,
  • Make large datasets coherent,
  • Encourage the eye to compare different pieces of data,
  • Reveal the data at several levels of detail, from a broad overview to the fine structure,
  • Serve a reasonably clear purpose: description, exploration, tabulation, or decoration,
  • Be closely integrated with the statistical and verbal descriptions of a dataset.

There are a lot of people with a great many opinions about what constitutes good visual practice. Manuel Lima, in his Visual Complexity blog, has even come up with an Information Visualisation Manifesto.

Any story has a beginning, a middle, and a conclusion. The story-telling form can vary but the best and most memorable stories have compelling narratives easily retold.

Throwing data at a bunch of charts in the hope that something will stick does not promote engagement, any more than randomly plunking at an instrument produces music.

Storytelling does not just happen.

Sun Tzu said, “There are not more than five musical notes, yet the combinations of these five give rise to more melodies than can ever be heard.”

These are the fundamental chart-types that form the data scientist’s toolkit:

  • Line chart
  • Bar chart
  • Stacked / area variations of bar and line
  • Bubble-charts
  • Text charts
  • Choropleth maps
  • Tree maps

Plus we can use small-multiple versions of any of the above to enhance comparisons. Small multiples are simple charts placed alongside each other in a way that encourages analysis while still telling an engaging story. The axes are the same throughout and extraneous chart guides (like dividers between the charts and the vertical axes) have been removed. The simple line-chart becomes both modern and information-dense when presented in this way.
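The idea can be sketched in a few lines of Matplotlib (made-up data, and a hypothetical output filename): one small panel per series, shared axes so the eye can compare, and extraneous guides stripped away.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt

# Made-up series, one per panel
series = {"A": [1, 2, 4, 8], "B": [2, 3, 5, 6], "C": [1, 1, 2, 9]}

# sharex/sharey keep every panel on the same scale
fig, axes = plt.subplots(1, len(series), sharex=True, sharey=True,
                         figsize=(9, 2))
for ax, (name, values) in zip(axes, series.items()):
    ax.plot(range(len(values)), values)
    ax.set_title(name, size=10)
    # Strip the extraneous chart guides
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)

fig.savefig("small-multiples.png")
```

The shared y-axis is what makes the comparison honest: each panel reads against the same scale, so a tall line in one panel means the same thing in every panel.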

There are numerous special types of charts (such as Chernoff Faces) but you’re unlikely to have these implemented in your charting software.

Here is a simple methodology for developing a visual story:

  • Write a flow-chart of the narrative encapsulating each of the components in a module
  • Each module will encapsulate a single data-driven thought and the type of chart will be imposed by the data:
    • Time-series can be presented in line charts, or by small multiples of other plots
    • Geospatial data invites choropleths
    • Complex multivariate data can be presented in tree maps
  • In all matters, be led by the data and by good sense
  • Arrange those modules in a series of illustrations
  • Revise and edit according to the rules in the previous points

Writing a narrative dashboard with multiple charts can be guided by George Orwell’s rules from Politics and the English Language:

  • Never use a pie chart; use a table instead.
  • Never use a complicated chart where a simple one will do.
  • Never clutter your data with unnecessary grids, ticks, labels or detail.
  • If it is possible to remove a chart without taking away from your story, always remove it.
  • Never mislead your reader through confusing or ambiguous axes or visualisations.
  • Break any of these rules sooner than draw anything outright barbarous.

1.4.2 Telling the story of an epidemic in Yemen

We have covered a great deal in this first lesson and now we come to the final section. Before we go further, we need two new libraries. GeoPandas is almost identical to Pandas, but permits us to work with geospatial data (of which, more in a moment). Seaborn is similar to Matplotlib (and is a simplified wrapper around Matplotlib) but looks better, is designed for statistical data, and is simpler to use.

Our first step is to improve on our line chart (Figure 1.3) from our initial exploration. I mentioned the notion of small multiples earlier, and here is our first opportunity to draw one. We send our guidance to our data scientist and get this in response:

Code
# Seaborn for plotting and styling
import seaborn as sns
# Matplotlib should already be loaded from our earlier exploration,
# but we rely on it below for the `plt.plot` call
import matplotlib.pyplot as plt

sns.set_style("white")
# Everything you need to know about Seaborn FacetGrid
# https://seaborn.pydata.org/generated/seaborn.FacetGrid.html#seaborn.FacetGrid
sm = sns.FacetGrid(data_slice, col="Governorate", col_wrap=4, height=2, aspect=2, margin_titles=True)
sm = sm.map(plt.plot, "Date", "Cases")
# And now format the plots with appropriate titles and font sizes
# The backslash '\' permits us to split a long line into two for better legibility
sm.set_titles("{col_name}", size=12).set_ylabels(size=10).set_yticklabels(size=8)\
                                    .set_xlabels(size=10).set_xticklabels(size=8, rotation=40)
Figure 1.4: Yemen time-series of cholera incidents in governorates, small multiples

Notice how, even with the condensed format, it is still straightforward to understand what is happening and the overall display makes for a compelling and engaging visual.

Unfortunately, unless you know Yemen well, this story is incomplete. It is difficult to see where these changes are taking place, or how each governorate is related to the others in physical space. For that we need to plot our data onto a map.

There are a number of limitations to publishing data on maps:

  • A choropleth map is really a type of bar chart in which the height of each bar is encoded as a colour gradient in 2D space.
  • The boundaries that make up regions, districts or governorates are of wildly different sizes, and large areas can mislead the eye into prioritising size over the colour scale.

Despite these limitations, map-based charts are useful for grounding data in a physical place. When used in combination with other charts (such as the line-charts above) one can build a complete narrative.

To draw a map we need a shapefile. Despite the singular name, a shapefile is a collection of several files, developed according to a standard created by Esri, that contain shapes defined by geographic points, polylines or polygons, along with additional files holding metadata and attributes.
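Because a shapefile is a bundle rather than a single file, it is worth checking that the sidecar files were all unzipped alongside the `.shp` before trying to load it. A small, hypothetical helper (the mandatory extensions come from the Esri standard; the path is the one used later in this lesson):

```python
from pathlib import Path

# The minimum mandatory members of a shapefile bundle;
# .prj (the projection definition) is strongly recommended too
REQUIRED = [".shp", ".shx", ".dbf"]

def missing_sidecars(shp_path):
    """Return the required shapefile extensions not found next to shp_path."""
    stem = Path(shp_path).with_suffix("")
    return [ext for ext in REQUIRED if not stem.with_suffix(ext).exists()]

# An empty list means the bundle is complete and safe to load
print(missing_sidecars("data/lesson-1/yem_admbnda_adm1_govyem_cso_20191002.shp"))
```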

HDX has exactly what we need as Yemen - Administrative Boundaries. We wake up our data scientist and invite them to get to work.

Code
# Import our GeoPandas library
import geopandas as gpd
# The 'contextily' library allows us to draw on a map
# import contextily as ctx
# Open the shapefile called "yem_admbnda_adm1_govyem_cso_20191002.shp". Note that, if you're
# doing this on your home computer, you'll need to load the file from wherever you saved it.
# While you only load this one file here, the other files in the bundle are cross-referenced,
# so make sure to unzip all of them.
shape_data = gpd.GeoDataFrame.from_file("data/lesson-1/yem_admbnda_adm1_govyem_cso_20191002.shp")
# We can also draw this boundary on a map so you can see what we have
# The 'geometry' column above has the coordinate data we need
# We need to change the map projection so that it will fit the same
# coordinate system - this is beyond the requirements of this class,
# but read up on projection systems if you're interested.
shape_data.to_crs(epsg=3857, inplace=True)
# and create the chart axes
# ax = shape_data["geometry"].plot(color="darkred", alpha=0.25, figsize=(12,12))
# ctx.add_basemap(ax, source=ctx.providers.CartoDB.Positron)
# ax.set_axis_off()
fix = {"Ad Dali'": ["Al Dhale'e"],
       "Al Hodeidah": ["Al Hudaydah"], 
       "Hadramawt": ["Hadramaut"], 
       "Ma'rib": ["Marib"], 
       "Sa'dah": ["Sa'ada"],
       "Sana'a City": ["Amanat Al Asimah"], 
       "Ta'iz": ["Taizz"]
      }
data_slice = fix_governorates(data_slice, fix).sort_values("Attack Rate (per 1000)", ascending=False)
# We have no data for Socotra island, so we can drop this row
# Its name is listed in the ADM1_EN column
shape_data = shape_data.loc[~shape_data.ADM1_EN.isin(["Socotra"])]
# And now we can merge our existing data_slice to produce our map data
map_data = pd.merge(shape_data, data_slice, how="outer", left_on="ADM1_EN", right_on="Governorate", indicator=False)

# Let's draw a map
# First, define a figure, axis and plot size
fig, ax = plt.subplots(figsize=(12,8))
ax.set_axis_off()
# We'll look at one specific date, the last entry in the series
md = map_data.loc[map_data.Date == "2018-02-18"]
# And plot - note that 'cmap' is simply a colour gradient used in the visualisation - we're using OrRd, orange-red
md.plot(ax=ax, column='Cases', cmap='OrRd')
Figure 1.5: Yemen initial geospatial cholera distribution, February 2018

Again, a lot had to be fixed under the hood: governorate names had to be aligned between the shapefile and our original source. Decisions had to be made by a tired data scientist working under pressure. And, while the map helps, here we hit a fundamental limit … it would be nice to show a time-series of how events progressed.
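The `fix_governorates` helper used above is defined elsewhere in the lesson. As a sketch of the kind of alignment it performs (the real implementation may differ), it renames each source variant to the shapefile's canonical name so that the merge on `ADM1_EN` succeeds:

```python
import pandas as pd

def fix_governorates(df, fix):
    """Rename source governorate variants to the shapefile's canonical names.

    `fix` maps each canonical (shapefile) name to a list of variants
    found in the source data.
    """
    rename = {variant: canonical
              for canonical, variants in fix.items()
              for variant in variants}
    df = df.copy()
    df["Governorate"] = df["Governorate"].replace(rename)
    return df

# For example, "Al Dhale'e" in the source becomes "Ad Dali'" to match ADM1_EN
```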

Well, remember the small multiple? … So, to end this first lesson, here’s what a small multiple map looks like.

Code
# This is a bit more complex than you may expect ... but think of it like this:
# We're going to create a figure and then iterate over the time-series to progressively
# add new subplots. Since there are 125 dates - and that's rather a lot - we'll
# deliberately limit this to the last reporting date in each month.

# Create a datetime format data series
date_list = pd.Series([pd.Timestamp(d) for d in map_data["Date"].unique()])
# Sort dates in place
date_list.sort_values(inplace = True)
dl = {}
for d in date_list:
    # A mechanism to get the last day of each year-month
    k = "{}-{}".format(d.year, d.month)
    dl[k] = d
# Recover and sort the unique list of dates
date_list = list(dl.values())
date_list.sort()

# Create our figure
fig = plt.figure(figsize=(16, 8))
# Set two tracking variables, first_date and sub_count
first_date = 0
sub_count = 1
# Loop through the dates, using "enumerate" to count the number of times we loop
for i, d in enumerate(date_list):
    # Convert the Numpy time format to a simpler Python format
    check_date = pd.Timestamp(d).to_pydatetime()
    # Check if we've seen this month before
    if check_date.month == first_date:
        # If we have, check if it's the last value in the loop
        if i < len(date_list)-1:
            # And skip the rest of this loop
            continue
    # Store the month we've just reached
    first_date = check_date.month
    # Get a dataframe for the subplot at this date
    subplot = map_data.loc[map_data.Date == pd.Timestamp(d)]
    # Add the appropriate subplot to a grid of 3 rows by 4 columns
    # (there are 10 time periods, so a 3x4 grid gives enough slots)
    ax = fig.add_subplot(3, 4, sub_count)
    # Increment the count
    sub_count+=1
    # Do some visual fixes to ensure we don't distort the maps, and provide titles
    ax.set_aspect('equal')
    ax.set_axis_off()
    ax.title.set_text(pd.Timestamp(d).date())
    # And plot
    subplot.plot(ax=ax, column='Cases', cmap='OrRd')
Figure 1.6: Yemen geospatial time-series of cholera distribution, May 2017 to February 2018
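The month-end selection in the loop above can also be written more compactly in pandas, grouping the dates by year-month and keeping the maximum in each group (a sketch, with a short illustrative date list):

```python
import pandas as pd

# A short illustrative series of reporting dates
dates = pd.Series(pd.to_datetime([
    "2017-05-22", "2017-05-29", "2017-06-05", "2017-06-26",
]))

# Group by year-month and keep the last reporting date in each month
last_per_month = dates.groupby(dates.dt.to_period("M")).max().sort_values()
print(list(last_per_month.dt.date))
```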

The two small-multiples - the line charts and the choropleth - permit us to answer our research question: there is an increasing incidence and prevalence of cholera in Yemen, with the increase occurring both in scale and in geographic range. The challenge for any aid response will not only be in addressing the scale of the problem, but also in distributing care over an increasingly large physical area.

The visualisations themselves, along with the source data, are all part of our answer to our question.

1.5 Group tutorial

Exercise

It is 05h00 in the morning. In a little over two hours there will be a high-level government meeting to discuss an urgent response to various matters around the world. You will have 15 minutes - if that - to brief a distracted aide to a senior advisor to one of the ministers who will be at the meeting. You can prepare one visual aid, and you must get your point across.

This will be one agenda item on a long list.

What do you want this person to know about the situation in Yemen? What do you want them to do?

1.6 References

Alberg, Anthony J, Ji Wan Park, Brant W Hager, Malcolm V Brock, and Marie Diener-West. 2004. “The Use of Overall Accuracy to Evaluate the Validity of Screening or Diagnostic Tests.” Journal of General Internal Medicine 19 (5 Pt 1): 460–65. https://doi.org/10.1111/j.1525-1497.2004.30091.x.
Anwar, Nadia, and Ela Hunt. 2009. “Francisella Tularensis Novicida Proteomic and Transcriptomic Data Integration and Annotation Based on Semantic Web Technologies.” BMC Bioinformatics 10 (10): S3. https://doi.org/10.1186/1471-2105-10-S10-S3.
Baggini, Julian, and Peter Fosl. 2007. The Ethics Toolkit. Blackwell Publishing. http://www.blackwellpublishing.com/.
Bardi, Laura. 2020. “Early Kiwifruit Decline: A Soil-Borne Disease Syndrome or a Climate Change Effect on Plant-Soil Relations?” Frontiers in Agronomy 2. https://doi.org/10.3389/fagro.2020.00003.
Bellary, Shantala, Binny Krishnankutty, and M. S. Latha. 2014. “Basics of Case Report Form Designing in Clinical Research.” Perspectives in Clinical Research 5 (4): 159–66. https://doi.org/10.4103/2229-3485.140555.
Bowles, Samuel, Wendy Carlin, and Margaret Stevens. 2017. The Economy. The CORE Project. http://www.core-econ.org/the-economy/.
Brembs, Björn. 2019. “Reliable Novelty: New Should Not Trump True.” PLoS Biology 17 (2). https://doi.org/10.1371/journal.pbio.3000117.
Budoff, Matthew J., Deepak L. Bhatt, April Kinninger, Suvasini Lakshmanan, Joseph B. Muhlestein, Viet T. Le, Heidi T. May, et al. 2020. “Effect of Icosapent Ethyl on Progression of Coronary Atherosclerosis in Patients with Elevated Triglycerides on Statin Therapy: Final Results of the EVAPORATE Trial.” European Heart Journal, August. https://doi.org/10.1093/eurheartj/ehaa652.
Chait, Gavin. 2014. “Technical Assessment of Open Data Platforms for National Statistical Organisations.” Text/HTML. Washington, D.C.: World Bank Group. https://openknowledge.worldbank.org/handle/10986/21111.
———. 2024. “Auditable and Reusable Crosswalks for Fast, Scaled Integration of Scattered Tabular Data.” Text/HTML. https://arxiv.org/abs/2409.01517.
Crane, Harry, and Ryan Martin. 2018. “In Peer Review We (Don’t) Trust: How Peer Review’s Filtering Poses a Systemic Risk to Science.” Researchers.One, September. https://www.researchers.one/article/2018-09-17.
Creswell, John W., and J. David Creswell. 2017. Research Design Qualitative, Quantitative, and Mixed Methods Approaches. Sage. https://us.sagepub.com/en-us/nam/research-design/book255675.
Dahmen, Jessamyn, and Diane Cook. 2019. “SynSys: A Synthetic Data Generation System for Healthcare Applications.” Sensors (Basel, Switzerland) 19 (5). https://doi.org/10.3390/s19051181.
Dicker, Richard, Fátima Coronado, Denise Koo, and Roy Gibson Parrish. 2012. Principles of Epidemiology in Public Health Practice: An Introduction to Applied Epidemiology and Biostatistics. Third. U.S. Department of Health; Human Services. https://web1.sph.emory.edu/epicourses/SS1000_Principles_of_Epidemiology_Self_Study_Course.pdf.
Diez, David M, Christopher D Barr, and Mine Çetinkaya-Rundel. 2015. OpenIntro Statistics. Third. OpenIntro.org. https://www.openintro.org/.
Downey, Allen B. 2014. Think Stats 2 - Exploratory Data Analysis in Python. 2.1.0 ed. Needham, Massachusetts: Green Tea Press. https://greenteapress.com/wp/think-stats-2e/.
Emanuel, Ezekiel J., David Wendler, and Christine Grady. 2000. “What Makes Clinical Research Ethical?” JAMA 283 (20): 2701–11. https://doi.org/10.1001/jama.283.20.2701.
Folse, Henry. 2005. Some Fundamental Concepts in Ethics. Department of Philosophy, College of Arts; Sciences, Loyola University. http://people.loyno.edu/~folse/ethics.html.
Greco, Marco, Flavio Crippa, Roberto Agresti, Ettore Seregni, Alberto Gerali, Riccardo Giovanazzi, Andrea Micheli, et al. 2001. “Axillary Lymph Node Staging in Breast Cancer by 2-Fluoro-2-Deoxy-d-Glucose–Positron Emission Tomography: Clinical Evaluation and Alternative Management.” JNCI: Journal of the National Cancer Institute 93 (8): 630–35. https://doi.org/10.1093/jnci/93.8.630.
Hume, David. 1740. Treatise of Human Nature. https://en.wikisource.org/wiki/Page:Treatise_of_Human_Nature_(1888).djvu/109.
Jabbour, Richard J., Matthew J. Shun-Shin, Judith A. Finegold, Sohaib S. M. Afzal, Christopher Cook, Sukhjinder S. Nijjer, Zachary I. Whinnett, Charlotte H. Manisty, Josep Brugada, and Darrel P. Francis. 2015. “Effect of Study Design on the Reported Effect of Cardiac Resynchronization Therapy (CRT) on Quantitative Physiological Measures: Stratified Meta-Analysis in Narrow-QRS Heart Failure and Implications for Planning Future Studies.” Journal of the American Heart Association 4 (1): e000896. https://doi.org/10.1161/JAHA.114.000896.
Jurca, Ales, Jure Žabkar, and Sašo Džeroski. 2019. “Analysis of 1.2 Million Foot Scans from North America, Europe and Asia.” Scientific Reports 9 (1): 19155. https://doi.org/10.1038/s41598-019-55432-z.
Katz, Ralph, Steven Kegeles, Nancy Kressin, Lee Green, Min Qi Wang, Sherman James, Stefanie Luise Russell, and Cristina Claudio. 2006. “The Tuskegee Legacy Project: Willingness of Minorities to Participate in Biomedical Research.” Journal of Health Care for the Poor and Underserved 17 (4): 698–715. https://doi.org/10.1353/hpu.2006.0126.
Kramer, Adam D. I., Jamie E. Guillory, and Jeffrey T. Hancock. 2014. “Experimental Evidence of Massive-Scale Emotional Contagion Through Social Networks.” Proceedings of the National Academy of Sciences 111 (24): 8788–90. https://doi.org/10.1073/pnas.1320040111.
Lam, Diana L., Pari V. Pandharipande, Janie M. Lee, Constance D. Lehman, and Christoph I. Lee. 2014. “Imaging-Based Screening: Understanding the Controversies.” AJR. American Journal of Roentgenology 203 (5): 952–56. https://doi.org/10.2214/AJR.14.13049.
Leavy, Patricia. 2017. Research Design: Quantitative, Qualitative, Mixed Methods, Arts-Based, and Community-Based Participatory Research Approaches. https://www.guilford.com/books/Research-Design/Patricia-Leavy/9781462514380.
Maslin, Douglas, and Marc Wallace. 2018. “Cutaneous Larva Migrans with Pulmonary Involvement.” BMJ Case Reports 2018 (February). https://doi.org/10.1136/bcr-2017-223508.
Maxim, L. Daniel, Ron Niebo, and Mark J. Utell. 2014. “Screening Tests: A Review with Examples.” Inhalation Toxicology 26 (13): 811–28. https://doi.org/10.3109/08958378.2014.955932.
Oaten, Megan, Richard J. Stevenson, and Trevor I. Case. 2011. “Disease Avoidance as a Functional Basis for Stigmatization.” Philosophical Transactions of the Royal Society B: Biological Sciences 366 (1583): 3433–52. https://doi.org/10.1098/rstb.2011.0095.
Packer, Milton, Stefan D. Anker, Javed Butler, Gerasimos Filippatos, Stuart J. Pocock, Peter Carson, James Januzzi, et al. 2020. “Cardiovascular and Renal Outcomes with Empagliflozin in Heart Failure.” New England Journal of Medicine, August. https://doi.org/10.1056/NEJMoa2022190.
Pishro-Nik, Hossein. 2014. Introduction to Probability, Statistics, and Random Processes. Kappa Research LLC. https://www.probabilitycourse.com/.
Popper, Karl. 1959. The Logic of Scientific Discovery. http://archive.org/details/PopperLogicScientificDiscovery.
Rafferty, Elizabeth A., Jeong Mi Park, Liane E. Philpotts, Steven P. Poplack, Jules H. Sumkin, Elkan F. Halpern, and Loren T. Niklason. 2013. “Assessing Radiologist Performance Using Combined Digital Mammography and Breast Tomosynthesis Compared with Digital Mammography Alone: Results of a Multicenter, Multireader Trial.” Radiology 266 (1): 104–13. https://doi.org/10.1148/radiol.12120674.
Safra, Lou, Coralie Chevallier, Julie Grèzes, and Nicolas Baumard. 2020. “Tracking Historical Changes in Trustworthiness Using Machine Learning Analyses of Facial Cues in Paintings.” Nature Communications 11 (1): 4728. https://doi.org/10.1038/s41467-020-18566-7.
Savian, Francesco, Fabrizio Ginaldi, Rita Musetti, Nicola Sandrin, Giulia Tarquini, Laura Pagliari, Giuseppe Firrao, Marta Martini, and Paolo Ermacora. 2020. “Studies on the Aetiology of Kiwifruit Decline: Interaction Between Soil-Borne Pathogens and Waterlogging.” Plant and Soil 456 (1): 113–28. https://doi.org/10.1007/s11104-020-04671-5.
Schünemann, Holger, Jan Brożek, Gordon Guyatt, and Andrew Oxman, eds. 2013. “GRADE Handbook.” https://gdt.gradepro.org/app/handbook/handbook.html.
Schwarzer, U., F. Sommer, T. Klotz, M. Braun, B. Reifenrath, and U. Engelmann. 2001. “The Prevalence of Peyronie’s Disease: Results of a Large Survey.” BJU International 88 (7): 727–30. https://doi.org/10.1046/j.1464-4096.2001.02436.x.
Shaw, David. 2015. “Facebook’s Flawed Emotion Experiment: Antisocial Research on Social Network Users:” Research Ethics, May. https://doi.org/10.1177/1747016115579535.
Stern, Bodo M., and Erin K. O’Shea. 2019. “A Proposal for the Future of Scientific Publishing in the Life Sciences.” PLoS Biology 17 (2). https://doi.org/10.1371/journal.pbio.3000116.
Tufte, Edward. 2006. Beautiful Evidence. Graphics Pr. https://www.edwardtufte.com/tufte/books_be.
Versi, E. 1992. “"Gold Standard" Is an Appropriate Term.” BMJ : British Medical Journal 305 (6846): 187. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1883235/.
Vickers, Andrew J., Barry S. Kramer, and Stuart G. Baker. 2006. “Selecting Patients for Randomized Trials: A Systematic Approach Based on Risk Group.” Trials 7 (1): 30. https://doi.org/10.1186/1745-6215-7-30.
Vu, Julie, and David Harrington. 2020. Introductory Statistics for the Life and Biomedical Sciences. First. OpenIntro.org. https://www.openintro.org/book/biostat/.
Walonoski, Jason, Mark Kramer, Joseph Nichols, Andre Quina, Chris Moesel, Dylan Hall, Carlton Duffett, Kudakwashe Dube, Thomas Gallagher, and Scott McLachlan. 2018. “Synthea: An Approach, Method, and Software Mechanism for Generating Synthetic Patients and the Synthetic Electronic Health Care Record.” Journal of the American Medical Informatics Association 25 (3): 230–38. https://doi.org/10.1093/jamia/ocx079.
Wu, Xiaolin, and Xi Zhang. 2016. “Automated Inference on Criminality Using Face Images.” arXiv:1611.04135 [Cs], November. http://arxiv.org/abs/1611.04135.