4  Sampling, data distribution, and secure data custody

Predicting the future is easier than knowing the present.

Any research process requires a general understanding of the question under review: a grounding in some causal mechanism behind an observed effect. How, though, do you begin when you have an effect characterised by a series of unusual observations but no background information at all? More challenging still, how do you protect the integrity of the information you collect when you don’t know what will be important, or what is unnecessary and risks the safety of study participants?

The objective of this lesson is to learn about the initial investigative process for scoping a research topic, to understand distribution patterns, and to appreciate the risks to which we may expose others in the course of a study.

Session 4 lecture, Thursday 10 July 2025

Research Question

It is challenging to identify the causal mechanism behind an observed effect even when you have a body of research to lean on and guide you. When you start with no causal hypothesis, research has to be more open-ended. Evaluate two papers (Schwarzer et al. 2001; Savian et al. 2020) and assess whether the measurements and approach to research are appropriate and likely to lead to a useful foundation for further research.

The two papers we will refer to are these:

Schwarzer U., Sommer F., Klotz T. et al., The prevalence of Peyronie’s disease: results of a large survey, BJU International, vol. 88, number 7, pp. 727–730, 2001. doi.org/10.1046/j.1464-4096.2001.02436.x

Savian Francesco, Ginaldi Fabrizio, Musetti Rita et al., Studies on the aetiology of kiwifruit decline: interaction between soil-borne pathogens and waterlogging, Plant and Soil, vol. 456, number 1, pp. 113–128, November 2020. doi.org/10.1007/s11104-020-04671-5

Don’t worry if you don’t understand them on first read. Our objective is to learn how to evaluate research even when you are not immersed in the topic.

4.1 Ethics: causal reasoning and risks to data subjects

Acknowledge the privacy and confidentiality issues in data storage and security of personal data.

On 21 December 1998, Gugu Dlamini, a bright and conscientious 36-year-old volunteer field worker for the National Association of People Living With HIV-AIDS, was dragged out of her home by her neighbours in KwaMashu, South Africa, and beaten to death in front of her 12-year-old daughter.

Dlamini had tried to end the stigma associated with HIV by announcing her own positive status in a series of radio and television appearances on World AIDS Day, on 1 December. Despite continuing threats of violence, she received no police protection.

More than 20 years later, 3 million South Africans have died as a result of AIDS and 7.5 million people are infected.

The stigma associated with the illness prevented diagnosis and treatment early in the epidemic, and continues to permit the disease to spread unchecked. As of 2020, an estimated 70,000 deaths annually - 25% of total mortality - are attributed to it in a population of 57 million people.

How different would South Africa’s experience of the HIV pandemic have been were it not for the secrecy imposed by the threat of violence on those who are infected?

4.1.1 The gap between stigma and standpoint

No research question can be separated from the people and communities affected by the answer. Merely to ask a question is to intervene into unknown circumstances causing unknown consequences.

When a chicken dies in a shed on a rural farm in Tanzania, is it because of a common disease, toxins in the feedstock, or a new and unknown virus? When a café fails in a popular retail street in Mexico, is it because of bad luck, a pattern of poor management, political or criminal interference, or a precursor to a broader economic downturn?

And do those affected have any reason to keep information from you?

Any researcher in South Africa in the 1990s working to understand the emerging AIDS pandemic needed to know the prevalence and growth rate of the disease, as well as precisely who was at risk. Finding out meant asking vulnerable people to reveal their HIV status to a complete stranger knowing that any public knowledge of that status could cause their murder.

The challenge for investigating any disease or social condition is that your study subjects may have reasons to avoid participating.

Stigma - the chronic social and physical avoidance of a person by other people - is a terrifying and legitimate concern (Oaten, Stevenson, and Case 2011). Potential victims risk losing their jobs, families, social position, or even their lives.

Consider the initial announcement of “A.I.D.” back in 1982, when the New York Times described it as “gay cancer” since many infections had presented as a rare cancer - Kaposi’s sarcoma - in gay men. Since we now know HIV is transmitted through sexual activity, blood transfusions, and sharing of injection needles, how far did the virus spread because of a false sense of security in people outside of the gay community? How much has subsequent bias against the victims of HIV been set by prejudice against gay people?

Research bias is inevitable when only a small subset of those affected are willing to provide evidence to inform a study, and that can lead to erroneous conclusions and solutions. Compounding that problem will be the researcher’s own standpoint.

Standpoint theory states two propositions (Baggini and Fosl 2007):

  • How a person perceives the world is related to the social, economic and gendered position from which they see that world;
  • Moral reasoning is neither uniform nor universal.

As illustration, minorities - be they ethnic, religious, or gendered - will understand both their own world and that of the majority, where they work or interact, whereas members of the majority, or the privileged, must make an active effort to see beyond the world they inhabit by default.

There is an instinct to relate any field of interest, and the reasons things happen the way they do, to one’s own standpoint. Any researcher working on a study concerning things they have not experienced directly, or amongst a community substantially different from their own, must always make an effort to see beyond their standpoint. That lack of context could cause them to miss important potential risks for their study subjects or themselves.

This is as true for biologists, who must not impute human behaviour onto animals, as it is for historians, who must consider the context of people living millennia in the past.

Between a researcher’s standpoint, and a study subject’s hidden motivations, lies a research process expected to reveal an underlying truth.

4.1.2 Pragmatic causal reasoning and risks to data subjects

Causal reasoning assumes that, if facts exist, they can be known and permit us the opportunity to build a rigorous causal mechanism by which a research method will produce a consistent and repeatable answer to a clearly defined research question.

The problem is that facts are not always objective. You can measure a height, but can you measure a risk? And will such causal reasoning survive contact with a future premised on very different technology and knowledge?

A pragmatic approach to deliberation holds that reasoning and theorising cannot be separated from context. There may never be an independent point of view where the actual practices, conditions, or problems can be observed impartially (Baggini and Fosl 2007).

Causal reasoning is always directly connected to social, concrete and practical life, grounded in the experience, history and habits of those who engage in it. Different societies may face different problems, have different needs and sensibilities, and hold different values.

What a pragmatic approach to causal reasoning recognises is that both actions and their consequences must be considered equally relevant since they can never be separated from one another.

Where this gets you is that initial causal research requires an incredibly broad set of speculative questions which - if human subjects are involved - risks straying far into the personal lives of those studied. And that requires evaluation of both direct and indirect risks inherent in data collection, movement, and storage.

All data are political, and politics change.

Immigration is a contentious subject in many countries. Undocumented migrants - those with an ambiguous legal right to live or work in their country of residence - do their best to remain invisible to government agencies. People who arrived in a country as children struggle under a threat of deportation to a country they may have no connection to, and where they may not even speak the language.

In 2012, United States President Barack Obama established the Deferred Action for Childhood Arrivals (DACA) policy, permitting individuals who arrived in the US as undocumented child immigrants to receive deferred action from deportation as long as they obeyed specific rules, and eventually to become eligible for a work permit.

By 2018, almost 700,000 people had applied for recognition through the program.

After the election of President Donald Trump in 2016, DACA was revoked and the database of DACA applicants was passed to US Immigration and Customs Enforcement (ICE) with the risk that everyone on that list would subsequently be deported.

The data was originally collected with the promise that applicants would not be targeted for deportation. That has now turned into a risk not only in the immediate sense, for the applicants, but also for the state. There are an enormous number of situations where governments need to keep track of minority groups who may be at particular risk to exploitation, disease, or even who may pose a threat to the broader public.

If a government cannot be trusted to maintain confidentiality with data collected - or, worse, compels access to data it does not control - then future queries aimed at minority groups are likely to be ignored or avoided.

When we collect data we take on a great responsibility. Not only will people use these data to conduct research, build commercial applications or even make life-changing decisions about where to live, work or study, but some people - adversaries - will try to use these data to do harm.

Stigma cuts both ways. If people, who you hope will support your work indirectly or directly as data subjects, come to regard researchers as universally untrustworthy or unethical, then your ability to conduct research is at risk.

4.1.3 Metadata and American revolutionary politics of 1772

Data, no matter how poorly structured, can be understood with appropriate metadata. Data, no matter how well structured, cannot be understood without it.

Metadata are critical to understanding the meaning behind data and, in and of themselves, are highly suggestive.

As illustration, consider Paul Revere, the American revolutionary hero of the war of independence against the British. That war, which ran from 1775 to 1783, pitted colonial settlers against the British Empire.

Kieran Healy, a professor of Sociology at Duke University, imagined what would happen if the British spymasters of the time had access to social network analysis tools of the present. In “Using Metadata to find Paul Revere” he demonstrates how metadata alone can reveal Paul Revere’s centrality and importance to the revolutionary efforts of the nascent uprising.

In 1772, the British were struggling to deal with rebellion and an emerging popular uprising. Leaders of various independence cells were arrested, but the authorities had no sense of how these rebellions were organised. Using data stating simply which organisations various people belonged to - without knowing anything about these people or their beliefs - Healy goes from a table (of which this is an extract):

                    StAndrewsLodge  LoyalNine  NorthCaucus  LongRoomClub  TeaParty  Bostoncommittee  LondonEnemies
Adams.John                       0          0            1             1         0                0              0
Adams.Samuel                     0          0            1             1         0                1              1
Allen.Dr                         0          0            1             0         0                0              0
Appleton.Nathaniel               0          0            1             0         0                1              0
Ash.Gilbert                      1          0            0             0         0                0              0
Austin.Benjamin                  0          0            0             0         0                0              1
Austin.Samuel                    0          0            0             0         0                0              1
Avery.John                       0          1            0             0         0                0              1
Baldwin.Cyrus                    0          0            0             0         0                0              1
Ballard.John                     0          0            1             0         0                0              0

To eventually building up a complete network of social interactions between various people:

Figure: Revolutionary social network

That may not mean much to you, but you can see different clusters, or organisations, of people linked by individuals who span these clusters. One such link is seen here:

Figure: Paul Revere

A government arresting various people at the centre of each of the major clusters may want to discover who may be uniting these disparate groups. Are there hidden organisers? Such a network map - offering no other information than who knows who - is incredibly powerful.

British analysts didn’t know these techniques, simple as they are, in 1772, and the rest is history.
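Healy’s co-membership analysis can be sketched in a few lines of pandas (an illustration, not his code): multiplying the person-by-organisation membership matrix by its transpose yields a person-to-person adjacency matrix built from nothing but metadata. The frame below hand-copies a few rows and columns from the extract above.

```python
import pandas as pd

# A hand-copied subset of the membership matrix above: rows are people,
# columns are organisations, 1 means the person is a member.
members = pd.DataFrame(
    {
        "NorthCaucus":   [1, 1, 1, 1, 0],
        "LongRoomClub":  [1, 1, 0, 0, 0],
        "LondonEnemies": [0, 1, 0, 0, 1],
    },
    index=[
        "Adams.John", "Adams.Samuel", "Allen.Dr",
        "Appleton.Nathaniel", "Austin.Benjamin",
    ],
)

# (M @ M.T)[i, j] counts how many organisations persons i and j share:
# an adjacency matrix derived purely from metadata.
adjacency = members @ members.T

# A crude centrality score: total memberships shared with everyone else
# (subtracting the diagonal, which is a person's own membership count).
centrality = adjacency.sum(axis=1) - pd.Series(
    adjacency.values.diagonal(), index=adjacency.index
)
print(centrality.sort_values(ascending=False))
```

On Healy’s full matrix, this same multiplication is what surfaces Paul Revere as the individual bridging the clusters.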

4.1.4 Assessing risk from data publication

Network and statistical analyses are central to all research, from epidemiology to healthcare, education, environmental science, physics and engineering. They are also at the heart of the advertising recommendation engines used online.

Some of these uses are profoundly beneficial - such as, identifying symptoms specifically indicative of disease - and some are terrifyingly abusive - identifying people by their religious beliefs or personal characteristics.

Singapore’s Personal Data Protection Commission describe the following disclosure risks:

  • Identity disclosure (re-identification): revealing the identity of the individual described by a specific record. This could arise from scenarios such as insufficient anonymisation, re-identification by linking, or pseudonym reversal.
  • Attribute disclosure: determining that an attribute described in the dataset belongs to a specific individual, even if the individual’s record cannot be distinguished.
  • Inference disclosure: making an inference about an individual even if they are not in the dataset, by statistical properties of the dataset.

A typical example of such a risk is in the release of medical history. For example, a dataset containing anonymised patient records from a surgeon reveals that all of the surgeon’s patients below the age of 30 have undergone a particular procedure. If it is known that a specific individual is 28 years old and is a client of this surgeon, we then know that this individual has undergone that procedure, even though the individual’s record cannot be distinguished from others in the anonymised dataset.
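This kind of leak can be checked for mechanically. The sketch below, using hypothetical records and column names, flags any group in which an attribute takes only one value, since membership of that group then reveals the attribute:

```python
import pandas as pd

# Hypothetical anonymised records: no names, yet an attribute still leaks.
records = pd.DataFrame({
    "age_band":  ["<30", "<30", "<30", "30+", "30+"],
    "procedure": ["X",   "X",   "X",   "X",   "Y"],
})

# If every record in an age band shares the same procedure, knowing a
# patient's age band alone reveals their procedure (attribute disclosure).
leaks = (
    records.groupby("age_band")["procedure"]
    .nunique()
    .eq(1)
)
print(leaks)
```

Here the "<30" group is fully determined, so an adversary who knows only a patient's age band learns their procedure.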

The risk is not only from one dataset, but from the way in which multiple datasets can be recombined, and how these data are stored and shared.

4.1.5 Risks from data-at-rest and data-in-motion

Data are not always in use. They may sit as research data in spreadsheets on unused storage media, travel by email between people, or lie open on someone’s computer. Each of these moments constitutes a different type of risk of accidental disclosure or inadvertent exposure.

  • Data-at-rest: inactive data stored digitally or physically in any format, whether on paper in files, or in databases or spreadsheets on hard-drives, phones or other forms of storage. Data-at-rest are data which are either not connected to a network and powered off, or stored securely in a physical vault.
  • Data-in-motion: active data either in transit between users, or in use in applications. Data-in-motion are considered to be data accessible via a network, where data have the capacity - even if securely stored and encrypted - to be accidentally released to the public either via the internet, or via email.

On conclusion of your study, your database will be closed and - aside from any potential need for post-surveillance monitoring - your data are unlikely to require regular access. At this stage, direct network-based access to your data may be removed, and data drives placed either in a secure vault or stored on indirect access systems such as tape drives.

No matter where or how they are stored, data are at risk not just from physical attackers - adversaries - but also from accidental destruction (fires, floods, erasure, etc.). If you are collaborating to ensure anonymisation of a dataset, you need to protect that data from accidental disclosure when emailing it to colleagues in various states of processing. Email is inherently insecure.

Data are also subject to the law of applicable governments. If an individual in your study is being investigated by the police, and they demand your data, are you able to extract just that person’s information, or have you so designed things that you expose people to police intrusion who are not part of that investigation? And what happens if the law is imposed by a government seeking to commit acts generally regarded as human rights abuse?

The process of securing data is not only about the end-result, but about all the intermediate steps. Data must be kept secure not just in their final state, but also throughout their use, transmission and storage.

Your responsibilities towards data subject confidentiality and continued data integrity do not end. The risks from data-at-rest include:

  • Physical degradation or destruction of the physical storage (fire, or simple decay);
  • Physical theft of the data;
  • Electronic theft via unsecured networks for data left accessible;
  • Electronic destruction or degradation via remote vandalism or hacking;

Each of these requires a backup plan and risk mitigation, and must be included in your risk register. Similarly, a long-term budget for maintaining data at rest must be included in the initial study.

You can take all the care in the world prior to publication, and then leave your source data on a flash-drive on a bus or expose sensitive information while working on your notebook computer in a public place.

As we strive to protect data subjects we should always ensure that first we do no harm.

4.1.6 Strategies for risk avoidance and risk management

There are a number of risks which need to be identified before a study begins, while it is underway, and once it is complete.

These include compliance with regulatory environments, the security and privacy of the data under management, and the implications of poor data subject selection criteria on the eventual observations and results produced by the study.

4.1.6.1 Use a risk register to identify and describe risks

A risk register operates at the level of overall study operations (including assessing risks in operating procedures, computerised systems, and personnel) and in the study itself (including data integrity, data subject safety and privacy, and sample bias).

While it is certainly not possible to identify every risk in advance, a useful approach is to bring all stakeholders together to consider the assumptions required for success, and then estimate different degrees of impact should those assumptions be compromised.

A good method for working through the assumptions which underpin the success of various objectives and specific activities is to draw up a logframe (logical framework) diagram.

A logframe lays out objectives at successive levels:

  • Development objective: the final objective towards which you are trying to work.
  • Immediate objective: the effect that is expected to be achieved as the result of the study.
  • Outputs: the results which the study is expected to guarantee will occur.
  • Activities: the activities which have to be done to produce the outputs.
  • Inputs: the equipment and skilled personnel needed to undertake the activities.

Alongside each level of objective, the logframe records:

  • Indicators: measures (direct or indirect) to verify to what extent the objectives have been fulfilled. The means of verification should be specified.
  • External factors: important conditions, events or decisions that are outside the control of the study, and which could affect the outcome of the objectives.

From the list of external factors and indicators you can draw up your risk register. The indicators provide an expected range; to the degree that results fall outside that range, you can consider the external factors at play, and your mitigation.

4.1.6.2 Analyse risks and establish strategies for mitigation or response

Each risk should be rated according to the likelihood it will occur, the degree of impact it will have on data subjects and the reliability of study results, and the mechanism and degree to which it may be detected.

Effectively:

  • How will you know if a risk is happening?
  • How big an effect will it have on the study?
  • What will you do to reduce that effect?

You can rate these risks using a 1-10 scale, by some classification, or any way which works for you. Note that rare risks with an outsize impact do happen. And note how bias in a sample can have an extreme impact.
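One minimal way to hold such a register is a likelihood-times-impact score per risk. The entries, field names and 1-10 scales below are hypothetical; a real register will follow your organisation's template:

```python
from dataclasses import dataclass

@dataclass
class Risk:
    description: str
    likelihood: int   # 1 (rare) to 10 (near-certain)
    impact: int       # 1 (negligible) to 10 (catastrophic)
    detection: str    # how the risk will be noticed
    mitigation: str   # planned response

    @property
    def score(self) -> int:
        # A simple priority score; note that rare-but-catastrophic
        # risks can still rank highly.
        return self.likelihood * self.impact

# Two invented register entries for illustration.
register = [
    Risk("Participant withdraws consent", 6, 4,
         "consent log review", "over-recruit sample"),
    Risk("Laptop with raw data stolen", 2, 9,
         "asset audit", "full-disk encryption"),
]

# Review risks in priority order.
for risk in sorted(register, key=lambda r: r.score, reverse=True):
    print(f"{risk.score:3d}  {risk.description}")
```

However you score them, the point is the same: every risk carries a detection mechanism and a mitigation, and the register is revisited throughout the project.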

For instance, if your sample size in a clinical trial is extremely small and one of your participants dies - whether this has anything to do with your trial or not - the likelihood is that your trial will be entirely compromised. If certain demographics (e.g. female or youth participants) are under-represented in your sample, and anything happens to one of those participants, then - even where your overall sample size is good - you are likely to bias your results.

Some risk mitigation imposes requirements prior to the commencement of your study and cannot be mitigated afterwards. It is essential to undertake risk analysis during protocol development.

Some rare risks are very easy to mitigate, but very expensive to respond to. As a simple example, leaving home five minutes earlier to catch the bus is a small mitigation compared with the cost of arriving late: waiting an additional hour for the next bus and then missing connecting transport can wreck an entire day.

What this means is that you should be - where possible - measuring for risk in terms of variance from expected results and variances outside of expected ranges should trigger a formal process for risk management. Some risks will need to trigger formal engagement with regulators.

Having some sort of expected range means having some idea of the prevalence of the effect you are attempting to study, which you may not know, and so you will need a replicable and reasonable set of assumptions to justify these ranges.

Each component of the study needs to be assessed and risks added to your register, and you need a formalised process for reassessing and revalidating those risks throughout the project.

4.1.7 Producing replicable research results

Since 2012, a plant sickness has periodically devastated kiwi orchards in Italy.

Researchers have looked for the cause in irrigation practices, bacteria, fungi, soil composition and specific plant disease - but found no clear culprit: the more they studied, the more anomalies cropped up.

Farmers are calling the disease morìa, or “die-off”. Experts are being called upon to figure it out, yet experts are baffled.

The causal mechanism - the direct or indirect mediator producing the observed effect - remains unknown.

There is no shortage of perplexing research questions you can tackle. The challenge - especially where no-one has any idea what the range of causal mechanisms may be - is where to start. How do you go about finding out? And the risk is that the intersection of politics, standpoint, stigma, and other forms of vested interest may make impartial investigation almost impossible.

Some questions - such as the impact of immigration on employment, or sexual orientation on suicide, or drug dealing and organised crime - create risks for data subjects and researchers. You will need to consider issues in data loss, and whether the risk from data loss outweighs the research value of collecting it. You will be the one asking the questions and you will need to ensure that your data subjects can trust you.

Ultimately, though, whatever you do, and however you go about answering your research question, others must be able to replicate and review your work. That does not mean that the deliberative process isn’t given to counter-intuitive leaps of genius, but that the product of that process results in a method anyone can follow.

Science is a method of deliberation: an act of sound reasoning allowing anyone to intelligently and effectively figure out, in particular situations, a morally right practice. This process of deliberation should be accessible and repeatable. Anyone with appropriate skills should be able to understand the deliberative process informing the research, and be able to duplicate it and achieve the same result.

4.2 Curation: securing and anonymising data at risk

Recognise responsibilities and mechanisms for securing data-at-rest and data-in-motion.

Research projects end. Sometimes with the result investigators were hoping for, but most often not. Your responsibility for the data you collate during the study continues indefinitely.

Data custody - the set of techniques ensuring security of data-at-rest and data-in-motion - requires recognition of responsibilities for securing that data, including encryption, authentication, authorisation, and anonymisation. This is true for owners, users, and intermediaries.

While securing data custody is a complex task involving multiple components, those components can be divided into the ones a data scientist is likely to be responsible for, and the ones where the data scientist provides support to security professionals working on the complete chain of custody.

  • Operations security (opsec) and information security (infosec) form the body of knowledge and practice that is critical to securing data custody, but are not ordinarily the responsibility of a data scientist.
  • Anonymisation forms the body of techniques that ensure the protection of data subjects, and is part of the data scientist’s responsibilities.

Anonymisation is the set of techniques which irreversibly alter personal data through encryption, redaction or aggregation so that a data subject can no longer be identified directly or indirectly from a dataset. To support the integrity of research and analysis, anonymisation must preserve the semantics of the original data so that entity resolution - the internal relationships between data - is still realisable.

The complexity of managing these risks cannot be overstated. This lesson can only be an overview of the techniques and skills required. Please take guidance from operations and information security professionals.

4.2.1 Operations and information security

Security processes are usually set at the organisation level and managed and implemented by specialists. Researchers and data scientists will be subject to these requirements and have a responsibility to ensure that they communicate regularly with the security team to bring their information inside the secure environment.

Operations security is a five-step iterative process to help identify information requiring protection and to develop suitable measures to provide that protection:

  1. Identify sensitive information: critical information about research techniques, safeguarding, data subjects, finance, employees and researchers, and intellectual property. This can be narrowed down to the creation of a Critical Information List (CIL) to focus resources on vital information, rather than attempting to protect all information.

  2. Analyse threats: threats come from adversaries; any individual or group that may overtly, or accidentally, disrupt or compromise your information. Adversaries may come from inside your research team, or from outside, and can be classified in terms of their intent and capability to cause harm. The greater the combined intent and capability of the adversary, the greater the threat.

  3. Analyse vulnerabilities: threat is the strength of adversaries, while vulnerability is the weakness in your chain of information and data custody. These vulnerabilities occur everywhere along the data management process, from transmission to movement between systems and people, to mechanisms of access and sharing.

  4. Assess risk: prioritise vulnerabilities in terms of a risk score, calculated from the probability that information will be released and the impact if such a release occurs. Identify possible countermeasures for each vulnerability.

  5. Apply appropriate measures: countermeasures must be continually monitored to ensure that they continue to protect current information against relevant threats. Countermeasures may include:

    • documented change-management processes,
    • network restrictions to authorised individuals,
    • the minimum access to each component as required by individuals - no matter how senior - to perform their jobs,
    • task automation with auditable records,
    • surveillance, incident response, and disaster recovery plans in place.

In reality, we cannot mitigate every threat, so we have three possible responses:

  • Reduce or mitigate threats through the implementation of safeguards and countermeasures.
  • Assign or transfer the threat, or the cost of the threat, to another organisation, which can be done by outsourcing a service, or even purchasing insurance.
  • Accept the threat if the cost of countering it is greater than the loss.

Ethics has a huge role in assessing threats and responses, since the cost is not borne only by the researcher or the organisation they work for. Other moral agents and subjects are exposed to this risk. If the death of a moral subject is a risk during a study, and the cost of countering that risk is insurmountable, then the option to accept it is not to be taken lightly.

Information security is organised around these key concepts:

  • Confidentiality: ensuring that information is not made available or disclosed to unauthorised individuals, entities or processes; confidentiality can be compromised through anything from a lost USB drive or a stolen password to an email sent to the wrong address.
  • Authenticity and integrity: maintaining and assuring the accuracy and completeness of data over the entire research cycle, meaning data shouldn’t be modified in ways that are unauthorised or undetected, and that information can be validated or verified as accurate and complete whenever required.
  • Availability: supporting the continuous functioning of all mechanisms used to store, process, distribute, access and communicate information, with mitigations for disruption from fire, flood, power outages, hardware failures, denial-of-service attacks, systems upgrades and updates, amongst others.
  • Non-repudiation: a legal concept supported by technology and processes which assures parties to a transaction that information has been requested, sent and received, and that both sender and receiver are able to prove that information authenticity and integrity can be validated.
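Authenticity and integrity, for example, are commonly supported with cryptographic digests. A minimal sketch using Python’s standard hashlib (the file content here is invented):

```python
import hashlib

def sha256_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest used to verify data integrity."""
    return hashlib.sha256(data).hexdigest()

# Record a digest when the dataset is archived...
original = b"patient_id,procedure\n034e9e3b,X\n"
recorded = sha256_digest(original)

# ...and recompute it whenever the data are retrieved: any undetected
# modification changes the digest, so tampering becomes detectable.
assert sha256_digest(original) == recorded
assert sha256_digest(original + b"tampered") != recorded
```

A digest alone proves integrity, not authenticity; in practice it is combined with signatures or keyed hashes managed by the infosec team.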

As the data “owner”, responsibility for designing mechanisms for validation, or assessing the risk from information disclosure, rests with the data scientist. They will need to work with infosec teams to ensure secure data custody.

4.2.2 Data anonymisation through redaction and aggregation

There are a wide range of techniques available to support anonymisation. Broadly, though, they fit into two types:

  • Redaction: in which we remove fields or line-item information while maintaining sufficient integrity to permit semantic analysis;
  • Aggregation: in which we deliberately aggregate data to ensure outlier anonymity;

We will use a single dataset for this lesson, produced by Synthea, an open-source project which allows the creation of synthetic data. Such data are produced via randomness algorithms to generate pseudo-information useful for testing analytical and anonymisation systems (Walonoski et al. 2018).

Note that Synthea is a much more sophisticated version of the synthetic data generator we developed during the last two lessons.

Code
import pandas as pd
import numpy as np
import uuid
import geopandas as gd
import matplotlib.pyplot as plt

# Load the synthetic patient encounter records generated by Synthea
df = pd.read_csv("data/lesson-4/patient-data-anonymisation-exercise.csv")
df.head()
PATIENT_ID START STOP ENCOUNTERCLASS DESCRIPTION TOTAL_CLAIM_COST PAYER_COVERAGE REASONDESCRIPTION BIRTHDATE DEATHDATE ... ETHNICITY GENDER BIRTHPLACE ADDRESS CITY STATE COUNTY ZIP LAT LON
0 034e9e3b-2def-4559-bb2a-7850888ae060 2010-01-23T17:45:28Z 2010-01-23T18:10:28Z ambulatory Encounter for symptom 129.16 54.16 Acute bronchitis (disorder) 14/11/1983 NaN ... nonhispanic M Danvers Massachusetts US 422 Farrell Path Unit 69 Somerville Massachusetts Middlesex County 2143.0 42.360697 -71.126531
1 034e9e3b-2def-4559-bb2a-7850888ae060 2012-01-23T17:45:28Z 2012-01-23T18:00:28Z wellness General examination of patient (procedure) 129.16 129.16 NaN 14/11/1983 NaN ... nonhispanic M Danvers Massachusetts US 422 Farrell Path Unit 69 Somerville Massachusetts Middlesex County 2143.0 42.360697 -71.126531
2 034e9e3b-2def-4559-bb2a-7850888ae060 2015-01-26T17:45:28Z 2015-01-26T18:15:28Z wellness General examination of patient (procedure) 129.16 129.16 NaN 14/11/1983 NaN ... nonhispanic M Danvers Massachusetts US 422 Farrell Path Unit 69 Somerville Massachusetts Middlesex County 2143.0 42.360697 -71.126531
3 034e9e3b-2def-4559-bb2a-7850888ae060 2016-12-29T17:45:28Z 2016-12-29T18:00:28Z ambulatory Encounter for symptom 129.16 54.16 Acute bronchitis (disorder) 14/11/1983 NaN ... nonhispanic M Danvers Massachusetts US 422 Farrell Path Unit 69 Somerville Massachusetts Middlesex County 2143.0 42.360697 -71.126531
4 034e9e3b-2def-4559-bb2a-7850888ae060 2017-01-09T17:45:28Z 2017-01-09T18:00:28Z outpatient Encounter for check up (procedure) 129.16 54.16 NaN 14/11/1983 NaN ... nonhispanic M Danvers Massachusetts US 422 Farrell Path Unit 69 Somerville Massachusetts Middlesex County 2143.0 42.360697 -71.126531

5 rows × 28 columns

Figure 4.1: Synthea synthetic patient data

These are synthetic data. Nothing here is real, but imagine it was. These data reveal not only the name of each patient and their medical history, but also their address, driver’s licence, passport number, ethnicity and the latitude and longitude of their address.

4.2.2.1 Redaction strategies and methods

Anyone conducting clinical research needs these data, but not all of them. How much can be redacted without compromising the data’s research value?

If we removed all the names and addresses, we could certainly protect individuals but at the expense of understanding disease progression.

Consider the patient history of Carmelia Konopelski:

Code
df[df["PATIENT_ID"] == "71ba0469-f0cc-4177-ac70-ea07cb01c8b8"]
PATIENT_ID START STOP ENCOUNTERCLASS DESCRIPTION TOTAL_CLAIM_COST PAYER_COVERAGE REASONDESCRIPTION BIRTHDATE DEATHDATE ... ETHNICITY GENDER BIRTHPLACE ADDRESS CITY STATE COUNTY ZIP LAT LON
703 71ba0469-f0cc-4177-ac70-ea07cb01c8b8 2002-01-15T20:46:46Z 2002-01-15T21:01:46Z ambulatory Encounter for problem 129.16 54.16 NaN 21/11/2000 21/11/2012 ... nonhispanic F Lee Massachusetts US 1025 Collier Arcade Ashland Massachusetts Middlesex County NaN 42.291986 -71.463724
704 71ba0469-f0cc-4177-ac70-ea07cb01c8b8 2002-01-25T20:46:46Z 2002-01-25T21:37:46Z ambulatory Encounter for problem 129.16 54.16 NaN 21/11/2000 21/11/2012 ... nonhispanic F Lee Massachusetts US 1025 Collier Arcade Ashland Massachusetts Middlesex County NaN 42.291986 -71.463724
705 71ba0469-f0cc-4177-ac70-ea07cb01c8b8 2002-11-28T20:46:46Z 2002-12-12T20:46:46Z ambulatory Encounter for symptom 129.16 54.16 Perennial allergic rhinitis with seasonal vari... 21/11/2000 21/11/2012 ... nonhispanic F Lee Massachusetts US 1025 Collier Arcade Ashland Massachusetts Middlesex County NaN 42.291986 -71.463724
706 71ba0469-f0cc-4177-ac70-ea07cb01c8b8 2003-04-29T20:46:46Z 2003-04-29T21:01:46Z wellness Well child visit (procedure) 129.16 54.16 NaN 21/11/2000 21/11/2012 ... nonhispanic F Lee Massachusetts US 1025 Collier Arcade Ashland Massachusetts Middlesex County NaN 42.291986 -71.463724
707 71ba0469-f0cc-4177-ac70-ea07cb01c8b8 2003-05-11T20:46:46Z 2003-05-11T21:01:46Z ambulatory Encounter for symptom 129.16 54.16 Streptococcal sore throat (disorder) 21/11/2000 21/11/2012 ... nonhispanic F Lee Massachusetts US 1025 Collier Arcade Ashland Massachusetts Middlesex County NaN 42.291986 -71.463724
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
783 71ba0469-f0cc-4177-ac70-ea07cb01c8b8 2012-05-08T20:46:46Z 2012-05-08T21:01:46Z ambulatory Postoperative follow-up visit (procedure) 129.16 54.16 Malignant neoplasm of breast (disorder) 21/11/2000 21/11/2012 ... nonhispanic F Lee Massachusetts US 1025 Collier Arcade Ashland Massachusetts Middlesex County NaN 42.291986 -71.463724
784 71ba0469-f0cc-4177-ac70-ea07cb01c8b8 2012-05-08T20:46:46Z 2012-05-09T21:07:46Z inpatient Screening surveillance (regime/therapy) 129.16 54.16 Malignant neoplasm of breast (disorder) 21/11/2000 21/11/2012 ... nonhispanic F Lee Massachusetts US 1025 Collier Arcade Ashland Massachusetts Middlesex County NaN 42.291986 -71.463724
785 71ba0469-f0cc-4177-ac70-ea07cb01c8b8 2012-05-08T20:46:46Z 2012-05-09T21:16:46Z inpatient Gynecology service (qualifier value) 129.16 54.16 Malignant neoplasm of breast (disorder) 21/11/2000 21/11/2012 ... nonhispanic F Lee Massachusetts US 1025 Collier Arcade Ashland Massachusetts Middlesex County NaN 42.291986 -71.463724
786 71ba0469-f0cc-4177-ac70-ea07cb01c8b8 2012-08-14T20:46:46Z 2012-08-14T21:01:46Z ambulatory Postoperative follow-up visit (procedure) 129.16 54.16 Malignant neoplasm of breast (disorder) 21/11/2000 21/11/2012 ... nonhispanic F Lee Massachusetts US 1025 Collier Arcade Ashland Massachusetts Middlesex County NaN 42.291986 -71.463724
787 71ba0469-f0cc-4177-ac70-ea07cb01c8b8 2012-11-20T20:46:46Z 2012-11-20T21:01:46Z ambulatory Postoperative follow-up visit (procedure) 129.16 54.16 Malignant neoplasm of breast (disorder) 21/11/2000 21/11/2012 ... nonhispanic F Lee Massachusetts US 1025 Collier Arcade Ashland Massachusetts Middlesex County NaN 42.291986 -71.463724

85 rows × 28 columns

Figure 4.2: Synthea synthetic patient data for Carmelia Konopelski

These data describe ten years of her life, from her birth in 2002, to her death from cancer in 2012. The progression of her illness is critical for research, but her personal details are not.

Before we start doing anything, we need to understand our dataset, and understand how we intend to redact it while maintaining its internal integrity so that we can continue to conduct analysis.

Consider: if these data contained information on all patients in a particular region, and children living near a specific factory had an increased risk of dying from childhood cancer, we would definitely want to know where those children lived. Our approach must therefore:

  • Ensure that individual patient data cannot be recovered.
  • Ensure that geospatial characteristics associated with patient morbidity and mortality are maintained.
  • Consider the risk of outlier anonymity.
  • Test methods for deanonymisation to ensure anonymised data cannot be reconstituted.

This means we can remove data like names, but we need a method for connecting data associated with each patient. We can remove addresses, but need a method to ensure a geographic relationship is maintained.

4.2.2.1.1 Attribute suppression

An attribute is also known as a field. This method requires that we delete an entire field. It is one of the first, and easiest, steps we can take.

  • Remove data we do not need
  • Remove data we cannot easily redact

This is a destructive step since suppression deletes the original data.

4.2.2.1.2 Record suppression

Some data are outliers: sufficiently rare that, in and of themselves, they cannot be anonymised. With record suppression we remove all data related to these individuals. However, tread carefully.

Outliers may be of significant interest if their status is part of the study. If a person’s illness is unusual for the area where they live, for their ethnicity, gender or sexual orientation, then that would make them an outlier. However, that would also be important for understanding the disease.

On the other hand, if their location, ethnicity, gender or sexual orientation have no bearing on the disease, then these could be safely redacted.

How can we know which this is?

The standard measure for assessing outlier risk is called k-anonymity and is beyond the scope of this course. However, there are several techniques (listed in the References) which permit the analysis of data to assess the presence of outliers and the risk of deanonymisation.
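While k-anonymity itself is beyond our scope, its core check - the size of the smallest group of records sharing the same quasi-identifier values - can be sketched in a few lines of pandas. The records below are hypothetical, not drawn from our dataset:

```python
import pandas as pd

# Hypothetical records described only by their quasi-identifiers
records = pd.DataFrame({
    "zip":    ["02143", "02143", "02143", "01201"],
    "gender": ["F", "F", "F", "M"],
})

# k-anonymity requires every combination of quasi-identifier values
# to be shared by at least k records; the smallest group size gives k
group_sizes = records.groupby(["zip", "gender"]).size()
k = group_sizes.min()

# Here k == 1: the lone ("01201", "M") record is an outlier and a
# candidate for record suppression
```

A dataset is said to satisfy k-anonymity for a given k only if every such group contains at least k records.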

4.2.2.1.3 Pseudonymisation

Pseudonymisation is the replacement of identifying data with randomised values. This can be reversible, if you keep a key mapping the original data to the generated values, or irreversible, if you deliberately discard that key. Persistent pseudonyms support linkage of the same individual across different datasets.

  • strings: pseudonymise through replacement

Our dataset already contains a pseudonymous field, PATIENT_ID, but the problem is that this mapping is reversible: it does, absolutely, connect confidential patient information to these records. We can generate genuinely anonymous keys to further suppress the data.
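A minimal sketch of reversible pseudonymisation using an explicit key table; the identifiers here are hypothetical stand-ins for PATIENT_ID values:

```python
import uuid

# Hypothetical original identifiers
ids = ["patient-a", "patient-b", "patient-a"]

# Key table mapping each unique identifier to a fresh UUID. Keeping
# this table makes the pseudonymisation reversible; destroying it
# makes the pseudonyms irreversible.
key = {i: str(uuid.uuid4()) for i in set(ids)}
pseudonymised = [key[i] for i in ids]
```

Because the mapping is persistent, the same individual receives the same pseudonym on every occurrence, which preserves linkage between that individual’s records.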

4.2.2.1.4 Generalisation

Generalisation is a deliberate reduction in the precision of data, such as converting a person’s age into a range, or a precise location into a less precise location.

  • range: conversion of precise numbers into quantiles or statistical ranges
  • cluster: aggregation of geospatial data into statistically less significant clusters - this can also be used to mask outliers

Design the data ranges with appropriate sizes. Sometimes quantiles are the most appropriate, sometimes we use statistical definitions (such as geospatial ranges that are designed to include sufficient numbers of people so as to reduce deanonymisation).
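As a sketch, pandas’ cut function generalises precise ages into ranges; the bin edges and labels below are illustrative assumptions, not prescriptions:

```python
import pandas as pd

# Hypothetical precise ages
ages = pd.Series([4, 17, 34, 58, 71])

# Generalise exact ages into coarse, labelled ranges
bands = pd.cut(ages, bins=[0, 18, 40, 65, 120],
               labels=["0-18", "19-40", "41-65", "66+"])
```

The same function, with bins chosen from the data’s distribution (e.g. via pd.qcut for quantiles), supports range-based generalisation of any numeric field.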

4.2.2.1.5 Shuffling

Shuffling is where data are rearranged such that the individual attribute values are still represented in the dataset, but generally, do not correspond to the original records. This is not appropriate for all data. Swapping diseases amongst different patients will certainly render the data anonymous, but will also confuse any epidemiological analysis.
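A minimal sketch of shuffling a single column of hypothetical toy records; the values survive, but the patient-to-value correspondence is destroyed:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy records
toy = pd.DataFrame({"patient": ["a", "b", "c", "d"],
                    "age": [34, 71, 52, 19]})

# Rearrange the ages: the same values remain in the dataset, but they
# no longer correspond to the original patients
toy["age"] = rng.permutation(toy["age"].to_numpy())
```

Note that aggregate statistics on the shuffled column (mean, median, distribution) are unchanged, which is why shuffling is only safe for fields whose row-level relationships are not part of the analysis.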

4.2.2.1.6 Data Perturbation

Perturbation involves adding random noise to data to “blur” it. This can include rounding, shifting dates, or adding geospatial displacement (jitter) to coordinate data. In each case we artificially shift values within a small range to obscure a person’s exact details.

  • dates: shift exact dates by days or months
  • rounding: round off to the nearest decile or whole number, depending on the precision of the data
  • coordinates: perturb the data through geospatial displacement (jitter)

Care must be taken not to add too little or too much perturbation.
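As a sketch of two of these techniques, date shifting and rounding, applied to hypothetical values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical encounter dates and claim costs
dates = pd.to_datetime(pd.Series(["2010-01-23", "2012-01-23"]))
costs = pd.Series([129.16, 54.16])

# Shift each date by a random number of days in [-14, 14]
offsets = pd.to_timedelta(rng.integers(-14, 15, size=len(dates)), unit="D")
shifted = dates + offsets

# Round costs to the nearest 10 currency units to blur exact amounts
rounded = (costs / 10).round() * 10
```

The 14-day window and nearest-10 rounding are arbitrary choices for illustration; the perturbation scale must be tuned to the precision your analysis actually requires.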

4.2.2.2 Redaction process

Each of these steps permits us to maintain the relationships within each patient’s records, and to the places where their disease progression was recorded, while ensuring patient confidentiality is maintained.

Code
from collections import defaultdict

class Pseudonymise:
    
    def __init__(self, sigma=0.001):
        # Initialise a defaultdict, this creates a default dictionary item if it doesn't exist
        # We use this to ensure we maintain data integrity while still randomising
        # http://ikigomu.com/?p=28
        self.sigma = sigma
        self.mu = 0 # centre the jitter on the true point; sigma controls the spread
        # Pseudo-patient dict
        self.pp = defaultdict(lambda: {"uuid":str(uuid.uuid4()),
                                       "lat": np.random.normal(self.mu, self.sigma),
                                       "lon": np.random.normal(self.mu, self.sigma)
                                      })
        
    def create_data(self, identities):
        """
        For each unique identity produce a unique UUID, and a Gaussian randomised `LAT` and `LON`.

        Parameters
        ----------
        identities: list of strings

        Returns
        -------
        dict
            Each dict entry contains a record containing a "uuid" and modifier for "lat" and "lon"
        """
        for _id in identities:
            self.pp[_id]
        return self.pp
    
    def redact(self, row):
        """
        For a given row in a dataframe, return the pseudonymised version of "PATIENT_ID", "LAT", "LON".
        
        Parameters:
        row: DataFrame row
        
        Returns:
        DataFrame slice of row
        """
        return [
            self.pp[row["PATIENT_ID"]]["uuid"],
            row["LAT"] + self.pp[row["PATIENT_ID"]]["lat"],
            row["LON"] + self.pp[row["PATIENT_ID"]]["lon"],
        ]

# Specify columns for removal
suppression = ["SSN", "DRIVERS", "PASSPORT", "FIRST", "LAST", "MAIDEN", "ADDRESS", "ZIP"]
# And drop them
df.drop(suppression, axis=1, inplace=True)

p = Pseudonymise()
pp_data = p.create_data(df["PATIENT_ID"])
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
fields = ["PATIENT_ID", "LAT", "LON"]
df[fields] = df[fields].apply(p.redact, axis=1, result_type="expand")
df.head()
PATIENT_ID START STOP ENCOUNTERCLASS DESCRIPTION TOTAL_CLAIM_COST PAYER_COVERAGE REASONDESCRIPTION BIRTHDATE DEATHDATE MARITAL RACE ETHNICITY GENDER BIRTHPLACE CITY STATE COUNTY LAT LON
0 c91fbd2b-4795-4d47-b240-2e09465a8365 2010-01-23T17:45:28Z 2010-01-23T18:10:28Z ambulatory Encounter for symptom 129.16 54.16 Acute bronchitis (disorder) 14/11/1983 NaN M white nonhispanic M Danvers Massachusetts US Somerville Massachusetts Middlesex County 42.36022 -71.125387
1 c91fbd2b-4795-4d47-b240-2e09465a8365 2012-01-23T17:45:28Z 2012-01-23T18:00:28Z wellness General examination of patient (procedure) 129.16 129.16 NaN 14/11/1983 NaN M white nonhispanic M Danvers Massachusetts US Somerville Massachusetts Middlesex County 42.36022 -71.125387
2 c91fbd2b-4795-4d47-b240-2e09465a8365 2015-01-26T17:45:28Z 2015-01-26T18:15:28Z wellness General examination of patient (procedure) 129.16 129.16 NaN 14/11/1983 NaN M white nonhispanic M Danvers Massachusetts US Somerville Massachusetts Middlesex County 42.36022 -71.125387
3 c91fbd2b-4795-4d47-b240-2e09465a8365 2016-12-29T17:45:28Z 2016-12-29T18:00:28Z ambulatory Encounter for symptom 129.16 54.16 Acute bronchitis (disorder) 14/11/1983 NaN M white nonhispanic M Danvers Massachusetts US Somerville Massachusetts Middlesex County 42.36022 -71.125387
4 c91fbd2b-4795-4d47-b240-2e09465a8365 2017-01-09T17:45:28Z 2017-01-09T18:00:28Z outpatient Encounter for check up (procedure) 129.16 54.16 NaN 14/11/1983 NaN M white nonhispanic M Danvers Massachusetts US Somerville Massachusetts Middlesex County 42.36022 -71.125387
Figure 4.3: Synthea redacted data after applying appropriate techniques

Each latitude and longitude has a very small additional factor added to it which shifts the physical location of the coordinates. This obscures the true location of the data subject but not so far as to make the data unusable. If you’re worried that this isn’t sufficiently obscured, you can increase the sigma value to increase the scale of the jitter.

4.2.3 Aggregation and differential methods

Latanya Sweeney, Director of the Data Privacy Lab in the Institute for Quantitative Social Science (IQSS) at Harvard, demonstrated that 87% of the US population can likely be uniquely re-identified using only zip code, gender, and date of birth. Our standard redaction techniques may not have reduced the risk for our study participants as much as we hoped.

Aggregation is far more destructive than redaction. We will lose resolution on individual patient histories, and we will lose the direct relationships between data in exchange for summaries of those data. But we will gain security for the individuals concerned.

4.2.3.1 Differential privacy and assessing measures of risk

The problem with aggregation is that there are few generalised methods which can guide you in creating one. Each aggregation is specific to the data under consideration, and to the range of research questions you want to support. There are mechanisms to test for anonymity, but not to guide strategy.

Differential privacy is an advanced topic. It is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about the individuals in it.

A simple example is where we query our database to ask for a specific patient record and then ask for that person’s gender according to the following procedure:

  • Toss a coin.
  • If heads, then toss the coin again (ignoring the outcome), and answer the question honestly.
  • If tails, then toss the coin again and answer “Male” if heads, “Female” if tails.

The seemingly redundant extra toss in the first case is needed in situations where just the act of tossing a coin may be observed by others, even if the actual result stays hidden. The confidentiality then arises from the refutability of the individual responses.

This approach can ensure individual confidentiality simply because the person doing the query can never be certain if they received a correct response or not. Statistically, though, the aggregate result would be the same ratio as in the original data.
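We can sketch this randomised-response procedure, and the statistical recovery of the true ratio, in a few lines; the 30% proportion below is an arbitrary assumption for the simulation:

```python
import random

def randomised_response(true_value, rng=random):
    # First toss: heads -> answer honestly (the cover toss is implicit)
    if rng.random() < 0.5:
        return true_value
    # Tails: a second toss decides the answer at random
    return "Male" if rng.random() < 0.5 else "Female"

random.seed(1)
# A hypothetical population that is 30% "Male"
population = ["Male"] * 300 + ["Female"] * 700
responses = [randomised_response(v) for v in population]

# P(report "Male") = 0.5 * p_true + 0.25, so invert to estimate p_true
p_hat = responses.count("Male") / len(responses)
p_true_estimate = (p_hat - 0.25) / 0.5
```

Any single response is deniable, yet the inversion recovers an estimate close to the true 30% ratio, exactly the trade-off the coin-toss procedure promises.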

K-anonymity (and similar approaches to it like L-diversity and T-closeness) are not aggregation techniques but are measures used to test whether a risk threshold has been exceeded.

4.2.3.2 Aggregation approach and process

Aggregation requires a purpose, and that purpose is defined by a research question we need to answer.

Where redaction is guided by the data almost exclusively, aggregation is guided by the research objectives for the data. Any form of aggregation will limit what can be done. Awareness of these limitations is critical.

Census data are usually aggregated in this way, with the individual microdata (responses from each household) only made available to accredited researchers, while the aggregated versions are made available to the public.

Our objective will be to create groups of data and then perform aggregations on each group. The range of aggregations we can form include:

  • count: count of the individual members of the group;
  • totals: sums of values, and sums of sub-groups within the values (e.g. total duration of illness, and duration of each type of illness);
  • averages: including mean, median and mode of data sequences;
  • distributions: including quantiles, normal curves or other distribution types.

The groups can be by specific categories or geospatial ranges.
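The shape of such an aggregation - group, then summarise - can be sketched on a small hypothetical frame (not our patient data):

```python
import pandas as pd

# Hypothetical pre-grouped records with a cost value
toy = pd.DataFrame({
    "age_range": ["0-20", "0-20", "21-40", "21-40", "21-40"],
    "cost": [100.0, 250.0, 80.0, 120.0, 300.0],
})

# For each group: count of members, total cost and median cost
summary = toy.groupby("age_range")["cost"].agg(["count", "sum", "median"])
```

Only the summary table would be published; the row-level records stay private.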

Aggregations require familiarity and experience with the data being aggregated.

It’s very difficult to simply pick up a random dataset and know how to aggregate it in a way that supports analysis and extracts meaning from it. You are unlikely to be responsible for aggregating data you don’t have experience with, and when you have that experience, knowing how to aggregate it will become clearer.

Given all this, what follows is a very basic worked aggregation example. Let’s create a basic research question to demonstrate aggregation: For deceased patients, summarise the range of ages, specific illnesses, and total and median cost of treatment.

We’ll start by creating a new DEATHAGE column and then getting a distribution of that field:

Code
# Create a copy of the source data since aggregations are destructive
dfa = df.copy()
# Convert the date fields to pandas dates
for c in ["START", "STOP", "BIRTHDATE", "DEATHDATE"]:
    dfa[c] = dfa[c].apply(lambda x: pd.to_datetime(x, errors="coerce", dayfirst=True).date())
# Filter for the final medical treatment for all deceased patients
dfa = dfa[dfa.STOP==dfa.DEATHDATE]
dfa.sort_values(by="STOP", inplace=True)
dfa.drop_duplicates(subset="PATIENT_ID", keep="last", inplace=True)
# Create the DEATHAGE 
# This is a VERY approximate age at death calculation
dfa["DEATHAGE"] = dfa.DEATHDATE.apply(lambda x: x.year) - dfa.BIRTHDATE.apply(lambda x: x.year)
# And draw the DEATHAGE distribution, divided into n bins
dfa["DEATHAGE"].hist(bins=10)
Figure 4.4: Synthea synthetic patient data distribution at age of death
Exercise

The remainder of this aggregation is left as an exercise for you. Your objective is to create a table with the following fields:

  • Ages divided into 10 ranges;
  • Count of specific reasons for hospital visit (from DESCRIPTION) for each age range, and decide how to present this;
  • Total of all the claim costs for each age range;
  • Median of all the claim costs for each age range;

Present this as a dataframe.

4.3 Analysis: investigating observed effects without a known causal mechanism

Apply different sampling methods to assess normal and binomial distributions. Identify appropriate measurements to research causal mechanisms and observed effects.

When we are challenged with studying any real-world phenomenon, be it a one-time event or some ongoing process, our first task is to conceptualise some way to classify the nature of the phenomenon so that we may measure it.

For any set of events, we must define what to measure, and how to measure it.

We may investigate people experiencing an illness, or fish stock viability, or the commercial success of a shopping centre in a town. In each case, we need to define what is meant by illness, viability, or success. We also need to compare our study population to something else to get a relative measure of the difference between the group experiencing the phenomenon, and a similar population that appears unaffected.

In the case of illness, we could define a discrete set of variables - a person is, or is not ill. Viability could be that any female fish is able to produce some number of fertilised eggs. For success, we could state that a shopping centre is successful if during some period of time only a small number of shops are unoccupied by a trading business.

In each of these cases, we can be guided by existing research to support the definitions for the measurements we choose, but we’re doing this without knowing how probable the events we’re measuring are likely to be, or what the frequency of those events is within the population of our sample.

For now, though, we need to consider how we design our measurements so that we can generate the data we will use to plot these distributions.

4.3.1 Investigating causality in research design

In 1837, cholera broke out in India and spread around the world. There was a global pandemic and, over the course of successive waves that only ended in 1863, millions died. In England, 1854 was the worst year, with 23,000 people dying.

At the time, the leading theory for what caused it was miasma, literally “bad air”: the theory that dirty particles and poison rose into the air and caused disease. The germ theory of disease was in development, but not yet accepted. It wouldn’t be until 1861 that Louis Pasteur would present a complete germ theory of disease; he would later go on to develop some of the most important vaccines.

“On the Mode of Communication of Cholera” by John Snow

Dr John Snow was an English physician based in London, and opposed to the miasma theory. He believed that cholera was spreading via London’s competing water suppliers, and that the pollution in the Thames River - from where most people drew their supply - was the cause.

Very famously, he plotted the home location of individual fatalities and demonstrated that these could be linked directly to water drawn from the Broad Street pump, which was a well. In so doing, he developed not only an early form of the natural experiment, but also demonstrated how a multitude of data - physical patient location, civil water infrastructure, and maps - could be used together to build a testable, evidence-based hypothesis for infectious spread.

He was the first person to recommend boiling water as a means to kill pathogens.

He started with a hypothesis that water was the vector of disease, but his approach would have delivered meaningful insight even without that hypothesis. By studying individual behaviour, and assessing what they all shared, he could potentially trace the source of cholera. He may have had to include more measurements on his map, but the approach is sound.

There is an ever-lengthening list of observed effects for which we have no known causal mechanism. Aetiology is the study of causation or origination, and effects without known causes are, unsurprisingly, of unknown aetiology. In disease research, such conditions are called idiopathic.

How do we move from an observation to a replicable explanation of the causal mechanism?

4.3.1.1 Methods for induction and falsifiability of hypotheses

An initial research question usually follows an observation, and any decision to research further requires us to decide what to measure, and how to measure it. These choices may be limited by the nature of the observation.

Any criminal investigation begins with a process of recording matters of fact, and establishing the motive, means, opportunity and intent of any prospective suspects to commit the crime. In some complex cases, investigators may even re-enact the crime to see whether their hypotheses are correct.

The challenge from a research perspective is that we can’t necessarily round up suspects, because that implies we have some idea as to the causality behind our research question. Testing ideas based on an acute observation of the world, a knowledge of how everything pertaining to our observation interacts - induction - may require knowledge and time we do not have. And, as you’ve seen in our sampling and randomisation experiments, you may select a biased sample to study that either erroneously confirms or refutes your hypothesis.

The nature of standpoint and randomness means that there is unlikely to be a single experiment that can settle a research question absolutely and for all time. Some new insight or observation may upend any historical evidence. Nor can we generalise from a specific set of observations to all observations. One chicken that died of botulism does not imply all dead chickens died the same way. Conversely, one chicken that did not die of botulism doesn’t refute the risk of botulism either.

David Hume, a Scottish philosopher, declared that “there can be no demonstrative arguments to prove, that those instances, of which we have had no experience, resemble those, of which we have had experience.” (Hume 1740)

Karl Popper, an Austrian-British philosopher, went further, stating that “non-reproducible single occurrences are of no significance to science. Thus a few stray basic statements contradicting a theory will hardly induce us to reject it as falsified. We shall take it as falsified only if we discover a reproducible effect which refutes the theory”. (Popper 1959)

Popper used the words falsifiable and falsifiability to describe his approach to research. “A theory is scientific if and only if it divides the class of basic statements into the following two non-empty sub-classes: (a) the class of all those basic statements with which it is inconsistent, or which it prohibits - this is the class of its potential falsifiers (i.e., those statements which, if true, falsify the whole theory), and (b) the class of those basic statements with which it is consistent, or which it permits (i.e., those statements which, if true, corroborate it, or bear it out).”

Popper wasn’t saying this is a way to fake research, but saying that it should be possible to test whether a hypothesis is false and that such a false result should then discredit the hypothesis.

That itself is problematic. It’s possible that research measurements weren’t sufficiently precise and created conditions for specifying a hypothesis incorrectly, or that a theory doesn’t predict what you think it does; proving that the prediction is false doesn’t necessarily discredit the theory.

What this provides is a start to creating a method to answer the question:

If \(X\) is a theory that explains a small set of observations \(Y\), how do we test it? And where do we start if we have a small set of observations \(Y\) but no theory \(X\) to explain it?

4.3.1.2 Developing research methods to measure observations

There are a number of initiatives designed to deliver consistent and effective research methods. The Bradford Hill criteria were developed in 1965, and offer nine principles for establishing epidemiological evidence of a causal relationship between a presumed cause and an observed effect. The MAGIC criteria are aimed at statistical analysis.

The most comprehensive, and most widely used, is the Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach, which is used in systematic reviews and clinical practice guidelines. (Schünemann et al. 2013)

Each of these is useful and offers a structured approach, but they are subjective. Neither can they tell you exactly what to do when faced with a set of interesting observations and no hypothesis as to what caused them.

Before we can propose a hypothesis to explain our observations, we need to reduce what we see to a repeatable and objective set of measurements:

  • Subject: who or what is the subject of your observations and why?
  • Access: can you measure your subjects directly, or via an intermediary? Are the subjects human, or are they property where access is controlled by another?
  • Scale: how large is the population of which the subject is a member?
  • Coherence: how coherent is that population? Do they interact regularly, and do they interact with overlapping alternative populations (for example, chickens interacting with wild birds)?
  • Recurrence: is the observation repeating in some way, either in the same subject, or to additional subjects, or is it a one-time observation?
  • Frequency: if the observations are recurring, at what frequency and is that frequency increasing or decreasing?
  • Timing: when does, or did, the observed effect start, and how long does it last?
  • Magnitude: how big is the observed effect on the subject?
  • Consistency: how reliable are the observations, and do they change depending on who or what is observing?
  • Measurement: how easy is it to define the observation and reduce it into something to measure?
  • Objectivity: can the effect be measured directly from the subject, or do you need to measure indirectly via an intermediary, or through some derived effect?
  • Comparison: is there a normal or typical subject not experiencing the observed effect, and which permits comparative measurements?
  • Review: without biasing or prejudicing your approach, is there any existing research published which may offer insight into what you have observed?

If our subjects are human, we can ask them to tell us about the effect. However, if they’re not human, or if they’re human but we can’t find them easily, or if they will only talk through an intermediary (such as a lawyer or doctor), or if there are no impartial human observers of the effect, we’re going to need a different approach. The fewer the incidents, the more measurements we are likely to need from our smaller sample population.

The challenge will always be, though, that we need to measure the effect without unconsciously imposing any initial expectations on what we choose to measure. If you’ve decided in advance that something causes an observed effect, you may inadvertently not measure things that contradict that theory and explain the actual cause.

You need to isolate the effect you wish to measure from other noise to ensure you have a clear signal. There are numerous ways to do so, and you can review Section 2.2.2.2.

We also need a language to describe types of measurements of random variables, our methods to sample from our study population, and the range of probabilities that possible random variables occur.

4.3.2 Sampling methods

How we define what we measure, and the methods we use to produce a subset of the population to measure, are entirely up to us.

Earlier we learned how to select the population from amongst simple, stratified or clustered approaches. Now we need to consider the actual method for taking a subset from that population.

The foundation for calculating the total number of possibilities available to us from a sample is based on the multiplication principle. If we perform \(r\) experiments with the \(k\)th experiment having \(n_k\) possible outcomes for \(k = 1, 2, \dots ,r\), then there are a total of \(n_1 \times n_2 \times \dots \times n_r\) possible outcomes for the sequence of \(r\) experiments. (Pishro-Nik 2014)
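As a quick sanity check of the multiplication principle, we can enumerate every possible sequence of outcomes for a few small experiments (the experiment sizes here are made up for illustration):

```python
from itertools import product
from math import prod

# Three hypothetical experiments with 2, 3 and 4 possible outcomes
outcomes = [range(2), range(3), range(4)]

# Enumerate every possible sequence of results across the experiments
sequences = list(product(*outcomes))

# The multiplication principle predicts n1 x n2 x n3 = 2 x 3 x 4 = 24
print(len(sequences))  # 24
assert len(sequences) == prod(len(o) for o in outcomes)
```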

Sampling is the means by which we choose an element from a population. Ordinarily, we randomly choose multiple elements from a population so that we can perform multiple measurements. This can be:

  • with or without replacement where we either “return” a sampled element back to the population with the potential to redraw it as a sampled element, or we exclude sampled elements from future draws;
  • ordered or unordered where, if ordering matters (i.e. \(a_1, a_2, a_3 \neq a_2, a_3, a_1\)), we sample accordingly, otherwise the default is unordered sampling.

This leads to four different sampling approaches, each giving a different probabilistic outcome.

4.3.2.1 Ordered sampling with replacement

In a population of three elements (\(n = 3\)), from where we sample two elements (\(k = 2\)) and the sampled element is returned to the population, there are nine different possibilities: (1,1), (1,2), (1,3), (2,1), (2,2), (2,3), (3,1), (3,2), (3,3).

Since the order is important, the order in which we draw a sample is a factor in calculating the number of possible sample sets. That means the total number of ways we can select \(k\) elements from a population with \(n\) elements is:

\[ n \times n \times \dots \times n = n^k \]
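The \(n = 3\), \(k = 2\) example above can be enumerated directly; `itertools.product` draws ordered samples with replacement:

```python
from itertools import product

n, k = 3, 2
population = range(1, n + 1)

# product(..., repeat=k) yields every ordered sample with replacement
samples = list(product(population, repeat=k))
print(samples)       # (1,1), (1,2), (1,3), (2,1), ..., (3,3)
print(len(samples))  # 9, which equals n**k
```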

4.3.2.2 Ordered sampling without replacement

If no repetition of elements is allowed - if we don’t return elements to be resampled - then we reduce our sample space.

Using the same example as above, we reduce our \(n = 3\), \(k = 2\) sample space to six possibilities: (1,2), (1,3), (2,1), (2,3), (3,1), (3,2).

What that means is that the sample population reduces by one after each draw: we sample from \(n\) elements, then \((n - 1)\), then \((n - 2)\), and so on down to \((n - k + 1)\) elements. This is called a \(k\)-permutation of set \(A\), with the following notation for selecting \(k\) objects from a set with \(n\) elements:

\[ P^n_k = n \times (n - 1) \times \dots \times (n - k + 1) \]

If \(k > n\) then \(P^n_k = 0\), since one of the factors in the product is zero: there is no way to sample more items from a population than there are elements in that population.
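`itertools.permutations` enumerates ordered samples without replacement, and `math.perm` computes \(P^n_k\) directly:

```python
from itertools import permutations
from math import perm

n, k = 3, 2
samples = list(permutations(range(1, n + 1), k))
print(samples)     # (1,2), (1,3), (2,1), (2,3), (3,1), (3,2)
print(perm(n, k))  # 6, matching the enumeration

# Sampling more elements than the population holds yields nothing
print(perm(3, 4))  # 0
```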

There’s a very famous paradox which illustrates this approach to sampling, and it hinges on a very specific definition. It’s called the birthday paradox.

If \(k\) people are in a classroom, what is the probability that at least two of them have the same birthday?

Read closely. This is not the probability that one specific person shares a birthday with anyone else in a group, but that any two different people share a common birthday. You can see (hopefully) that counting the ways in which no birthdays coincide is an ordered sampling without replacement problem, with \(n = 365\) days in a year, and all days assumed equally likely to be the birthday of any specific person.

If \(k > n\) then the probability of the event where two people share a birthday, \(A\), is \(P(A) = 1\) (by the pigeonhole principle, there are more people than there are days), so let’s look at the case where \(k \le n\). Computing \(P(A)\) directly is hard, so it is easier to describe the complement of the event, \(P(A^c)\), the event that no two people share a birthday:

\[ P(A) = 1 - \frac{|A^c|}{|S|} \]

\(|S|\) is the total number of potential sequences of birthdays for \(k\) people given \(n\) days (i.e. it’s an ordered sample with replacement).

\[ |S| = n^k \]

\(|A^c|\) is similar to \(|S|\) with one difference, it’s an ordered sample without replacement, which we’ve also already defined:

\[ |A^c| = P^n_k = n \times (n - 1) \times \dots \times (n - k + 1) \]

Which leads to:

\[\begin{equation} \begin{split} P(A) &= 1 - \frac{|A^c|}{|S|} \\ &= 1 - \frac{P^n_k}{n^k} \end{split} \end{equation}\]

We can run this and calculate it:

Code
# set k people in the classroom as a range from 20 to 25, and n = 365 days

def birthday_probability(n, k):
    """
    Returns the probability that at least two of k people share a birthday.
    Where n is the number of days in the year, and k is the number of people.
    
    Args:
        n: int, total number of days
        k: int, number of people
        
    Returns:
        float, calculation of: 1 - [n x (n - 1) x ... x (n - k + 1)] / [n^k]
    """
    p = 1
    for i in range(k):
        p *= (n - i)  # running product for the k-permutation P(n, k)
    return 1 - p / (n**k) # exponentiation in Python is n**k

n = 365
k = range(20, 26)
for i in k:
    print(f"For {i} people, probability of any two sharing a birthday is {birthday_probability(n, i):.2f}")
For 20 people, probability of any two sharing a birthday is 0.41
For 21 people, probability of any two sharing a birthday is 0.44
For 22 people, probability of any two sharing a birthday is 0.48
For 23 people, probability of any two sharing a birthday is 0.51
For 24 people, probability of any two sharing a birthday is 0.54
For 25 people, probability of any two sharing a birthday is 0.57
Figure 4.5

This is why it’s called a paradox. The probability of any two people sharing a birthday is much higher than you might expect for a small group of people. It’s the reason we can’t make assumptions about probabilities: quantities interact in ways that may not seem intuitive.

Compare that to the specific case of one specific person sharing a birthday with anyone else in a group of \(k\) people. Here, the total number of potential birthday sequences for the other \(k - 1\) people is \(n^{k - 1}\) (i.e. all the birthdays besides the person being compared), while the number of ways to choose those birthdays so that none matches the chosen person’s is \((n - 1)^{k - 1}\), which leads here:

Code
n = 365
k = 25
p = 1 - ((n - 1)/n)**(k - 1)
print(f"Probability that any of {k-1} people share a birthday with one specific person is {p:.2f}")
Probability that any of 24 people share a birthday with one specific person is 0.06
Figure 4.6

Compare the two: any two people in a group of 25 share a birthday with probability 0.57, but the probability that one specific person shares a birthday with anyone else in the group falls to 0.06.

Before we go further, let’s improve the way we represent \(P^n_k\).

We use the term \(n!\) to represent n factorial, which is the same as \(n \times (n - 1) \times \dots \times 2 \times 1\). The special case is zero, such that \(0! = 1\). This leads to the general definition that the number of \(k\)-permutations of \(n\) different elements is:

\[ P^n_k = \frac{n!}{(n - k)!} \text{ for } 0 \le k \le n \]

4.3.2.3 Unordered sampling without replacement

Using our simple example of a population of three elements (\(n = 3\)), from where we sample two elements (\(k = 2\)) without replacement and where ordering is not relevant, there are only three different possibilities: (1,2), (1,3), (2,3).

This can be written as from a population of \(n\), choose \(k\) elements, or n choose k sampling, written as \(\dbinom{n}{k}\). This is the approach we’ve used to date.

Unordered sampling is related to ordered sampling: each unordered sample of \(k\) elements can be ordered in \(k!\) ways, meaning:

\[ P^n_k = \dbinom{n}{k} \times k! \]

Meaning the number of \(k\)-combinations of \(n\) different elements:

\[ \dbinom{n}{k} = \frac{n!}{k! (n - k)!} \text{ for } 0 \le k \le n \]

Noting that permutations are ordered samples and combinations are unordered samples.
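`math.comb` and `itertools.combinations` let us verify both the count of unordered samples and the \(k!\) relationship between permutations and combinations:

```python
from itertools import combinations
from math import comb, factorial, perm

n, k = 3, 2
samples = list(combinations(range(1, n + 1), k))
print(samples)  # (1,2), (1,3), (2,3)

# Each unordered sample of k elements can be ordered in k! ways
assert perm(n, k) == comb(n, k) * factorial(k)
print(comb(n, k))  # 3
```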

4.3.2.4 Unordered sampling with replacement

As before, with a population of three elements (\(n = 3\)), from where we sample two elements (\(k = 2\)) with replacement and where ordering is not relevant, there are six different possibilities: (1,1), (1,2), (1,3), (2,2), (2,3), (3,3).

We can generalise this by defining \(x_i\) as the number of repetitions of elements at the \(i\)th position from a population of \(n\) elements, where the sum of all repetitions equals \(k\):

\[ x_1 + x_2 + \dots + x_n = k \text{ where } x_i \in \{0, 1, 2, 3, \dots, k \} \]

We can rewrite our toy-problem in this format, where for population \(n = (1, 2, 3)\):

\[\begin{equation} \begin{split} 1,1 \to (x_1, x_2, x_3) = (2,0,0) \\ 1,2 \to (x_1, x_2, x_3) = (1,1,0) \\ 1,3 \to (x_1, x_2, x_3) = (1,0,1) \\ 2,2 \to (x_1, x_2, x_3) = (0,2,0) \\ 2,3 \to (x_1, x_2, x_3) = (0,1,1) \\ 3,3 \to (x_1, x_2, x_3) = (0,0,2) \end{split} \end{equation}\]

If we sample \(k\) times from population \(n\), with replacement, we could draw the same element up to \(k\) times. Counting the distinct unordered outcomes is the same as counting the solutions of the equation above. Picture each outcome as \(k\) identical markers divided into \(n\) groups by \(n - 1\) dividers: any arrangement of those \(n + k - 1\) symbols is determined by choosing which \(k\) positions hold markers.

That means we can define unordered sampling with replacement as being:

\[ \dbinom{n + k - 1}{k} \]
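`itertools.combinations_with_replacement` enumerates exactly this case, and the count matches \(\binom{n + k - 1}{k}\):

```python
from itertools import combinations_with_replacement
from math import comb

n, k = 3, 2
samples = list(combinations_with_replacement(range(1, n + 1), k))
print(samples)             # (1,1), (1,2), (1,3), (2,2), (2,3), (3,3)
print(comb(n + k - 1, k))  # 6, matching the enumeration
```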

4.3.2.5 Sampling summary

If you want to work through examples of these approaches, please review Introduction to probability, statistics, and random processes by Hossein Pishro-Nik (Pishro-Nik 2014). For now, here is a summary table of all the sampling methods:

  • Ordered sampling with replacement: \(n^k\)
  • Ordered sampling without replacement: \(P^n_k = \frac{n!}{(n - k)!}\)
  • Unordered sampling without replacement: \(\dbinom{n}{k} = \frac{n!}{k! (n - k)!}\)
  • Unordered sampling with replacement: \(\dbinom{n + k - 1}{k}\)

Any research sampling from a population must consider not only the way of structuring the population which will be sampled, but also the method by which samples from that population structure will be drawn.

4.3.3 Discrete and continuous probability distributions

A probability distribution is a measure of the probabilities of possible values for a random variable chosen from a sample space. The sample space, \(\Omega\), is a set of all possible outcomes of the variable being measured. In a coin-flip, for example, \(\Omega = \{\text{heads, tails}\}\).

Random variables come in two types: discrete (taking a finite or countably infinite set of values) or continuous (taking an infinite set of values within a defined interval).

If a die has six sides, then the probability of any side occurring is 1/6. If you were measuring neonatal birth weight and specified “normal” as being exactly 3.5 kg, then the probability of any one baby having that weight would be zero. Assuming your scale offers many digits of precision, even babies close to 3.5 kg would likely be a few grams either side of that ideal, and \(1/\infty\) is zero. For that reason continuous measurements are usually described in terms of ranges, e.g. neonatal weights are described as normal if they fall within the 2.5–4.5 kg range in which about 98% of births occur.

When we work with other people’s data, we are compelled to use their definitions for the measurements they chose. If they decided to define age as a discrete variable (e.g. each year is considered as discrete as the number on a multi-sided die) there is little we can do to recover the continuous information that was lost. Similarly, if their continuous variables are defined in ranges that are too broad (e.g. age ranges of 18 to 35, 36 to 65) there is also little we can do.

It does, though, make things easier if you don’t have to define your measurements yourself: the maths is complex, and what you measure limits what you can analyse.

4.3.3.1 Probability Mass Function and Cumulative Distribution Function for independent random variables

If \(X\) is a discrete random variable with a range \(R_X = \{x_1, x_2, x_3, \dots \}\) (being finite, or countably infinite), then the probability that \(X = x_k\) is defined by the function:

\[ P_X(x_k) = P(X = x_k), \text{ for } k = 1, 2, 3, \dots, \]

This is known as the probability mass function (PMF) of \(X\), and gives us the probabilities of the possible values of a random variable.
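As a minimal sketch, the PMF of a fair six-sided die assigns probability 1/6 to each face; a valid PMF always sums to 1 over the range of \(X\):

```python
from fractions import Fraction

# PMF of a fair six-sided die: each face is equally likely
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

print(pmf[3])             # 1/6, the probability that X = 3
print(sum(pmf.values()))  # 1, as required of any PMF
```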

In most research, we tend to deal with more than one random variable at a time - height and mass, for example - measuring each random variable separately, and also testing whether there is a dependency (or correlation) between them.

These variables can be independent or dependent. For now, let’s consider the scenario of independent random variables. Consider two discrete random variables \(X\) and \(Y\). These variables are independent if:

\[ P(X = x \cap Y = y) = P(X = x)P(Y = y), \text{ for all } x, y. \]

This is similar to independent events where, if the events \(A\) and \(B\) are independent, then \(P(A \cap B) = P(A)P(B)\). In general, if two random variables are independent, we can define them as:

\[ P(X \in A \cap Y \in B) = P(X \in A)P(Y \in B), \text{ for all sets } A \text{ and } B. \]

This should be intuitive. If two random variables are independent, then knowing the value of one offers no insight into the values of the other.

This is equivalent to consecutive coin tosses. Knowing the result of one coin toss does not offer any new insight into the result of a subsequent coin toss.

For \(n\) discrete random variables, we can say the variables are independent if:

\[ P(X_1 = x_1 \cap X_2 = x_2 \cap \dots \cap X_n = x_n) = P(X_1 = x_1)P(X_2 = x_2) \dots P(X_n = x_n), \text{ for all } x_1, x_2, \dots, x_n. \]

PMFs can’t be used for continuous random variables. Instead, we can use the cumulative distribution function (CDF) which can define any type of random variable, discrete, continuous, or mixed. Any random variable \(X\) can be defined as:

\[ F_X(x) = P(X \le x), \text{ for all } x \in \mathbb{R}. \]

Where \(\mathbb{R}\) is the set of all real numbers.
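A brief sketch using scipy (already used elsewhere in this lesson): the CDF is defined for a discrete die and a continuous standard normal alike:

```python
from scipy.stats import norm, randint

# Discrete: a fair die is uniform on the integers 1..6 (high bound is exclusive)
die = randint(1, 7)
print(die.cdf(2))   # P(X <= 2) = 2/6, about 0.333

# Continuous: a standard normal is symmetric about its mean of 0
print(norm.cdf(0))  # P(X <= 0) = 0.5
```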

4.3.3.2 Bernoulli and binomial distributions

Defining discrete and continuous variables may be contentious and is often selected for convenience. Depending on the nature of your research question, definition simplifications may either be reasonable approximations, or lead to research failures.

One can define the two sides of a coin as straightforwardly being either of heads or tails. We often do the same with gender, but gender expression is a complex interaction of multiple factors meaning that it is closer to a continuous variable. The same goes for ethnicity, or even something like hair colour. Consider all the different hair colours within the category “brown”.

Designing experiments around discrete variables does simplify analysis, so we need to take care that we don’t force things.

In social and behavioural science, we can design experiments so that, for example, a subject presses - or does not press - a button. Suppose we’re testing memory in rats. If they push the blue button, they get a food pellet and the test is successful. If they push the red button, they get nothing, and the test is a failure.

If the probability of success is \(p\), then the probability of failure can be written as \(q = 1 - p\).

When an individual trial has only two possible outcomes, we have the special case of a binomial distribution called the Bernoulli random variable. While it doesn’t matter which outcome is labelled ‘success’ or ‘failure’, success is conventionally scored as \(1\) and failure as \(0\).

If we have ten trials, with six successes and four failures, then it looks like this:

\[ \text{0 1 1 1 1 0 1 1 0 0} \]

Note that each of these tests is independent. If these were 10 different rats, then which button each pushes does not depend on the behaviour of any of the other rats, assuming we don’t let them tell each other the appropriate response.

The sample proportion, \(\hat{p}\), is the sample mean of these observations:

\[ \hat{p} = \frac{\text{no. of successes}}{\text{no. of trials}} = 0.6 \]

We can also define a mean and standard deviation for the Bernoulli random variable:

\[\begin{equation} \begin{split} \mu = E[X] &= P(X = 0) \times 0 + P(X = 1) \times 1 \\ &= (1 - p) \times 0 + p \times 1 = p \end{split} \end{equation}\]

Similarly, the variance of \(X\) is:

\[\begin{equation} \begin{split} \sigma^2 &= P(X = 0)(0 - p)^2 + P(X = 1)(1 - p)^2 \\ &= (1 - p)p^2 + p(1 - p)^2 = p(1 - p) \end{split} \end{equation}\]

Leading to the standard deviation: \(\sigma = \sqrt{p(1 - p)}\).
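A quick simulation sketch (the sample size and seed are arbitrary choices) shows the sample mean and standard deviation of Bernoulli draws converging on \(p\) and \(\sqrt{p(1 - p)}\):

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded for reproducibility
p = 0.6
trials = rng.binomial(1, p, size=100_000)  # 0/1 Bernoulli outcomes

print(trials.mean())  # close to p = 0.6
print(trials.std())   # close to sqrt(0.6 * 0.4), about 0.49
```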

Now we need to generalise Bernoulli to make it more useful. That is the binomial distribution which describes the probability of having exactly \(k\) successes, and \(n - k\) failures, in \(n\) independent Bernoulli trials with a probability of success \(p\).

\[ p^k(1 - p)^{n-k} \]

This is the probability of one particular sequence of \(k\) successes and \(n - k\) failures. To count all such sequences, we multiply by the number of ways to choose which \(k\) of the \(n\) trials are the successes: unordered sampling without replacement, the equation we developed earlier:

\[ \dbinom{n}{k} = \frac{n!}{k! (n - k)!} \]

Leading to:

\[ P(X = k) = \dbinom{n}{k}p^k(1 - p)^{n-k} = \frac{n!}{k! (n - k)!}p^k(1 - p)^{n-k} \]

Where the mean, variance, and standard deviation of the number of observed successes are:

\[ \mu = np \]

\[ \sigma^2 = np(1 - p) \]

\[ \sigma = \sqrt{np(1 - p)} \]
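We can check the binomial PMF against `scipy.stats.binom` using the rat example of ten trials and six successes (the success probability \(p = 0.5\) is an illustrative assumption, not taken from the experiment):

```python
from math import comb
from scipy.stats import binom

n, k, p = 10, 6, 0.5  # p chosen for illustration only

# The formula: (n choose k) * p^k * (1 - p)^(n - k)
manual = comb(n, k) * p**k * (1 - p)**(n - k)
print(manual)              # 0.205078125
print(binom.pmf(k, n, p))  # the same value from scipy, up to floating point

# Mean and variance of the number of successes
print(binom.mean(n, p), binom.var(n, p))  # np = 5.0, np(1-p) = 2.5
```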

Four conditions to check if your distribution is binomial:

  • The trials are independent.
  • The number of trials, n, is fixed.
  • Each trial outcome can be classified as a success or failure.
  • The probability of a success, p, is the same for each trial.

4.3.3.3 Normal distribution

The principal difference between binomial and normal distributions is that binomials are discrete while normals are continuous.

We have reviewed the normal model previously since it is the most common you’ll see. Curves can differ in centre and spread, with the model being adjusted by mean and standard deviation. Changes to the mean move the centre left or right along the x-axis, while changes to the standard deviation stretch or constrict the curve.

As an example, consider the following curves with different means and standard deviations, plotted to the same scale on the same axes:

Code
# A demonstration of differences in mean and standard deviation
from matplotlib import pyplot as plt

%matplotlib inline

from scipy.stats import norm
import numpy as np

# 1st
mu_1 = 0.0
sd_1 = 1.0
x_1 = np.arange(-5, 5, 0.1)
y_1 = norm.pdf(x_1, mu_1, sd_1)
# 2nd
mu_2 = 19.0
sd_2 = 4.0
x_2 = np.arange(5, 32, 0.1)
y_2 = norm.pdf(x_2, mu_2, sd_2)
# Draw the chart
fig, ax = plt.subplots(figsize=(9,6))
ax.plot(x_1,y_1, color="C0", label="mu 0, sd 1")
ax.plot(x_2,y_2, color="C1", label="mu 19, sd 4")
ax.legend(loc="best", frameon=False)
plt.show()
Figure 4.7: A demonstration of differences in mean and standard deviation

The notation used to describe these normal distributions is \(N(\mu, \sigma)\), where \(\mu\) is the mean, and \(\sigma\) is the standard deviation. The Z-score of an observation quantifies how far that observation is from the mean in units of standard deviation. If \(x\) is an observation from a distribution \(N(\mu, \sigma)\), then the Z-score is defined as:

\[ Z = \frac{x - \mu}{\sigma} \]

If the observation is equal to the mean, then it should be clear that the Z-score will equal 0. Observations greater than the mean have positive Z-scores, and those less than the mean have negative Z-scores. This permits us to compare observations from different normal distributions.

Observation \(x_1\) can be said to be more unusual than \(x_2\) if the absolute value of its Z-score is larger than that of the other: \(|Z_1| > |Z_2|\).

The further an observation is from the mean, in either direction, the more extreme it is.
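A short sketch of a Z-score comparison (the observations and distributions are made up for illustration):

```python
def z_score(x, mu, sigma):
    """Distance of an observation from the mean, in standard deviations."""
    return (x - mu) / sigma

# Hypothetical scores from two differently scaled tests
z1 = z_score(85, mu=70, sigma=10)  # 1.5 SDs above its mean
z2 = z_score(45, mu=60, sigma=15)  # 1 SD below its mean

# The first observation is the more unusual one: |z1| > |z2|
print(z1, z2)  # 1.5 -1.0
```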

We’ll review the different types of distribution in later lessons. For now, our two case studies provide an opportunity to consider different approaches to sampling and building a hypothesis.

4.3.4 Research in the absence of a causal mechanism

4.3.4.1 How to observe what you cannot see

Peyronie’s disease is a connective tissue disorder involving the growth of fibrous plaques in the soft tissue of the penis. Please do not click on that link if you don’t wish to see pictures of penises.

There is no known underlying cause for the disease, although the effects include a buildup of inflexible plaques inside the penis, curvature of the penis and painful erections. It is not life-threatening, but it does have significant quality-of-life effects, from mental distress to interfering with sexual fulfillment.

It is also not necessarily an easy disease for sufferers to talk about, especially in the absence of a causal mechanism, and researchers have little idea as to how prevalent the illness is, let alone how to prevent it.

In 2001, researchers from the Department of Urology, University of Cologne, set out to estimate the prevalence of the disease in Germany (Schwarzer et al. 2001). At the time, a very small number of prevalence estimates existed, few of which are available online for review. A study in Rochester, Minnesota diagnosed 101 male patients between 1950 and 1984 who sought medical help for their condition. Their estimate was 388.6 incidents per 100,000 male population. The most recent was a presentation at a Peyronie’s disease conference in 1997, the text of which doesn’t appear to be online.

From these two studies, the University of Cologne researchers start with a prevalence that varies between 0.3% and 1% of the male population.

The researchers decided to study the male population of Cologne, since it’s close by and they are likely to understand the dynamics of that population. Let’s review our questions:

  • Subject: Males resident in Cologne, aged 30 to 80 years old, presenting symptoms of Peyronie’s disease.
  • Access: Subjects are human males, can be contacted directly, or via their medical practitioners.
  • Scale: There are 1.5 million inhabitants in the Cologne area, although prevalence is unknown.
  • Coherence: Since causation is unknown, coherence is unknown. If the disease is sexually transmissible, then sexually active males must be included in the study.
  • Recurrence: Disease appears long-term.
  • Frequency: Unknown.
  • Timing: Unknown.
  • Magnitude: Effect is very obvious to measure.
  • Consistency: Effect is sufficiently obvious that it requires no special training or tools to measure.
  • Measurement: Extremely easy for either the subject, or their medical practitioner to measure.
  • Objectivity: Cannot be measured by researchers directly.
  • Comparison: Low prevalence in general population.
  • Review: Several existing studies identified, dating from as early as 1928, offering prevalence of 0.3% to 1%.

Let’s work through a few ways they could sample their subject population:

  • Directly, by contacting a large proportion of men directly.
  • Indirectly, by contacting medical practitioners and asking about their patients.

Neither option is straightforward, requiring patient consent and medical ethics review. Contacting medical practitioners directly would make the returned data relatively comparable to that of the Minnesota study, but it is important to recognise that it is unknown what proportion of men experiencing Peyronie’s disease actually seek medical help.

The Cologne researchers chose to contact men directly. They sent 8,000 male inhabitants of the greater Cologne area, three questions permitting self-diagnosis of Peyronie’s disease:

  1. Can you feel any hardening in your penis in its flaccid state?
  2. Have you noticed a newly occurring curvature in your penis in its erect state?
  3. Have you experienced or do you currently experience pain in association with erection?

The questionnaire was validated against 182 hospital outpatients who both answered the questionnaire and were then physically examined. The questionnaire itself is simple and straightforward, so as to increase the chance that recipients will complete and return it.

The questionnaire was mailed three successive times to each of the 8,000 men to ensure a reasonable response rate. You can think of this as unordered sampling without replacement (i.e. 8,000 unique attempts to achieve a measurement where order is unimportant).

In return, they received a 55.4% response rate, of whom 3.2% reported symptoms of Peyronie’s disease. The paper describes in detail how they validated these results.

Let’s evaluate what was done and how it could be improved:

  • Bias in the sample returns is unknown, as stated by the researchers. In general, people are more likely to respond to a questionnaire in which they have a personal interest. This is a convenience sample, and it should be expected that the responding population would present a higher incidence rate for Peyronie’s than the non-responding population. To what degree is unknown: there is no direct way to measure convenience bias.
  • There is no question asking whether subjects who self-identified these symptoms ever sought medical help. Do all men with these symptoms seek help? What proportion do?
  • A parallel study could have asked all urologists for data on their patients, categorised by age, marriage status and the other demographic data sought in the questionnaire.

This is not to make light of the work involved in sending out 8,000 letters, plus follow-ups, plus data capture and validation. However, this research significantly raises the prevalence estimate from the established 0.3%–1% to 3.2%.

That is a significant increase. It puts the risk in the same range as breast cancer in women.

Clearly that is a huge claim, and the obvious sources of bias in their sample should be validated against an alternative sampling approach.

We can contextualise their data, though. Assume some bias in response: if 80% of symptomatic recipients of the questionnaire chose to respond, then the real proportion is closer to this:

\[ \frac{142}{8000 \times 0.8} = 2.2\% \]

Let’s make a huge assumption that all men in pain seek medical help. Of the 142 positive respondents, 66 reported pain. If that is 100% of the available subjects in the sample, then the prevalence of men with Peyronie’s symptoms sufficiently severe to cause pain is about 0.8%. If it is 80%, then it is 1%. If it is genuinely the sample distribution, then it is 1.5%.

In other words, we get reasonably close to previous research which indicates that 0.3% to 1% of men seeking medical help have Peyronie’s disease.
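The back-of-envelope adjustments above can be reproduced directly; the 80% response figure is, as in the text, purely an assumption:

```python
mailed = 8000                      # questionnaires sent
responses = round(mailed * 0.554)  # 55.4% response rate -> 4432
positives = 142                    # respondents reporting symptoms
painful = 66                       # positives who reported pain

print(f"{positives / responses:.1%}")       # raw sample proportion, ~3.2%
print(f"{positives / (mailed * 0.8):.1%}")  # if 80% of positives responded, ~2.2%

# Pain-only prevalence under different response assumptions
print(f"{painful / mailed:.1%}")            # if all men in pain responded, ~0.8%
print(f"{painful / (mailed * 0.8):.1%}")    # if 80% responded, ~1.0%
print(f"{painful / responses:.1%}")         # raw sample proportion, ~1.5%
```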

This is sufficient to tell us that the research is plausible, or at least not wildly off. However, we definitely need further research to give an indication of how biased that convenience response sample was.

4.3.4.2 How to experiment with unknown causal mechanisms

In March 1988, cyclone Bola struck New Zealand, causing heavy and prolonged flooding of kiwifruit orchards. Shortly afterwards the vines weakened and began to die off. This happened again the following year, and was considered to be a continuing impact of the previous year’s flooding.

“As waterlogging was ascertained to be the most common factor related to the appearance of kiwifruit decline, suitable crop management systems aiming to improve root aeration, such as soil trunking and convexing, soil conditioning with organic fertilizers, and localized watering with specific amounts of water delivered daily on the basis of crop consumption to maintain the soil field capacity constant, were proposed; however, the onset of this disease was not hampered.” (Bardi 2020)

Gradually, these symptoms and effects have spread around the world, reaching Italy in 2012. The impact has not been small, with 12.6% of Italian orchards destroyed in 2018. Farmers in Italy began calling the disease morìa, or “die-off”.

The key problem is insufficient oxygen availability to vine roots. What is not known is a specific set of causal mechanisms. Soils that are carefully prepared to ensure drainage still suffer. Micro-organisms which cause kiwifruit disease are not always present during the disease and are thought to be a secondary infection. Temperature may be a cause, but it’s unknown how or what range.

The range of potential causes, including multiple interactions of these causes, includes water, heat, and micro-organisms.

In 2017, researchers at the University of Udine in Italy set up a controlled trial to investigate factors which may cause what they describe as kiwifruit decline. Their work was published in 2020 (Savian et al. 2020).

They collected soil from three sites affected by the disease and then created three different growth environments within controlled greenhouses, planting twenty vines in each experimental environment, along with plants potted in sterile peat to use as a control.

  • Subject: Kiwifruit presenting a collection of symptoms described as kiwifruit decline
  • Access: Subjects are commercial fruit plants, grown extensively in Italy, but observation does not provide a sufficiently controlled environment in which to create comparisons.
  • Scale: Italy produces 15% of the world kiwifruit supply, and more than 12% of that production is affected.
  • Coherence: Flooding, microbes or heat sometimes causes the disease, sometimes together, sometimes independently, and sometimes not at all. A consistently repeated set of causal conditions is unknown.
  • Recurrence: Seasonal. Sterilising the soil and planting a new crop can remove the symptoms, but they return within a few years.
  • Frequency: Annual.
  • Timing: Started in New Zealand in 1988. Started in Italy in 2012.
  • Magnitude: Causes complete vine death.
  • Consistency: Observing the effect is straightforward, although other causes (other kiwifruit diseases) must be excluded prior to a final diagnosis of kiwifruit disease.
  • Measurement: The cause itself is unknown, but the effect can be measured using a range of methods, from observing vine die-off, to soil biomass measurements (dry biomass to root/leaf ratio), structural cellular observations, and micro-organism isolation and identification.
  • Objectivity: Measuring on site is difficult since the conditions are highly variable and any amount of contamination may occur. Controlling different prospective causal mechanisms is impossible where everything is happening at once.
  • Comparison: Comparisons that isolate individual causal mechanisms are not possible on site.
  • Review: Research is continuing, and there is a variety of research but, at this stage, no consistent and replicable causal mechanism has been identified.

Creating an ideal environment to observe the effect, and isolate different causal mechanisms, can be achieved in a controlled greenhouse environment. The researchers decided to do precisely that.

As before, let’s consider the various options available to them. We have a list of variables: soil, water, micro-organisms, heat. To that, they added a new variable: weather. The researchers specifically chose to model the rainfall patterns over a period in which kiwifruit disease was known to occur. That’s still a variation on water, but does also imply temperature variation.

Soil can be sterilised (S) or unsterilised (U). Heat can be provided (H) or left at ambient temperature (A). Finally, soil can be flooded and subjected to waterlogging (F) or watered appropriately (N).

That permits us to create the following growth conditions:

  • SAN - sterilised soil, ambient temperature, watered appropriately
  • SAF - sterilised soil, ambient temperature, water-logged soil
  • SHN - sterilised soil, heated soil, watered appropriately
  • SHF - sterilised soil, heated soil, water-logged soil
  • UAN - microbially active soil, ambient temperature, watered appropriately
  • UAF - microbially active soil, ambient temperature, water-logged soil
  • UHN - microbially active soil, heated soil, watered appropriately
  • UHF - microbially active soil, heated soil, water-logged soil
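Since each factor contributes independently, the eight condition codes above can be generated with `itertools.product` (using the single-letter codes as defined in the text):

```python
from itertools import product

soil = ["S", "U"]         # sterilised / unsterilised
temperature = ["A", "H"]  # ambient / heated
water = ["N", "F"]        # watered appropriately / flooded

conditions = ["".join(c) for c in product(soil, temperature, water)]
print(conditions)       # SAN, SAF, SHN, SHF, UAN, UAF, UHN, UHF
print(len(conditions))  # 2 x 2 x 2 = 8
```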

Think of this as the multiplication principle: two choices for each of three factors gives \(2 \times 2 \times 2 = 8\) conditions. The Udine researchers did not control for temperature, leaving them only with the following combinations: SN, SF, UN, and UF (the two-letter codes are swapped in the paper, so be careful when reading).

What’s interesting about the Udine paper, immediately, is that they did not run an SN experiment. They didn’t check whether appropriately watered, sterilised soil taken from the sample farms would result in kiwifruit disease. That seems an oversight since there is no guarantee that they genuinely have isolated all conditions leading up to the effect, and reproducing the original conditions would be a useful control. Instead they chose a control that was simply peat.

The experiment itself was designed to mimic growing conditions - with the exception of temperature control - over the 2012 to 2016 period to create a simulacrum of the conditions that appeared to give rise to the disease. Flooded soils were treated using these data, including frequency and volume of watering. They ran the experiment over two successive growing seasons (2017 and 2018).

Their results indicate that both waterlogging and microbial activity are required. Nor was it one specific microbe: a combination of microbes interacted to produce the effect.

There are a few questions we can ask:

  • What was the temperature over the period? If temperature wasn’t controlled, how can they be sure that it isn’t also a factor?
  • If they start with sterilised soil and introduce the microbes identified as being causative, and subject that soil to waterlogging, can they induce the same effect?

This experiment also doesn’t answer how the disease has occurred even in vineyards not experiencing waterlogging.

Research is iterative. We build on what went before. These researchers have created a useful model for how to mimic actual conditions in a controlled environment and test a variety of causal mechanisms without needing to know what the actual causes are. It is up to future researchers to fill in the gaps by conducting their own experiments.

And, as far as being a researcher is concerned, that’s tremendous.

4.4 Presentation: research design for discrete and continuous measurements

Differentiate between design approaches for measuring effects on data subjects.

Theory - mathematical equations, proofs and analysis - can all be reduced to code, much as we’ve done throughout this course. The data we have used is collected during the measurement phase of a research project.

As a manager or lead researcher, you are frequently at the receiving end of data collected by others and can do little about the gaps and weaknesses in these data. However, you are not merely a bystander. It is your task to interrogate a study protocol and assess whether it conforms to appropriate measurement methodology, and whether what is claimed is what was asked. One of the most common ways in which research goes wrong starts with the report forms used to document a chain of information. These forms should ensure that there are answers to any questions you may have, and that any assumptions, local research sources and decisions are documented.

It is the data scientist’s responsibility to reduce all procedures describing data flow, data entry, data processing and quality validation to a written set of instructions. The data scientist must also ensure that only appropriately certified individuals have access to the data, and only to what is necessary to perform their tasks.

How you decide to go about measuring your sample data will depend on the nature of your study and data subjects.

4.4.1 Selection of research design

There are numerous data gathering methods:

  • Qualitative research is a more inductive approach to data gathering, where researchers build up insight from exploration, interaction, activities or events to build a depth of understanding. Such observation can be subjective and is highly reliant on the skill, experience or definitions and methodology of the individual researchers themselves. Another researcher may interpret their investigation differently. This approach is best for going deeply into describing or explaining an observation.
  • Quantitative research is based on a more objective measurement of variables and examining the relationship between these variables to reveal patterns, correlations and causal relationships. Such research should be replicable and less open to interpretation, although what is measured and how it is measured is still subjective. This is where the majority of statistical data arise; the approach aims to produce repeatable explanations at scale.
  • Mixed methods research involves some synthesis of qualitative and quantitative methods to reinforce the strengths of each and counter their weaknesses.
  • Community-based participatory research is a more collaborative approach, building not only long-term community or stakeholder knowledge (such as research into traditional medicines or environmental management) but also “citizen science”, where the ready availability of sensors and devices means that sometimes the largest body of information may reside in the work of enthusiastic amateurs.
  • Review and meta-analysis is the search, aggregation and review of all the existing research literature on the effect you are observing. Such approaches, especially in terms of meta-analysis, have become a powerful way of revealing new insight from what can be collections of low-powered studies.

Research can involve any and all of these approaches. Qualitative research often gives rise to a vast body of text which is still accessible to data scientists using natural language processing techniques. Measurements are not only numbers in a table, but can include audio interviews, photographs, even thousand-year-old written descriptions in ancient journals.

Research design using these methods is a deep field of study, and you are referred to the recommended texts. Creswell’s Research Design (Creswell and Creswell 2017) and Leavy’s Research Design (Leavy 2017) each cover similar ground. The books are proprietary, but the two links (if they’re still up) cover this topic in detail. How you decide on an approach is very much determined by where you are in the research process, and how you have responded to the investigation questions in the Analysis section.

Purely by way of example, a chain of events could start with a farmer spotting a patch of rotting vegetation in the middle of a field of commercial crops and contacting her seed distributor, who in turn contacts a nearby university agricultural research team. A researcher ends up wandering around a field taking pictures of what they see, then posting those on a web forum, getting similar photographs from a community forum, plotting those on a map, noticing it may follow a bird migration path, then contacting a bird-watching society … and you see how this eventually ends up as a study, but it starts with speculative qualitative and quantitative information gathering.

From a presentation perspective, once you need to start gathering information in either depth or scale, you need to convert your research question into a study protocol that describes your entire research process. And, somewhere in there, you will need to reduce the measurements you require to a usable form. That is more likely to be completed via some digital device, but there are still many situations where a paper form is the only accessible choice.

4.4.2 From research protocols to report forms for data collection

Quantitative data collection crosses a line from where the researcher manages measurements directly, to where they hand off that responsibility to others. A research report form is designed to collect the subject data in a research study; its development represents a significant part of the study and can affect study success.

Presentation is not only about how people read and digest analysis, but also how they support your research. A poorly-designed form can be frustrating and confusing, and result in frustrating and confusing source data.

The research report form should be concise, self-explanatory, and user-friendly. Along with the research report form, the filling instructions (called the completion guidelines) should also be provided to study investigators for error-free data acquisition. Annotations for the variables are named according to the specific regulatory requirements governing the study or according to conventions followed internally. Annotations are coded terms used in database management tools to indicate the variables in the study.

Nothing is obvious, and you are not there to explain things in real time.

Real people must either manually collect individual measurements, or set up and connect electronic sensors which autonomously collect individual measurements, and place these in a database for researchers to use.

Each decision of what to measure, and how to measure it, is a technical and subjective choice. The process of developing an implementation plan for your research project will result in a research protocol. This protocol will result in a set of forms - digital or paper-based - to capture the information required in your study. (Bellary, Krishnankutty, and Latha 2014)

When collecting quantitative data, there really are only two categories of measurement: discrete and continuous.

Discrete measurements may be categorical (True/False, or a specific member of a list), or range-bound (age 18 to 25, or weight 50 to 60 kg). Continuous measurements must specify a level of accuracy required in significant digits, as well as units of measurement.
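One way to make those two categories concrete is to define each field’s constraints up front, so validation happens at the point of data entry. A hypothetical sketch (field names, ranges and precision are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class DiscreteField:
    """A categorical field: only the listed choices are valid."""
    name: str
    choices: set

    def validate(self, value) -> bool:
        return value in self.choices

@dataclass
class ContinuousField:
    """A measurement field: bounded range, fixed units and precision."""
    name: str
    unit: str
    low: float
    high: float
    decimals: int

    def validate(self, value) -> bool:
        try:
            v = float(value)
        except (TypeError, ValueError):
            return False  # non-numeric entry rejected outright
        in_range = self.low <= v <= self.high
        precise = round(v, self.decimals) == v
        return in_range and precise

# Illustrative fields: a categorical result, and a weight bounded to 50-60 kg.
result = DiscreteField("test_result", {"positive", "negative"})
weight = ContinuousField("weight", "kg", low=50.0, high=60.0, decimals=1)

print(result.validate("positive"), weight.validate("55.5"))  # True True
print(result.validate("pos"), weight.validate("72.0"))       # False False
```

Nothing outside the declared choices or range is ever accepted, which is exactly the property the form itself should enforce.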

The objective of any research report form is to support answering your research question. Any data collected is in support of that primary function. If you’re asking a question that doesn’t support that function, then you need to evaluate clearly the impact - ethical and technical - that asking that question has.

Note
  • The most important validation step you will take is to prevent transcription errors and ensure data are valid at the point of data entry.
  • It is critical that your measurement collection forms be concise, self-explanatory and user-friendly, and that you test them thoroughly in advance of the study itself.
  • Correcting capture errors after the fact is all but impossible and may result in aberrant results or discarding of patient data.
  • Everything else you will do in terms of validation and data quality management requires that data be entered correctly at the beginning.
  • Ideally there is one - and only one - way to interpret and complete all data input fields.

4.4.3 Data input is a product of data requirements

The primary objective for any measurement process is that your form is completed. If the people expected to complete it find it unusually complex, confusing, time-consuming, or cognitively or physically demanding, they can grind through it filling in spurious data, or not complete it at all.

Poor presentation costs you and your research.

While the requirements are different, online forms are ubiquitous on the internet, and there are numerous guides (1, 2) as to how best design them to ensure user engagement.

Here is an example of well-designed and poorly designed data fields for data entry:

Visual cues in data field design

Note that the well-designed form gives direct and visual cues as to data format, significant digits, as well as the units of measurement. Nothing should be left in doubt. (Bellary, Krishnankutty, and Latha 2014)

Critically, you should also ensure that the person capturing the data is not expected to produce derived / calculated data. For example, age can be calculated from date-of-birth, and body mass index can be calculated from height and weight. There is no need to ask for these to be calculated, and leaving them out reduces both duplication and the potential for data capture errors.
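For instance, rather than asking anyone to work out age or BMI on the form, compute them from the raw fields afterwards. A minimal sketch (the function names are our own):

```python
from datetime import date

def age_years(date_of_birth: date, on: date) -> int:
    """Completed years between date of birth and the reference date."""
    had_birthday = (on.month, on.day) >= (date_of_birth.month, date_of_birth.day)
    return on.year - date_of_birth.year - (0 if had_birthday else 1)

def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms over height in metres squared."""
    return round(weight_kg / height_m ** 2, 1)

print(age_years(date(1990, 7, 15), on=date(2025, 7, 10)))  # 34 (birthday not yet reached)
print(bmi(70.0, 1.75))  # 22.9
```

The form then only needs to capture date-of-birth, height and weight; the derived values are computed once, consistently, for every subject.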

In some places, answers are coded in order to simplify the data collection. When codes are used to obtain an answer for a question, consistency in codes should be maintained throughout the completion guidelines booklet and there should not be any variation in the answer for the same question.

If, for example, yes/no answers are coded as 1 = yes and 2 = no, the same coding should be used throughout the research report form. Nowhere in the same research report form should “1” be coded for “no” and “2” for “yes”.
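In code terms, that means defining the coding once and reusing it everywhere, rather than restating it per question. A sketch, using the 1 = yes / 2 = no coding from the example above:

```python
# Single source of truth for the yes/no coding used across the whole form.
YES_NO = {"yes": 1, "no": 2}

def encode_yes_no(answer: str) -> int:
    """Map a yes/no answer to its study code; reject anything else."""
    key = answer.strip().lower()
    if key not in YES_NO:
        raise ValueError(f"not a yes/no answer: {answer!r}")
    return YES_NO[key]

print(encode_yes_no("Yes"), encode_yes_no("no"))  # 1 2
```

Because every question routes through the same mapping, it is impossible for one part of the form to accidentally invert the coding.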

4.4.3.1 Discrete data fields

Discrete input data means that you decide on the specific choices available to the person filling in the form. Data can be boolean, or selected from a list. In all cases, the data are specifically bounded. Nothing outside of the choices offered will be valid.

Check boxes are always better than term entry, and term entry is better than free text. Plus, give the user guidance. What do you expect the user to do with a space for “Positive” or “Negative”? If you’re not guiding them, you can get anything at all as a response.

While having defined fields can sometimes be unwieldy on a form, you’re opening yourself up to disaster when you permit free-text entry. You’re going to get a range of spellings, a range of different local brand (or off-brand) names, and you may well lose data entirely.

If you have a list, are you expecting one choice? Or multiple choices?

4.4.3.2 Continuous data fields

Measurements - temperature, height, weight, volumes - are continuous data. However, that does not mean any freely-entered text will be acceptable. Data must still be subject to range and quality checks prior to acceptance.

Anything on a form is going to be entered into a database. Fields in a database are always limited in some way, and any free-text fields are instantly unusable for statistical purposes. Such fields are also more subject to ambiguity: everything from units, to rounding errors, spelling mistakes, misremembered terms or names, to confusion over precisely what was required.

While it may appear obvious to you, always include an example next to a field, especially things like dates and times (i.e. what are you expecting to be entered). If you ask a question like “Pain:” be clear as to whether you expect True/False, or a Pain Scale response.

Always use ISO format for dates (i.e. YYYY-MM-DD). This is both better for data analysis, since it permits automatic sorting, and it removes all doubt. The MMM format for dates implies a three-letter abbreviation of the month name. In different countries, across multiple sites and languages, this can lead to unnecessary confusion and inconsistency. It is always best to remove all doubt, and date formats are a perpetual cause of errors in large data series.
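ISO dates have the useful property that lexicographic order equals chronological order, so even plain-text sorting is correct, and malformed entries can be rejected at the point of capture. A quick sketch using the standard library:

```python
from datetime import date

entries = ["2017-11-03", "2016-02-29", "2018-07-01"]

# ISO 8601 strings sort correctly as plain text ...
assert sorted(entries) == ["2016-02-29", "2017-11-03", "2018-07-01"]

# ... and parse unambiguously.
parsed = [date.fromisoformat(e) for e in entries]

# A local format like DD/MM/YYYY is rejected rather than silently misread.
try:
    date.fromisoformat("03/11/2017")
except ValueError:
    print("rejected: not an ISO date")
```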

Where you’re collecting ages or numerical data, for example, give the user a set of blocks to fill in. This ensures data are placed within a known location (for machine reading of scanned documents) as well as guiding the user into what is expected from that field.

This is especially true for numerical data. Blocks don’t just guide the user, they also specify the level of accuracy you expect. For instance, patient mass could be presented in round kilograms, or to two decimal places. What are you expecting? Do you think you’ll get it with a free-text field?

Also remember that the method by which a measurement is taken may vary, and that inconsistency may affect your results. How is temperature measured? Using what device? How is it calibrated? The same goes for mass and height.

A blank space, with no guidance as to what you expect (e.g. “Subject:”), invites responses so random as to make the field unusable. Further, avoid subjective questions which invite an individual analytical response, especially when you offer only True/False as a response to an ambiguous question, e.g. “Female: child-bearing potential Y/N”. This invites ambiguous responses. What are your criteria for “child-bearing”? If you absolutely need such criteria, age or menstruation frequency / date of last menstruation may be more useful.

4.4.3.3 Personal data and context sensitivity

Be wary of including personally identifying information in a field. This is especially critical for subject names. Other metadata (such as race, or sexual orientation) can also be sensitive, or potentially hazardous for subject safety. These either need to be coded anonymously, or not collected at all unless there is clear research guidance that such data are relevant. To be entirely unambiguous: research staff themselves sometimes hold personal biases against certain population groups and may - consciously or unconsciously - sabotage your study by treating a particular subject differently from others. To ensure neutrality, you need to remove any potential for bias. That means, unless it is specifically relevant to your study, do not collect it.

Identifying data also include things like Social Security numbers, government ID, hospital/patient medical record numbers, insurers. Almost any reference from outside the study brings metadata into the study. Be very, very careful as that can unblind your data and put subject safety at risk.
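Where an external identifier must be linked to study records, one common approach is to replace it with a keyed pseudonym, with the key held by a data custodian separately from the research data. A minimal sketch using an HMAC (the key handling here is illustrative only, not a complete security design, and the code format is invented):

```python
import hmac
import hashlib

# In practice this key lives with the data custodian, never in the dataset.
STUDY_KEY = b"replace-with-a-secret-key-held-by-the-custodian"

def pseudonym(external_id: str) -> str:
    """Derive a stable subject code from an external identifier.

    Without the key, the code cannot be traced back to the identifier.
    """
    digest = hmac.new(STUDY_KEY, external_id.encode(), hashlib.sha256).hexdigest()
    return "SUBJ-" + digest[:12].upper()

# The same record number always maps to the same code ...
assert pseudonym("MRN-0042") == pseudonym("MRN-0042")
# ... while different record numbers map to different codes.
assert pseudonym("MRN-0042") != pseudonym("MRN-0043")
```

The research dataset then contains only `SUBJ-…` codes; the mapping back to medical record numbers exists only in the custodian’s key store.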

Be very aware of the geographic and cultural context where your data will be collected.

Note for Americans: Fahrenheit and the MM/DD/YYYY date format are used almost exclusively in the US. If your study is aimed at any other country, these formats will cause confusion.

Always assume that across multiple sites, when staff may be distracted or overworked, mistakes will happen. Always make it easy and check your assumptions.

It’s critical to treat a research report form as a single continuous dialogue with the person filling in the form. It needs to have a single, consistent approach, and it needs to not force the user to continuously shift between page orientations, or even text directions. They’ll get tired figuring it out, and a tired person will make mistakes capturing data.

Remember: nothing is obvious, and you are not there to explain things in real time.

Your most challenging task is to avoid ambiguity. Something obvious to you is not obvious to others. The people who will be completing these forms are busy. They have numerous matters on their mind and, while they are professionals, the longer and more complex your form, the more effort it takes and more fatiguing it is to complete. You want them to enjoy supporting your research, not dread it. The more straightforward and easy to understand your forms, the more consistent and high quality will be the responses.

4.5 Group tutorial

Exercise

It is the weekend. Your kids need you to pick them up in two hours and, after that, you won’t have time till Monday morning. However, you’ve been meaning to get to a research report that was emailed to you on Friday afternoon. Your boss wants to know if it helps unlock a health crisis you’re dealing with, and she needs you to sort out what should be done. You have just received one of the following two research papers to review and summarise. You have two hours to consider the value of the paper and will then have 15 minutes to present to your boss. You may prepare one slide or visual aid which may, or may not, be used in the meeting.

  • Balasooriya, S., Munasinghe, H., Herath, A.T. et al. Possible links between groundwater geochemistry and chronic kidney disease of unknown etiology (CKDu): an investigation from the Ginnoruwa region in Sri Lanka. Expo Health (2019). doi.org/10.1007/s12403-019-00340-w A direct link to the PDF for this paper can be found here.
  • Anne Perrocheau, Freya Jephcott, Nima Asgari-Jirhanden, Jane Greig, Nicolas Peyraud, Joanna Tempowski, Investigating outbreaks of initially unknown aetiology in complex settings: findings and recommendations from 10 case studies, International Health, Volume 15, Issue 5, September 2023, Pages 537–546, doi.org/10.1093/inthealth/ihac088

What do you want this person to know about the paper results? What do you want them to do about it?

You are also welcome to conduct your own review on a paper of your choice. There are a host of publication repositories which support redistribution of published research. Here are a few and you can explore them to run your own reviews:

  • Europe PMC Europe PubMed Central is an open-access repository of biomedical research works
  • PubPeer for post-publication peer review
  • SciHub for open access research publication
  • Google Scholar for search into a wide range of papers. NOTE, most won’t be open access.

Look for search terms including “unknown etiology” or “idiopathic disease”.

4.6 References