3 Probability, randomness, and the risk of de-anonymization
Research outcomes must be better than guessing
Data and research are powerful tools but they can be easily discredited. Not everyone can replicate published research and, even if they could, the time invested would be enormous.
In most cases, we need to trust that the assumptions informing a research question are valid. Where that trust is broken, it doesn’t only undermine that body of research; it risks eroding public trust in all research. If the scientific process is no more than guessing, then anyone’s opinion is as valid as anyone else’s.
The objective of this lesson is to gain confidence in using probability and statistical measures to evaluate causality.
Session 3 lecture, Wednesday 9 July 2025
Research Question
The predictive value of diagnostic tools depends on test sensitivity to accurately measure what it is supposed to, and its specificity to measure only what it is supposed to. Using only core probabilistic techniques, evaluate two papers (Greco et al. 2001), (Rafferty et al. 2013) and assess whether the tests they suggest are sufficiently accurate to use as general diagnostic tools.
Which, if any, would you trust, and under what circumstances?
The two papers we will refer to are these:
Greco Marco, Crippa Flavio, Agresti Roberto et al., Axillary Lymph Node Staging in Breast Cancer by 2-Fluoro-2-deoxy-d-glucose–Positron Emission Tomography: Clinical Evaluation and Alternative Management, JNCI: Journal of the National Cancer Institute, vol. 93, number 8, pp. 630–635, April 2001. doi:10.1093/jnci/93.8.630
Rafferty Elizabeth A., Park Jeong Mi, Philpotts Liane E. et al., Assessing Radiologist Performance Using Combined Digital Mammography and Breast Tomosynthesis Compared with Digital Mammography Alone: Results of a Multicenter, Multireader Trial, Radiology, vol. 266, number 1, pp. 104–113, January 2013. doi:10.1148/radiol.12120674
Don’t worry if you don’t understand them on first read. Our objective is to learn how to evaluate research even when you are not immersed in the topic.
3.1 Ethics: fallacies, randomness, and treating intuition with skepticism
Determine the implications in the collection, mining and recombination of open- and digital data.
According to Tyler Vigen, the rate of divorce in Maine, the northernmost US state, is almost perfectly correlated with the state’s per capita consumption of margarine:
Divorce vs margarine consumption
Perhaps you have strong feelings about margarine and this confirms your concerns about its capacity to undermine relationships? Perhaps you think this is just random chance?
Vigen runs the website Spurious Correlations, where he satirises the very human obsession with finding patterns where none exist. There you will find that the age of Miss America is related to steam-related deaths, or that swimming pool drownings relate to nuclear power production.
Apophenia is the tendency to mistakenly see connections and meaning between unrelated things. All of Vigen’s charts are spurious correlations: patterns of numbers that happen to coincide but don’t have any broader meaning, no different from two strangers walking side-by-side down a street who will shortly head in different directions to different destinations.
An objective of the scientific method is to expose mistaken tendencies and ensure they are excluded from any research process. Bad reasoning is challenging since it requires that the people responsible for defining a research question, and carrying out an investigation, admit and accept that their assumptions must be open to challenge.
3.1.1 Fooled by fallacies
An argument, or hypothesis, that does not work is a fallacy (Baggini and Fosl 2007). There is an enormous list of fallacies, and researchers should be aware of how they arise and how to mitigate them in a research process. As an example, an appeal to probability assumes that something can be taken for granted because it’s likely to be true. Another is affirming the consequent, which has the following structure, where two premises lead to a conclusion:
If p, then q
q
Therefore, p
You can have a set of premises - each of which could well be true - but where the assumed causal connection between the premises does not support the conclusion. For example:
People with cancer and undergoing chemotherapy lose their hair.
This person has no hair.
Therefore, this person has cancer.
Or, a common approach to this fallacy:
Untrustworthy people have pointy eyebrows.
This person has pointy eyebrows.
Therefore, this person is untrustworthy.
Fallacies often arise from claims to intuition. A person may state that they have a feeling that a particular set of facts is causally related despite an absence of evidence. If they have sufficient authority - power to assert their will over others regardless of the others’ desires - then they may impose these fallacies on society.
In 1948, the National Party was elected to power in South Africa and immediately imposed its policy of Apartheid on the nation. This was a system of institutionalised racial segregation where society was stratified according to race; “white” people accorded highest status and “black” people placed at the bottom where they were stripped of their citizenship, right to own property, vote, or even receive an education.
Since racial classification was crucial to such a system, in 1950 the Population Registration Act was introduced along with a suite of tools designed to provide rapid racial classification tests.
If a government official felt unable to make a racial classification by observation, they would insert a pencil into a subject’s hair. If it fell out, that person was classified as “white”. If it remained in place, that person was classified as “coloured”, a social status above “black” but below “white”.
The impact of these classifications could tear families apart.
One story illustrates the effect on millions of South Africans. When Sandra Laing was 10 years old, despite having “white” parents, she was classified as “coloured”. She could no longer attend the same school as her brothers and was eventually ostracised by her family. Even after Apartheid ended in 1994, her brothers continued to refuse to associate with her.
If you are a researcher working on racial classification, you may improve on the accuracy of the pencil test but the foundation of that work is based on two fallacies: that human race is a biological rather than political construct; and that racial classification is a determinant of human worth.
A tool developed for the purpose of implementing intentional unethical outcomes is not a neutral object. The people developing that tool are moral agents and are responsible for the consequences of the use of that tool.
3.1.2 Individual autonomy and moral luck
Data scientists have powerful tools at their disposal capable of finding patterns and correlations in what can often - for individual humans - be overwhelming and indecipherable datasets. We are also living in an era with ready availability of vast and diverse data from public repositories.
It is tempting to assume that data are truthful and unbiased, and then to attempt to answer all sorts of controversial and challenging questions and let a machine “sort it out”.
What differentiates the skills and techniques of a researcher producing a better racial classifier from one producing a better fruit sorter? Few would have any ethical concerns with a machine designed to ensure that poor quality fruit are kept out of produce sent to market. Many would take issue with a machine designed to ensure a specific subset of society receives greater social status by virtue of superficial characteristics derived from an accident of birth.
The skills of the people working on each are no different, yet those who enabled Apartheid were charged with crimes against humanity. Where you end up working, and on what types of projects, is a matter of chance, of moral luck.
Your technical skills cannot help you with an ethical decision. Neither are these ethical dilemmas a matter only for historians.
In 2016, research sought to infer criminality from a sample of facial images (Wu and Zhang 2016). The researchers applied probabilistic classification techniques to a sample of 1,856 people, half of whom were convicted criminals, to answer the question as to whether there is “any diagnostic merit of the face-induced inferences on an individual’s social attributes?”
However, computers can’t answer a question of “who looks more criminal”, so the researchers instead created a proxy, measuring specific facial features, like head shape, eye corners, and mouth corners. The computer did not choose these, the researchers did.
Is there some fallacy at work in these choices? Could this fallacy be present also in the faces of the convicted criminals?
The authors caveat their work with the claim that they are investigating whether a computer can model human bias; however, their work does not occur separately from the concerns of their country’s politics.
The researchers work at Shanghai Jiao Tong University in China. The ruling Chinese Communist Party has introduced a Social Credit System based on facial recognition to track and evaluate the population of China in terms of their trustworthiness. People who are classified as “untrustworthy” may be denied social services, including the ability to use public transport.
It is seductive to attempt to infer human behaviour from human appearance, and such research isn’t always produced in service to brutal states. In “Tracking historical changes in trustworthiness using machine learning analyses of facial cues in paintings”, French researchers state that they have designed “an algorithm to automatically generate trustworthiness evaluations for the facial action units (smile, eye brows, etc.) of European portraits in large historical databases.” They claim their results “show that trustworthiness in portraits increased over the period 1500–2000 paralleling the decline of interpersonal violence and the rise of democratic values observed in Western Europe.” (Safra et al. 2020)
Classifying people based on fallacies takes away individual autonomy and imposes a profound ethical choice on the data scientists who enable such systems to function. A researcher may spend an entire career inside a system that supports unethical choices and never be required to account for their consequences, or they may find that systems change, ushering in an era of transparency and accountability.
A facial recognition system that can instantly classify people as “untrustworthy” based on markers such as the distance between eye corners, the angle of eyebrows, or the degree of facial roundness, cannot be said to be answering the question “will this person commit a crime?” unless there is specific research demonstrating beyond any doubt that such characteristics do indicate criminal behaviour.
A process, system or tool which makes a specific prediction based on input data requires that its predictions are based on answerable questions, and that it be provably accurate at such high-speed diagnostics.
3.1.3 Causation and probability
Causality - the influence by which an event, process or state (a cause) contributes to the production of another event, process or state (an effect) - is critical and challenging for research.
We cannot usually say with absolute certainty that a specific cause leads to a specific effect.
Smoking cigarettes leads to an increased risk of lung cancer, and lung cancer leads to an increased risk of death, but smoking in and of itself does not cause death. Tobacco companies used this lack of immediate causation to claim that any research linking smoking to ill-health was purely coincidence: that correlation is not causation.
Motivated bad-faith arguments of this nature have been used to validate fallacies about anti-vaccination, flat-Earthism, and bigotry of all varieties.
Researchers prefer to talk about probabilistic causation: that, all else being equal, there is a greater probability of a cause contributing to an effect than any of the alternatives.
The probability of an outcome is the proportion of times the outcome would occur if the random phenomenon could be observed an infinite number of times. For some phenomena, the proportion is known exactly (e.g. the probability of rolling a given face of a fair die, or of inheriting a gene-specific disease from one of your parents). For others, we don’t know how the phenomenon arises and estimate its likelihood by observation (e.g. an obese person developing heart disease, or a woman in her 40s developing breast cancer). (Dietz, Barr, and Çetinkaya-Rundel 2015)
Unusually for a section on ethics, we’re going to dive directly into probability in terms of mathematical definitions and code.
Ethics grapples with considerations that are inherently inexact or which have ambiguous outcomes. It is essential that the words and concepts we use, and the way they are used, are well-defined, unambiguous and exact. Axiomatic probability gives us a language to explore ambiguous outcomes.
Probability lends itself to simulation and a classic example is to roll a six-sided die multiple times. For this simulation, let \(\hat{p}_n\) be the proportion of outcomes that are \(1\) after the first \(n\) rolls:
Code
from matplotlib import pyplot as plt
%matplotlib inline
import numpy as np

x = []
die_sides = 6
die_value = 1
# Loop, increasing the number of die rolls by 1, up to 100,000 rolls
for n in range(1, 100001):
    # Generate the die value from each roll
    r = np.random.randint(die_sides, size=n) + 1  # since Python is 0-indexed
    # Calculate the proportion of rolls which are 1 out of n rolls
    x.append(np.count_nonzero(r == die_value) / n)

# Plot as a log-scaled line chart
plt.figure(figsize=(12, 6))
# Expected result, i.e. 1/6
e = np.full(len(x), die_value / die_sides)
plt.ylabel(r"$\hat{p}_n$", fontsize=20)
plt.xlabel(r"$n$", fontsize=20)
plt.plot(x)
plt.plot(e)
plt.xscale("log")
Figure 3.1: Probabilistic distribution of die rolls
You’ll notice that the proportion of 1-rolls diverges a great deal from the expected value (\(\frac16\)) for low values of \(n\) (i.e. low numbers of rolls), but gradually converges on the expected proportion. From a distribution perspective, the ideal would be an equal proportion for each die face:
Figure 3.2: Probabilistic distribution of die rolls at low or high repetition
Disproportionate outcomes and outliers are a characteristic of low sample sizes. The more observations we record, the more likely, or probable, that the rate of occurrences will converge on its true probability.
The right-most chart, where that convergence has happened, represents a probability distribution.
Law of small numbers: A tendency for an initial segment of data to have an unusual distribution or be dominated by rare outliers causing any small sample group to run the risk of failing to sufficiently represent an entire population.
Law of large numbers: As more observations are collected, the proportion \(\hat{p}_n\) of occurrences with a particular outcome converges to the probability \(p\) of that outcome.
A probability is defined as a proportion and takes a value from 0 to 1. The notation we use, the probability \(P\) of an outcome \(A\) is:
\[P(A)\]
There will be maths
Any data-driven research process leads to computation for analysis. Formal mathematical notation, axioms and formulae are a concise language which can be converted into programmatic code. Understanding these derivations will not only improve your communication, but also permit you to more easily interrogate and review the research and work of others.
3.1.3.1 Axioms of probability
There are a number of axioms that form a foundation for probability and interpretation of these can aid in understanding experimental design, randomisation, outliers and statistical analysis (Pishro-Nik 2014):
Axiom 1: For any event \(A\), \(P(A) \ge 0\)
Axiom 2: Probability of sample space \(S\) is \(P(S) = 1\)
Axiom 3: If \(A_1, A_2, A_3, ...\) are disjoint events, then \(P(A_1 \cup A_2 \cup A_3 ...) = P(A_1) + P(A_2) + P(A_3) + ...\)
Let’s be clear on what these mean. The probability of any event cannot be less than zero. It may not happen at all, but it can’t happen a negative number of times. The probability of any event within a sample space - all the possible outcomes for our experiment - is one. If you roll a six-sided die, the outcome will always be one of the faces, not something else.
The final axiom says that, for events that are disjoint (i.e. no overlap between them), then the probability of their union is equal to the sum of their probabilities.
A way to understand this is to plot intersections of potential outcomes as a Venn diagram, which shows all possible logical relations between a finite collection of different sets.
Code
from matplotlib_venn import venn3, venn2

# Draw three disjoint sets
# https://github.com/konstantint/matplotlib-venn
v = venn3(subsets=(1, 1, 0, 1, 0, 0, 0))
v.get_label_by_id('100').set_text("A")
v.get_label_by_id('010').set_text("B")
v.get_label_by_id('001').set_text("C")
plt.show()
Figure 3.3: A disjoint Venn diagram
These three sets have no overlap: they are mutually exclusive. If \(P(A) = 1\), then the probabilities of B and C must be zero. You can also think of \(\cup\) as representing a union, or a boolean or operator.
As an example, in a competition with only one winner, then the events \(\text{{A wins}}\), \(\text{{B wins}}\) or \(\text{{C wins}}\) are disjoint since only one can occur at a time. If A has a 20% chance of winning and B has 40%, then what is the probability of A or B winning?
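Applying Axiom 3 answers the question directly. A minimal sketch, using the 20% and 40% figures from the example:

```python
# Win probabilities from the example: only one competitor can win,
# so the events {A wins} and {B wins} are disjoint
p_a_wins = 0.20
p_b_wins = 0.40

# Axiom 3: for disjoint events, P(A ∪ B) = P(A) + P(B)
p_a_or_b_wins = p_a_wins + p_b_wins
print(round(p_a_or_b_wins, 2))  # 0.6
```

Because no two of the events can occur together, the probability of the union is just the sum: a 60% chance that either A or B wins.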
Besides disjoint sets, we can also have sets that are intersections, that is where events can be simultaneously true, as a boolean and operator. The intersection of A and B is represented as \(P(A \cap B)\) or \(P(A,B)\) or \(P(AB)\).
That notation is defined as:
\[P(A \cap B) = P(A,B) = P(AB)\]
\[P(A \cup B) = P(A \text{ or } B)\]
Consider a Venn diagram with an intersection:
Code
# Two overlapping sets, {"X", "Y"} and {"Y", "Z"}
v = venn2([set(["X", "Y"]), set(["Y", "Z"])])
v.get_label_by_id("100").set_text("X")
v.get_label_by_id("110").set_text("Y")
v.get_label_by_id("010").set_text("Z")
plt.show()
Figure 3.4: A Venn diagram intersection
Given the notation we have so far, it should be clear that the probability that any event in A and B occur at the same time (the subset of events Y) is equivalent to \(P(A \cap B)\), while the probability of events \(P(A \cup B)\) can be defined as the General Addition Rule:
\[ P(A \cup B) = P(A) + P(B) - P(A \cap B) \]
\[\begin{equation}
\begin{split}
P(A \cup B \cup C) = & P(A) + P(B) + P(C) - \\
& P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)
\end{split}
\end{equation}\]
This can be extended indefinitely into an inclusion-exclusion principle for \(n\) events \(A_1, A_2, \dots, A_n\):

\[ P\left(\bigcup_{i=1}^n A_i\right) = \sum_{i=1}^n P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \dots + (-1)^{n+1} P(A_1 \cap A_2 \cap \dots \cap A_n) \]
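The two-event addition rule can be verified exhaustively. A minimal sketch, using a single fair die roll with two illustrative events, A (an even roll) and B (a roll greater than 3):

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die
S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # illustrative event: roll an even number
B = {4, 5, 6}   # illustrative event: roll greater than 3

def P(event):
    # Probability as a proportion of the sample space
    return Fraction(len(event), len(S))

# General Addition Rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
lhs = P(A | B)                  # direct: probability of the union
rhs = P(A) + P(B) - P(A & B)    # via the addition rule
print(lhs, rhs)  # 2/3 2/3
```

Both calculations give \(\frac23\): subtracting \(P(A \cap B)\) stops the two faces in the intersection (4 and 6) from being counted twice.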
We can also define the complement of an event. If the probability that an event, A, will occur is \(P(A)\), then the probability that the event does not occur is its complement, defined as \(P(A^c)\). Since \(A\) and \(A^c\) are disjoint and together cover the entire sample space, \(P(A^c) = 1 - P(A)\).
3.1.3.2 Discrete and continuous probability distributions
Dice rolls have a finite range of outcomes and are referred to as discrete probability distributions. If a sample space \(S\) is countable then we can list all the elements in \(S\):
\[ S = \{s_1, s_2, s_3, \dots\} \]
The important thing here is that these outcomes are countable: they can be listed, even if the list is infinite. There are also sample spaces which are uncountable:
\[ S = [0, 1] \]
Where the sample space \([0,1]\) means any real number from 0 to 1. A square bracket \([\) or \(]\) indicates that the interval includes the value at the bracket; a rounded bracket \((\) or \()\) means the interval does not include the value at the bracket.
A practical example of a continuous sample space could be any group of people’s specific heights. This is known as a continuous probability distribution.
In general, when we run experiments on populations, we create subsets of those populations. A subset \(A\) of \(S\) is defined as \(A \subset S\) such that if \(k \in A\) then \(k \in S\) (where \(\in\) means “is an element of”).
Research spans both discrete and continuous sample spaces, and sometimes at the same time. Consider haemophilia, an inherited genetic disorder impairing the body’s ability to produce blood clots, which can lead to excessive bleeding.
X-linked recessive inheritance scenarios for either the mother being a carrier or the father being affected
The gene is carried on the X chromosome in which a clotting factor is nonfunctional. A genetic test can indicate one of a discrete set of outcomes, as indicated in the image, but the expression of that nonfunctional gene will vary. Therefore the expression of the illness is on a continuous probability distribution (from mild to severe), while diagnosis of having the genetic mutation is discrete.
3.1.3.3 Independent and conditional probability
Independent processes are such that knowing the outcome of one provides no information on the others. Knowing a person’s height doesn’t imply anything about their eye colour. Flipping a coin doesn’t tell you what the result of the next coin toss will be.
For independent processes, the probability of a specific sequence of outcomes - such as rolling a one on a die five times in a row - is the product of their separate probabilities. This is the multiplication rule for independent processes:

\[ P(A_1 \cap A_2 \cap \dots \cap A_n) = P(A_1) \times P(A_2) \times \dots \times P(A_n) \]
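A sketch of the five-ones example, with the product checked against a simple simulation (the seed and trial count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility

# Multiplication rule: P(five ones in a row) = (1/6) ** 5
p_exact = (1 / 6) ** 5

# Check by simulation: roll five dice a million times and count
# how often all five come up as 1
trials = 1_000_000
rolls = rng.integers(1, 7, size=(trials, 5))
p_simulated = np.all(rolls == 1, axis=1).mean()

print(p_exact)      # ≈ 0.000129, i.e. 1/7776
print(p_simulated)  # close to p_exact
```

The simulated proportion converges on the exact product as the number of trials grows, exactly as the law of large numbers predicts.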
Conditional probabilities arise when processes have some degree of dependency. They describe how the probability of an outcome varies with knowledge of another factor or condition, and are closely related to the concepts of marginal and joint probabilities.
There are rare forms of haemophilia where the disease is acquired as a result of cancer or an autoimmune disorder but, largely, expressing the disease is conditional on having a nonfunctional clotting gene. What this means is, if you test any member of the public for nonfunctional clotting factor genes, and they do not have such a disorder, their risk of having haemophilia is close to zero. However, if they do have such a gene, their probability of having haemophilia rises considerably.
A marginal probability is a probability only related to a single event or process, such as \(P(A)\). A joint probability is the probability that two or more events or processes occur jointly, such as \(P(A \cap B)\).
When considering the probability of having a disease we don’t work in certainties. You don’t have your 44th birthday and promptly develop diabetes. Age increases the risk, but it is not like rolling a 100-sided die where, if your age comes up, you are ill. The proportion of a population which has a particular illness can be thought of as a probability, and the proportion of an age range which has the disease can reflect a conditional probability:
Table summarising diabetes status as a proportion of the total population, categorised by age group for the US population

                         Diabetes   No Diabetes   Sum
Less than 20 years       0.001      0.277         0.277
20 to 44 years           0.014      0.315         0.329
45 to 64 years           0.043      0.219         0.261
Greater than 64 years    0.036      0.097         0.132
Sum                      0.093      0.907         1.000
If you selected a person from this population at random, the probability that they are aged 20 to 44 and have diabetes is 1.4%. For the 45 to 64 age group, that rises to 4.3%. These are joint probabilities, since \(P(\text{Diabetes} \cap \text{45 to 64 years}) = 0.043\), whereas the marginal probability of anyone in the general population having diabetes is \(P(\text{Diabetes}) = 0.093\).
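A sketch of these calculations in code, using the joint probabilities from the table (the dictionary layout is illustrative, and the computed marginal differs in the third decimal from the table’s 0.093 because the table cells are rounded):

```python
# Joint probabilities from the table: P(age group ∩ diabetes status)
joint = {
    "less than 20":    {"diabetes": 0.001, "no diabetes": 0.277},
    "20 to 44":        {"diabetes": 0.014, "no diabetes": 0.315},
    "45 to 64":        {"diabetes": 0.043, "no diabetes": 0.219},
    "greater than 64": {"diabetes": 0.036, "no diabetes": 0.097},
}

# Marginal probability of diabetes: sum the joint probabilities down the column
p_diabetes = sum(row["diabetes"] for row in joint.values())

# Conditional probability: P(diabetes | aged 45 to 64)
#   = P(diabetes ∩ 45 to 64) / P(45 to 64)
p_joint_45_64 = joint["45 to 64"]["diabetes"]
p_age_45_64 = sum(joint["45 to 64"].values())
p_conditional = p_joint_45_64 / p_age_45_64

print(round(p_diabetes, 3))     # 0.094
print(round(p_conditional, 3))  # 0.164
```

Note the contrast: the joint probability of being 45 to 64 and diabetic is 4.3%, but the conditional probability of diabetes given that age group is roughly 16.4%.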
You can think of \(P(A|B)\) as the probability of A given that B has occurred (i.e. if B, then what is the probability of A?), defined as:

\[ P(A|B) = \frac{P(A \cap B)}{P(B)} \]

You can then draw a tree diagram representing the various probabilities:
                             x P(B|A)
           x P(A)       ┌───────────────── P(A ∩ B)
      ┌──────────── P(A)┤
      │                 │    x P(Bᶜ|A)
      │                 └───────────────── P(A ∩ Bᶜ)
 1 ───┤
      │                      x P(B|Aᶜ)
      │    x P(Aᶜ)      ┌───────────────── P(Aᶜ ∩ B)
      └──────────── P(Aᶜ)┤
                        │    x P(Bᶜ|Aᶜ)
                        └───────────────── P(Aᶜ ∩ Bᶜ)
An indefinite number of probabilities can be chained using the general multiplication rule:

\[ P(A_1 \cap A_2 \cap \dots \cap A_n) = P(A_1)\,P(A_2|A_1)\,P(A_3|A_1 \cap A_2) \cdots P(A_n|A_1 \cap \dots \cap A_{n-1}) \]
We can apply this to an example from Vu and Harrington (2020):
In Canada, about 0.35% of women over 40 will develop breast cancer in any given year. A common screening test for cancer is the mammogram, but it is not perfect. In about 11% of patients with breast cancer, the test gives a false negative: it indicates a woman does not have breast cancer when she does have breast cancer. Similarly, the test gives a false positive in 7% of patients who do not have breast cancer: it indicates these patients have breast cancer when they actually do not. If a randomly selected woman over 40 is tested for breast cancer using a mammogram and the test is positive – that is, the test suggests the woman has cancer – what is the probability she has breast cancer?
This is not this lesson’s research question, but you can see how our foundation in formal probability gives us the language and tools to approach it. For this example, we can define a tree diagram:
If you know that 7% of patients without cancer get a false positive, then the complement of that is \(1 - 0.07 = 0.93\), and so you can see how the tree diagram is constructed. For this example, a randomly selected woman over 40 who tests positive and does have breast cancer has the following probability:

\[ P(\text{cancer} \cap T^+) = P(\text{cancer}) \times P(T^+|\text{cancer}) = 0.0035 \times 0.89 = 0.00312 \]
You can see that in the tree diagram there are two outcomes where our patient tests positive, in one of which she actually doesn’t have cancer (\(0.9965 \times 0.07 = 0.06976\)). The probability of a positive test result - irrespective of cancer status - is the sum of these two branches: 0.06976 + 0.00312 = 0.07288. The probability, given a positive test result, that a randomly-selected patient actually has breast cancer is:

\[ P(\text{cancer}|T^+) = \frac{P(\text{cancer} \cap T^+)}{P(T^+)} = \frac{0.00312}{0.07288} \approx 0.043 \]
Even with a positive test result, our patient only has a 4% chance of actually having breast cancer. Ordinarily, we would only test patients presenting with symptoms, which would change the sample space dramatically, but you can see how a test - something which instinctively you may believe resolves a problem of an unknown situation - can raise ethical concerns.
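The tree-diagram arithmetic can be written out directly. A minimal sketch using the figures from the quoted example:

```python
# Figures from the Vu and Harrington (2020) example
p_cancer = 0.0035                 # P(cancer): yearly incidence, women over 40
p_pos_given_cancer = 1 - 0.11     # sensitivity: 11% false negative rate
p_pos_given_no_cancer = 0.07      # 7% false positive rate

# Total probability of a positive test, summing both branches of the tree
p_positive = (p_pos_given_cancer * p_cancer
              + p_pos_given_no_cancer * (1 - p_cancer))

# P(cancer | positive test): the true-positive branch over the total
p_cancer_given_positive = p_pos_given_cancer * p_cancer / p_positive

print(round(p_positive, 4))               # ≈ 0.0729
print(round(p_cancer_given_positive, 3))  # ≈ 0.043
```

Rounding each branch before summing, as the worked example above does, gives 0.07288 rather than the unrounded 0.07287; either way the conclusion is the same, roughly a 4% chance.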
3.1.3.5 Bayes’ Theorem and the Law of Total Probability
These definitions - the language of probability - are intended to support analysis and communication. When we use words with precise meanings, we’re able to effectively and rapidly communicate ideas where complexity may obscure our thoughts.
Tree diagrams and step-by-step axiomatic derivations are useful while you are learning, but eventually you will work faster with less explanation. It is critical to remember these axioms, though, as researchers are often called upon to explain their work to non-specialists.
The above example can be reduced to a single equation, redefined where \(D = \text{[has disease]}\), \(D^c = \text{[does not have disease]}\), \(T^+ = \text{[positive test result]}\), \(T^- = \text{[negative test result]}\):

\[ P(D|T^+) = \frac{P(T^+|D)\,P(D)}{P(T^+|D)\,P(D) + P(T^+|D^c)\,P(D^c)} \]
The formula in the denominator (below the divisor) is a sum of probabilities, or the total probability of an event. In words, we are saying: “Given the probability of an event (a positive test), what is the total probability of that event in the entire sample space (both having the disease and not having the disease)?”
More generally, Bayes’ Theorem states:

\[ P(B_j|A) = \frac{P(A|B_j)\,P(B_j)}{\sum_{i} P(A|B_i)\,P(B_i)} \]

where the sample space \(S = \{B_1, B_2, B_3, \dots\}\) is partitioned into disjoint events, and \(A\) is any event with \(P(A) \neq 0\). The denominator is the Law of Total Probability.
3.1.4 Intuition, prevalence and skepticism
The UK government operates more than 50,000 closed-circuit television cameras to monitor people moving around the country. In 2017 the British Metropolitan Police ran a pilot project, connecting these cameras to databases of faces and using facial recognition software to identify people on a government watch list who were attending the Notting Hill Carnival.
The Notting Hill Carnival celebrates Caribbean culture in the UK and about 150,000 people attend every year. The police pilot software flagged 104 people as potential suspects. 102 of them turned out to be innocent bystanders.
The software was wrong 98% of the time.
Despite this, the UK government continues to fund pilot projects using facial identity, and these systems continue to perform exceptionally badly.
Intuition is the capacity to know or comprehend something directly, without any need for justification or rational argument to support that belief. There are certain concepts where intuition is all we have, like explaining to someone what the colour yellow looks like.
Our intuition is rarely a good foundation on which to make predictions. Our minds are easily fooled by emergent patterns in random data, and different individuals draw different inferences from the same random patterns, so there can be little agreement as to meaning.
The same failing behind the facial identification system causes the low diagnostic power of the mammography test, and it arises from both the law of small numbers and the law of large numbers.
A rare event in a large population is a rare event. When we wish to test for such an event within a sample space, we must consider the sensitivity and specificity of that test.
Sensitivity is the probability that a test will positively identify the exact event we are testing for.
Specificity is the probability that a test will return a negative result for every non-matching event within the sample space.
The probability of a specific event in a sample space is its prevalence.
Positive Predictive Value (PPV) is the probability that a specific event occurred when a test result is positive.
Negative Predictive Value (NPV) is the probability that a specific event did not occur when a test result is negative.
The tree diagram presented earlier is a chart of the various outcomes from such a test, presenting four outcomes:
True positive (TP), a specific event is correctly identified as a match.
False positive (FP), a random and unspecified event is incorrectly identified as a match.
True negative (TN), a random and unspecified event is correctly identified as not matching.
False negative (FN), a specific event is incorrectly identified as not matching; a genuine match is missed.
When evaluating test sensitivity and specificity, we can simplify the tree diagram down to a confusion matrix presenting only these data.
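These definitions translate directly into ratios of the four confusion-matrix counts. A sketch with hypothetical counts, showing how a sensitive and specific test can still have a poor positive predictive value:

```python
def confusion_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV and NPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # proportion of real events caught
        "specificity": tn / (tn + fp),  # proportion of non-events rejected
        "ppv": tp / (tp + fp),          # P(real event | positive result)
        "npv": tn / (tn + fn),          # P(no event | negative result)
    }

# Hypothetical counts: 90 real events caught, 10 missed,
# 9,900 non-events correctly rejected, 100 false alarms
metrics = confusion_metrics(tp=90, fp=100, tn=9_900, fn=10)
print({k: round(v, 3) for k, v in metrics.items()})
# sensitivity 0.9 and specificity 0.99, yet PPV is only ~0.474
```

With these counts the test looks excellent on sensitivity and specificity, but nearly half of its positive results are false alarms, because real events are rare in the sample.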
Where intuition lets you down is in not considering the prevalence of a specified event within your sample space. You may assume that people presenting themselves for a cancer test have already undergone some selection process and so a test with 93% specificity is incredibly accurate. However, if only 1% of the general population ever suffer breast cancer and you are testing a very large sample space, then a test with 100% sensitivity will quickly be overwhelmed by an insufficiently acute specificity … too many false positives are returned, drowning out the information you were actually looking for.
This problem is compounded as the required specificity increases. Facial recognition requires recognising a single specific individual from amongst a total population. If you wish to track that specific person across multiple locations, then your matching algorithm needs a positive predictive value close to 100% or it will regularly match incorrectly.
Consider a requirement to match one person from a crowd of 100,000 people. If your test had 100% sensitivity and 99.9% specificity, it would still identify 101 people as that specified individual every time. Its positive predictive value would only be 1%.
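A sketch of the arithmetic behind the 101 matches and the 1% positive predictive value:

```python
# One target individual in a crowd of 100,000
crowd = 100_000
sensitivity = 1.0    # the target is always flagged
specificity = 0.999  # but 0.1% of everyone else is also flagged

true_positives = 1 * sensitivity
false_positives = (crowd - 1) * (1 - specificity)

flagged = true_positives + false_positives
ppv = true_positives / flagged

print(round(flagged))  # 101 people flagged as the target
print(round(ppv, 3))   # 0.01, a positive predictive value of about 1%
```

Even a near-perfect test drowns its one true match in roughly a hundred false positives when the target's prevalence is one in 100,000.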
The positive predictive value of the Metropolitan Police facial recognition system was 2% - 102 of its 104 matches were false positives. Across 150,000 carnival-goers, that implies a specificity of about 99.93%.
If you were told that a test is 100% sensitive and 99.93% specific, you may well feel an intuitive trust in that test. The law of large numbers says that, given a large enough sample space, observed proportions converge on true probabilities, so any flaws in methodology will be exposed. The law of small numbers says that the model may originally have been tested on a biased sample with an over-abundance - too great a prevalence - of target events relative to the general population.
Given that the UK government is conducting a public study of their facial recognition system, it is clear they intend for it to work as claimed. They want the general public to trust it and are using the best technology at their disposal to achieve that end.
What of the Chinese government’s social credit facial identity system? This is not aimed at a specific controlled event at a specific time, but at the population in total, and at all times and places. How often do you think people are wrongly identified? And, given that social credit is a nebulous term, how easy is it for wrongly identified individuals to protest errors? China is run as a police state, with its population subject to arbitrary arrest. An automated system that can render arbitrary justice is an extension of that form of governance. False positives are a feature, not a failure, of such a system.
You may choose to work to support such a system, or choose not to. You should always be clear on what the systems and tools you build are capable of, and how they fail.
Skepticism is critical for any review of any claim. It is essential that a data scientist be capable of self-reflection and review of their own work and the work of others. If your predictions are based on fallacies, they may still appear tremendously accurate, but that does not make their predictive value meaningful. You may be lucky and the fallacy may happen to correspond with reality, but that makes your work no more valuable than that of a person who wasn’t lucky.
A fallacy may lead to bias, but is a class of error on its own. There are criminals, but deciding that an accident of birth - be it face shape or skin colour - is an indicator of criminality is a fallacy. Using that information to predict future criminality is bias.
Using an intuition based on a fallacy and making a claim to authority to impose that bias on people takes away their autonomy.
An experiment should test an intuition to establish whether it is meaningful. Effectively, what is the probability of a hypothesis being true, and is that hypothesis better than chance?
3.2 Curation: software, interoperability and risks from data reuse
Employ methods for presenting data for synthesis and reuse, and for maintaining data.
There are multiple challenges in preparing and presenting data for reuse, irrespective of whether those data are intended for the general public or only for use inside an organisation (Chait 2024), (Chait 2014).
If data publication systems are to encourage publication and data reuse then they must integrate well with in-house data process management systems. There are three generalised systems used to manage online and public versions of research reports, content and research data:
Content Management Systems (CMS): permit publishing, editing and modifying of qualitative content, as well as providing mechanisms to manage workflows and individual users in a collaborative environment.
Data Discovery Systems (DDS): are similar to CMS but provide mechanisms to manage the semi-structured quantitative and qualitative data in documents and spreadsheets and offer methods for data publication, discovery and reuse.
Business Intelligence Systems (BIS): provide a platform for engaging with structured quantitative data to produce custom slices of that data, charts, tables and geospatial representations.
While it is certainly possible for a single, integrated, software system to serve all of these requirements, this is not a common use-case. Data managers usually operate a number of different systems which often require manual data restructuring for availability in each system.
A visual representation of the various data and software components is presented as follows:
Overview of the technical components and processes in a data publication platform
3.2.1 Microdata and macrodata
Data are often talked about as if they were a single, undifferentiated mass. Statistical data, however, encompass a range of forms, from questionnaires and individual - personally-identifying - responses, through to aggregated tables of numerical values, analysis and text-driven reports.
We can differentiate between:
Microdata: information at the level of individual respondents, households and businesses, typically through surveys; for example, a national census may collect age, address, education, employment status, etc. from individuals.
Macrodata: aggregations derived from microdata; for example, the total number of people of a particular education category.
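The distinction can be made concrete with a toy example: a handful of hypothetical census records (microdata) aggregated into a frequency table (macrodata). The field names and values below are invented for illustration:

```python
from collections import Counter

# Microdata: individual-level records (hypothetical examples)
microdata = [
    {"age": 34, "education": "secondary", "employed": True},
    {"age": 51, "education": "tertiary",  "employed": True},
    {"age": 28, "education": "tertiary",  "employed": False},
    {"age": 67, "education": "primary",   "employed": False},
]

# Macrodata: an aggregate derived from the microdata, e.g. totals by education category
by_education = Counter(person["education"] for person in microdata)
print(by_education)
```

Note that the aggregation is one-way: the macrodata cannot be turned back into the individual records, which is precisely why microdata carry the greater disclosure risk.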
Dissemination of microdata is an issue of particular concern for personally-identifying information. There is often a need, and sometimes a legal obligation, to protect confidentiality and privacy for the providers of microdata. In some cases, provenance of microdata may lie with external partners (such as academic institutions) with their own policies or restrictions on dissemination.
National Statistics Organisations have used a variety of approaches to making microdata publicly available. Some are quite consistent with open data principles; however, many policies may place restrictions on who may access the data and how the data may be used.
There are already platforms that are well suited for distributing microdata online.
For example, the Integrated Public Use Microdata Series (IPUMS) requires researchers to implement security measures, avoid redistribution of microdata, use microdata only for non-commercial research/education purposes, and not make any attempt to identify the individuals recorded.
The International Household Survey Network (IHSN) has developed tools and guidelines to help interested statistical agencies improve their microdata management practices, including a Microdata Cataloging Tool (NADA). NADA allows administrators to specify an access policy for each dataset. Policies can range from “Open access” (similar to “open data”) to “Data not available” (metadata only) for each microdata file.
Additionally, there are an incredible number of knowledge management and data publishing platforms available for general use.
It is common for publishers to have a separate set of dissemination policies - and separate dissemination platforms - for microdata that are more sensitive than their other data products. This dual approach may be beneficial for users, since it highlights that certain microdata are provided with additional conditions with respect to data access.
3.2.2 Proprietary file formats
Data interoperability is a mechanism by which data from different sources can be easily imported and reused in alternative applications. Whether that be multiple hospitals participating in a single clinical trial, or multiple research institutions collaborating on a shared research project, without agreed formats and standards for interoperability these data will either be unusable, or require a major commitment in time and resources for conversion.
One of the criteria for reuse is that data be distributed as machine-readable, non-proprietary electronic files. These formats reduce technical barriers to data access to an absolute minimum for broad categories of users. Open formats include CSV, XML, and text files, while some proprietary formats used in data analysis software products include SAS, STATA, and SPSS.
The latter are legitimate even in an open data initiative since these are the software systems used by many professional data users. However, since these formats are not interoperable, the potential for data re-use is limited unless open formats are also supported. Data dissemination in proprietary formats does not preclude dissemination in open formats and vice versa.
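As a minimal sketch of what an open format buys you, the standard library alone can write and read CSV with no proprietary software involved. The table contents below are illustrative (the Lebanon rate is taken from later in this lesson; the UK figure is invented):

```python
import csv
import io

# A small table of illustrative values
rows = [
    {"country": "Lebanon", "year": 2020, "rate_per_100k": 100},
    {"country": "UK",      "year": 2020, "rate_per_100k": 150},
]

# Write to CSV (an open, machine-readable format); io.StringIO stands in for a file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["country", "year", "rate_per_100k"])
writer.writeheader()
writer.writerows(rows)

# Any other tool can now read the data back without special software
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored[0]["country"])  # Lebanon
```

One caveat of CSV's simplicity: everything comes back as text (`"100"`, not `100`), so type information must be documented in accompanying metadata.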
3.2.3 Application Programming Interfaces (APIs) are public
Interoperability requires an Application Programming Interface (API) through which standardised commands are available to an external system and used to query the data, metadata and other attributes in a database.
Such interoperability permits a range of actions by other software systems, for example:
Import data into another application (such as Python, Jupyter Notebooks, Tableau, R, or Excel) for analysis and merging with other data sources.
Development of free-standing applications, such as transport apps on mobile phones.
Automate repetitive processes, such as setting a routine to regularly download data released monthly.
APIs can also be used by site administrators to automate data harvesting, uploading or similar bulk processes. Such APIs are often connected to ETL systems for data transformation prior to loading.
The most common approach to implementing an API is via a Representational State Transfer (REST) system. The standard commands generally used in REST for creating, reading, updating, or deleting data are POST, GET, PUT, DELETE.
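A REST query is ultimately just a URL with parameters, which is what makes such queries bookmarkable and shareable. The sketch below composes a GET request URL with the standard library; the endpoint address and parameter names are invented for illustration:

```python
from urllib.parse import urlencode

# Hypothetical REST endpoint for a data catalogue (illustrative URL and parameters)
BASE_URL = "https://data.example.org/api/v1/datasets"

def build_query_url(base, **params):
    """Compose a bookmarkable GET request URL from query parameters."""
    return f"{base}?{urlencode(params)}" if params else base

url = build_query_url(BASE_URL, q="breast cancer", format="json", limit=10)
print(url)
# https://data.example.org/api/v1/datasets?q=breast+cancer&format=json&limit=10
```

The other verbs (POST, PUT, DELETE) carry their payloads in the request body rather than the URL, which is why only GET queries can be shared as plain links.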
The more sophisticated software platforms often have a query-builder utility which permits users to experiment live on the server and see the results of different API queries. Such interfaces should permit unique URLs so that particular queries can be bookmarked and shared.
3.2.4 Datasets are reachable via persistent URI
Permanent and persistent availability means not just that data are available, but also that they are always in the same place.
For online systems, this implies a requirement for Uniform Resource Identifiers (URIs) which ensure that resources, content or data are always located through one discrete address for any and all users or software-driven applications. The most familiar of these are the Uniform Resource Locators (URLs) that you see as links to websites. These can also be described as endpoints.
It is essential that these endpoints never change and that their behaviour is predictable. A person who bookmarks a link, or who includes such a link in an article, assumes that it will still be there when they need it.
More generally, a URI can also define a persistent link to a point within a dataset. This permits the interlinking of different datasets, the creation of more complex data aggregations and better insight into that data.
Software should be capable of providing clear and easy-to-find permanent URLs for every dataset being served by the platform. Some systems are capable of providing URIs to subsets of data, or points, within datasets.
3.2.5 Data structure, linked data and endpoints
Data are usually stored in a relational database. Structural metadata permits data contained in relational databases to be aligned and joined. This, however, can be slow and inefficient for large and complex datasets.
The World Wide Web Consortium (W3C) develops web standards. Their approach for data interchange on the web is known as the Resource Description Framework (RDF). It has come to be used as a general method for conceptual description or modeling of information that is implemented in web resources, using a variety of syntax formats.
RDF breaks away from the standard relational database and can be thought of as a graph of entity-relationships of the form: subject, predicate and object. The subject (e.g. John) is linked to the object (e.g. Carol) by a predicate (e.g. ‘is a friend’) and gives rise to the terms Triple Store (the three entities being the ‘triple’) or Linked Data.
RDF subject, predicate, object model (A) combining to form an RDF graph
This image from (Anwar and Hunt 2009). The subject and object are also known as nodes, while the predicate is an edge. A network of nodes linked by edges is called a graph.
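The triple model can be sketched with nothing more than Python tuples. The subject, predicate and object below reuse the names from the example (the extra triples are invented to give the graph something to match against):

```python
# A tiny triple store: each statement is a (subject, predicate, object) tuple
triples = {
    ("John",  "is a friend", "Carol"),
    ("Carol", "is a friend", "Dave"),
    ("John",  "lives in",    "Beirut"),
}

def query(triples, subject=None, predicate=None, obj=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        (s, p, o) for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

print(query(triples, subject="John", predicate="is a friend"))
# [('John', 'is a friend', 'Carol')]
```

Pattern-matching with wildcards is exactly what SPARQL does at scale, with variables such as `?subject` standing in for the `None` wildcards here.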
Numerous implementations of this have resulted in interoperable structured, and machine-readable, metadata systems. There are also, however, numerous legacy approaches to categorising data which have arisen in individual research institutions across the world.
Structured data endpoints return data in predictable ways. These can be as simple as a known type of serialisation format while more complex implementations permit the data to be queried, filtering or refining the dataset prior to download.
What they have in common is that there is a fixed URL to reach the end-point.
A number of commonly-used endpoints are:
JavaScript Object Notation (JSON): an open standard presenting data as a set of key-value pairs; as an end-point, it is accessible via RESTful APIs.
OData: While RDF has achieved a high degree of traction for linking diverse data together, it is still not that straightforward to connect it back to the tools researchers use most frequently to work with data. OData offers a standardised protocol for creating and consuming data. The format is extremely popular, and software as diverse as Tableau (for analysis and visualisation), Drupal (for content management), and Microsoft’s Excel are all able to accept OData as an input.
Extensible Markup Language (XML): a markup language, more difficult for humans to read, found in diverse uses such as encoding Microsoft Office documents, websites and for data interchange; it too can be presented via RESTful APIs.
SPARQL Protocol and RDF Query Language (SPARQL): an RDF database query language able to retrieve and manipulate RDF data, returning it as RDF or – with appropriate interpreters – conversion for use by SQL or other query languages.
3.2.6 Risks from deanonymisation and reuse
Data released to the public in general, or specifically to other researchers, will not be used in isolation. It will be stripped of the original context in which it was used and combined with other data to answer a new research question the original publisher may never have imagined.
There are consequences to this.
In 2014, New York City released data on 173 million individual taxi journeys. Driver’s licence numbers were obscured using a hashing algorithm, which produces a specific pseudo-random text string from a specific input. Researchers found it relatively straightforward to regenerate the original identifiers from the hash. What followed was a case-study in the risks of deanonymisation as researchers shared methods for using these data to compile detailed analyses of individual drivers’ earnings, and to link trips to known locations of celebrities and public figures at known times.
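The weakness exploited was that licence and medallion numbers have a tiny, structured value space, so every possible input can be hashed and matched against the release. The sketch below uses an invented four-character identifier format; only the enumeration technique reflects what the researchers did:

```python
import hashlib

def md5_hex(text):
    """Hash an identifier the way a naive anonymisation step might."""
    return hashlib.md5(text.encode()).hexdigest()

# "Anonymised" identifiers as released: hashes of licence numbers (illustrative format)
released = {md5_hex("5X41"), md5_hex("9J77")}

# Because the identifier space is tiny (26,000 possibilities here), we can hash
# every possible value and build a reverse lookup table
rainbow = {}
for a in "0123456789":
    for b in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
        for c in "0123456789":
            for d in "0123456789":
                candidate = f"{a}{b}{c}{d}"
                rainbow[md5_hex(candidate)] = candidate

recovered = [rainbow[h] for h in released if h in rainbow]
print(sorted(recovered))  # ['5X41', '9J77']
```

A salted, slow hash (or replacing identifiers with random tokens held in a separate lookup table) would have defeated this attack; a plain hash of a small identifier space is not anonymisation at all.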
There are certainly legitimate reasons to release such data, but redaction needs to be carefully done. Even so, when a database of photographs used in a psychology study is repurposed to train an image classifier to identify potential criminals, then the data are being asked to perform tasks which the publisher may not want, or may not believe are even possible from their data.
As more organisations which manage data release their data under API access, there are more cases of deanonymisation from recombination. Whether it be Strava, maker of a fitness app logging its users’ exercise routines, accidentally revealing the locations of secret US army bases, or the future risk of individual genomes released to the public, these issues become more likely the more data are released.
And that is before we consider the risks from arbitrary data collated in bulk. Clearview AI, a secretive US-based surveillance company, scraped billions of publicly available photographs from social media and have used it to create a facial recognition system used by law enforcement agencies.
Data may be anonymous in isolation, but once recombined with other seemingly anonymous data, patterns emerge that can reveal information of value to the public, but potentially of harm to uninvolved individuals. Probabilistic systems, research fallacies, and social bias mean that some proportion of the people targeted by these systems will experience significant harm.
Appropriate licences can ensure that legitimate users only reuse data in ways that are acceptable to a research publisher, but cannot prevent inadvertent or deliberate disclosure, or outright malfeasance. As we get deeper into this course, you will learn mechanisms for deanonymisation, and gain more skills in reducing this risk. What we cannot do is prevent it entirely, unless we choose to stop research publication.
3.3 Analysis: predictive value of synthetic data
Perform techniques in randomness and probability to understand distribution and prevalence.
3.3.1 Improving our synthetic population modeller
In the last lesson we developed a synthetic sample profile to generate random individuals. We can do the same here, this time simulating a population of women of all ages and their risk for developing breast cancer.
According to the World Cancer Research Fund, the age-standardised breast cancer rate per 100,000 in Lebanon is about 100. Purely for testing a null hypothesis, we’ll also generate their foot size. Women’s feet in Europe average 245 mm in length, with a standard deviation of 10 mm (i.e. about 68% of women have feet from 235 mm to 255 mm long). (Jurca, Žabkar, and Džeroski 2019)
The Jurca foot analysis study is worth a read if only to see a paper that uses much of the theory and tools taught in this course.
For the purposes of our analysis, let’s create a population that reflects the distribution of 100,000 women in terms of their risk from breast-cancer at specific ages, and their foot size.
According to the US National Cancer Institute, the age-associated risk for diagnosis with breast cancer is as follows:
| Range  | Risk  |
|--------|-------|
| Age 30 | 0.48% |
| Age 40 | 1.53% |
| Age 50 | 2.38% |
| Age 60 | 3.54% |
| Age 70 | 4.07% |
We can generate a list of uniformly-distributed random floating-point numbers from 0 to 1. If a person is 30 years old, and their random number is less than 0.0048 (i.e. 0.48%), then they have breast cancer. This is our conditional probability.
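In isolation, that sampling step looks like this (seeded here only so the result is reproducible):

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded for reproducibility

# 100,000 thirty-year-olds, each assigned a uniform random number in [0, 1);
# a draw below 0.0048 (0.48%) marks that individual as having breast cancer
draws = rng.uniform(size=100_000)
has_cancer = draws < 0.0048

print(f"Simulated prevalence: {has_cancer.mean():.2%}")
```

The simulated prevalence will hover around 0.48% but not hit it exactly; the sampling noise shrinks as the population grows, which is the law of large numbers at work.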
We also need to create a skewed age distribution; one that reflects an older population. These normally have a distribution charted like this:
Population pyramid for Wales using 2011 census data
Plotting this for a population of 100,000 people using a synthetic population generator produces this histogram:
Code
"""Let's improve the synthetic data algorithm developed in the last lesson to make it easier to create any profile. We'll start by defining a schema for specifying anycharacteristic.We also need to handle conditional characteristics, such as if a smoker has a greaterrisk of lung cancer, or if age > 40 and female has a greater risk of breast cancer.Technically, since the function won't generate the random distribution, we can provideany list of values. As long as the length of the array is the same as the number of individuals in our synthetic population, the function will work."""from matplotlib import pyplot as pltimport matplotlib as mplimport pandas as pdimport numpy as npfrom scipy.stats import trapzimport random, uuid, math, jsondef create_synthetic_population(**kwargs):""" Return a population for a group of patients defined by a population characteristics schema: [{ "name": "name", "values": [], "conditional": [ { "name": "name", "values": [], "range": (from, to), "conditional": [] } ] }] This is relatively simple, and we may need to improve it in future, but it will work for now. The schema can be deeply nested and so relies on recursion. 
This function will add the following to each population member: id: a unique anonymous string coordinates: random coordinates for plotting on a scatter plot, not meaningful Args: total: int, default 10000 groups: int, default 4 profile: list of dicts which each conforms to schema Returns: list of dicts """# Get arguments population_total = kwargs.get("total", 10000) colour_list = [F"C{n}"for n inrange(kwargs.get("groups", 4))] population_profile = kwargs.get("profile", [])# Generate the colour coordinates to plot on a grid cc =int(len(colour_list)/2)# if the count is uneveniflen(colour_list)%2: cc =int((len(colour_list)-1)/2) cc_coords = [i+1for i inrange(cc)]*2# And add in the extra groupiflen(colour_list)%2: cc_coords.append(cc+1) cc_coords.sort() colour_coordinates = {F"C{n}": ((20*c, 20) if n%2else (20*c, 40)) for n, c inenumerate(cc_coords) } distance =20# Create the population population = {"id": np.array([str(uuid.uuid4()) for i inrange(population_total)]),"color": np.random.choice(colour_list, population_total),"coordinates": [] }for c in population["color"]: population["coordinates"].append( ( np.random.uniform(colour_coordinates[c][0], colour_coordinates[c][0] + distance), np.random.uniform(colour_coordinates[c][1], colour_coordinates[c][1] + distance) ) )# Get the individual characteristics# https://stackoverflow.com/a/26853961for pp in population_profile:# Iterate over the list of profiles population = {**population, **add_profile_field(pp, population_total)}# Convert the dictionary of arrays into an array of dictionaries# where each dict item represents a single person# https://stackoverflow.com/a/33046935 (we could use pandas as well) population = [dict(zip(population, v)) for v inzip(*population.values())]# Shuffle the ordered list to create a random population np.random.shuffle(population)return populationdef add_profile_field(field, size):""" Return a list of profile fields as part of an individual in a population. 
Recursive function - it calls itself until an endpoint is reached - so as to traverse conditional fields and ensure those dependencies are met. { "name": "name", "values": [], "conditional": [ { "name": "name", "values": [], "range": (from, to), "conditional": [] } ] } Args: field: dict of specific terms size: int of total population to be created Returns: dict of fields with value lists """ response = {} sample = np.random.choice(field["values"], size=size, replace=False) response[field["name"]] = []for c_field in field.get("conditional", []):iflen(c_field["range"]) ==2:# Filter the sample so that we get the c_sample = sample[(sample >= c_field["range"][0]) & (sample < c_field["range"][1])]else:# only 1 value in range, so is open-ended to end c_sample = sample[sample >= c_field["range"][0]]# Populate the conditional fields response[field["name"]].extend(c_sample) c_response = add_profile_field(c_field, len(c_sample))for k, v in c_response.items():if k notin response: response[k] = [] response[k].extend(v)# If there were no conditionals, then ensure response has a sample valueifnot response[field["name"]]: response[field["name"]] = sample# This should result in a list of fields all with arrays of the same lengthreturn responsedef generate_age_distribution(mean=45, low=5, high=60, total=100, maxAge=100):""" Generate a list of ages of population size 'total', and distributed in a trapezoidal weighting with 'mean', 'low' and 'high'. 
https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.trapz.html Args: mean: int, average age for population low: int, lower-edge of trapezoid weighting distribution high: int, upper-edge of trapezoid weighting distribution total: int, total number of values to be returned maxAge: int, maximum age, default is 100 Returns: list of ages """# All input values are a ratio of 1 d =100 mean, low, high = mean/d, low/d, high/d mean, var, skew, kurt = trapz.stats(low, high, moments='mvsk') ages = np.linspace(trapz.ppf(0.01, low, high), trapz.ppf(0.99, low, high), 100) probability_distribution = trapz(low, high).pdf(ages) probability_distribution = probability_distribution/sum(probability_distribution) distribution = np.random.choice(ages, total, p=probability_distribution) *100return distribution.astype(int)total_population =100000age_distribution = generate_age_distribution(total=total_population)plt.hist(age_distribution, density=True, histtype="stepfilled", alpha=0.2)plt.show()
Figure 3.5: A synthetic age-distributed population
It’s not a perfect reflection of a population age distribution, but it isn’t bad for a few lines of code generating random numbers. We can now go on to producing the rest of our population profile and creating a random list of people.
Finally, we need a universal test with a specified sensitivity and specificity. Given a random patient and a set of criteria, it should return whether that person has, or not, the target event we’re testing for.
Since ours is a “magic” universal test, it can also tell us whether or not it is returning a true or false positive.
We could, at this point, sample our population in some way, and test them and see how things go. Let’s assume, however, that a government has decided that all women with a “large” foot size need to be identified as they have decided that such women are likely to be witches.
They have set up a camera at the entrance to a popular shopping mall and this is connected to a machine intelligence system that can measure foot size as people pass below, and flag which women need to be stopped and arrested as potential witches.
Except the magic test isn’t really a test. It’s a person flipping a coin. Heads they’re a witch, tails they’re not.
How does it do?
Code
def get_universal_test_results(sample, sensitivity=0.5, specificity=0.5):
    """
    Given an array of True and False corresponding to a sample population's
    actual status, return the universal test measurement result with an
    accuracy corresponding to the test sensitivity and specificity.

    Prints out a report on the test result accuracy

    Args:
        sample: array of True, False
        sensitivity: float, corresponding to test sensitivity, default 0.5
        specificity: float, corresponding to test specificity, default 0.5

    Returns:
        numpy array of test results
    """
    # https://numpy.org/doc/stable/reference/generated/numpy.where.html
    response = np.where(
        sample,
        # If actually true, the probability that the test returns true
        sample * (np.random.random_sample((len(sample), )) < sensitivity),
        # If actually false, the probability that the test returns false
        ~sample * (np.random.random_sample((len(sample), )) > specificity)
    )
    s = len(sample)                                             # sample size
    tr = len(response[response==True])                          # true test response
    fr = s - tr                                                 # false test response
    ap = len(sample[sample==True])                              # actual positive
    an = s - ap                                                 # actual negative
    tp = len(sample[(sample == response) & (sample == True)])   # true positive of response
    tn = len(sample[(sample == response) & (sample == False)])  # true negative of response
    fn = ap - tp                                                # false negative of response
    fp = an - tn                                                # false positive of response
    # https://en.wikipedia.org/wiki/Sensitivity_and_specificity#Prevalence_Threshold
    if sensitivity + specificity == 1:
        prevalence_threshold = 0.5
    else:
        prevalence_threshold = ((math.sqrt(sensitivity*(-specificity + 1)) - 1 + specificity)
                                / (sensitivity + specificity - 1))
    print("Universal Test Results")
    print("----------------------")
    print(F"{tr:,} positive test results for sample (n = {s:,}), with {ap/s*100:.1f}% sample prevalence.")
    print("----------------------")
    # '\t' is the special 'tab' character
    print(F"Actual positive = {ap:,}\t|\t Actual negative = {an:,}")
    print(F"True positive = {tp:,}\t|\t True negative = {tn:,}")
    print(F"False positive = {fp:,}\t|\t False negative = {fn:,}")
    print("----------------------")
    print(F"Positive predictive value (PPV) = {tp/(tp+fp)*100:.1f}%")
    print(F"Prevalence threshold = {prevalence_threshold*100:.1f}% \t|\t Actual prevalence = {ap/s*100:.1f}%")
    print(F"Accuracy = {(tp+tn)/s*100:.1f}%")
    return response

# Total sample size
population_size = 100000
# Breast cancer stats
age_distribution = generate_age_distribution(total=population_size)
# Foot size stats
mu, sigma = 245, 10  # mean and standard deviation
length_distribution = np.random.normal(mu, sigma, population_size).astype(int)
# Profile schema
profile = [
    {
        "name": "age",
        "values": age_distribution,
        "conditional": [
            {
                "name": "breast_cancer",
                # Generate a set of random numbers from 0 to 1. If <= n, then set True
                "values": np.random.uniform(size=population_size) <= 0.0,
                "range": [0, 30]
            },
            {
                "name": "breast_cancer",
                "values": np.random.uniform(size=population_size) <= 0.0048,
                "range": [30, 40]
            },
            {
                "name": "breast_cancer",
                "values": np.random.uniform(size=population_size) <= 0.0153,
                "range": [40, 50]
            },
            {
                "name": "breast_cancer",
                "values": np.random.uniform(size=population_size) <= 0.0238,
                "range": [50, 60]
            },
            {
                "name": "breast_cancer",
                "values": np.random.uniform(size=population_size) <= 0.0354,
                "range": [60, 70]
            },
            {
                "name": "breast_cancer",
                "values": np.random.uniform(size=population_size) <= 0.0407,
                "range": [70]
            },
        ]
    },
    {
        "name": "foot_length",
        "values": length_distribution,
        "conditional": [
            {
                "name": "foot_size",
                "values": np.array(["small"]*population_size),
                "range": [0, 235]
            },
            {
                "name": "foot_size",
                "values": np.array(["medium"]*population_size),
                "range": [235, 255]
            },
            {
                "name": "foot_size",
                "values": np.array(["large"]*population_size),
                "range": [255]
            }
        ]
    }
]
kwargs = {"total": population_size, "groups": 4, "profile": profile}
synthetic_population = create_synthetic_population(**kwargs)
# We're going to convert our population from a list of dicts to a dict of lists:
test_population = {k: np.array([d[k] for d in synthetic_population])
                   for k in synthetic_population[0]}
# We can use the 'foot_size' or 'foot_length' fields, depending on how we feel
response = get_universal_test_results(test_population["foot_length"] > 255)
Universal Test Results
----------------------
49,948 positive test results for sample (n = 100,000), with 13.5% sample prevalence.
----------------------
Actual positive = 13,500 | Actual negative = 86,500
True positive = 6,830 | True negative = 43,382
False positive = 43,118 | False negative = 6,670
----------------------
Positive predictive value (PPV) = 13.7%
Prevalence threshold = 50.0% | Actual prevalence = 13.5%
Accuracy = 50.2%
Figure 3.6
It wrongly identified about 43,000 women, and missed more than 6,600. Which - since our test is a literal coin-toss - is as expected. What happens, though, if we improve the sensitivity and specificity dramatically? How well does it work when sensitivity and specificity are each 95%?
Universal Test Results
----------------------
17,131 positive test results for sample (n = 100,000), with 13.5% sample prevalence.
----------------------
Actual positive = 13,500 | Actual negative = 86,500
True positive = 12,795 | True negative = 82,164
False positive = 4,336 | False negative = 705
----------------------
Positive predictive value (PPV) = 74.7%
Prevalence threshold = 18.7% | Actual prevalence = 13.5%
Accuracy = 95.0%
Figure 3.7
Our more accurate test wrongly identifies only around 4,300 women. Women with a foot length larger than 255 mm make up about 13.5% of the population. The prevalence threshold of 18.7% indicates that positive predictive value drops sharply when the actual prevalence falls below it.
The prevalence threshold equation is independent of the population size, type of test or question being asked: PT = (√(sensitivity × (1 − specificity)) + specificity − 1) / (sensitivity + specificity − 1).
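As a check, the threshold can be computed directly; this is the same formula used inside `get_universal_test_results` above:

```python
import math

def prevalence_threshold(sensitivity, specificity):
    """Prevalence below which positive predictive value falls away sharply."""
    if sensitivity + specificity == 1:
        return 0.5  # the test is no better than chance
    return ((math.sqrt(sensitivity * (1 - specificity)) + specificity - 1)
            / (sensitivity + specificity - 1))

print(f"{prevalence_threshold(0.95, 0.95):.1%}")  # 18.7%, matching the report above
```

Plugging in any sensitivity and specificity lets you judge, before deploying a test, whether the population you plan to screen has enough prevalence for positive results to be trustworthy.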
If you have a test that is right 95% of the time, you are subjecting 5% of a population to the risk of being found positive when they are not. The scale of your error depends very much on the prevalence of the characteristic you’re testing for. You need to decide whether that risk is worth it.
Our test isn’t answering the question as to whether or not a randomly-selected woman is a witch. The question it answers is whether her foot length is greater than 255 mm.
Perhaps you think this is silly, but how much more ridiculous is it to treat face shape, eye-brow curvature or skin colour as having predictive power for criminality?
3.3.2 Assessing the predictive value for screening tests
Let’s use our synthetic population to answer our research question. If a woman actually has breast cancer, how well can a test tell us if she does have cancer?
These tests are more appropriately called screening tests or even medical surveillance, being a test or procedure performed on subjects of a defined asymptomatic population to assess the likelihood of their members having a particular disease or target feature. (Maxim, Niebo, and Utell 2014)
Screening tests do not diagnose an illness, but identify those needing further diagnostic tests or procedures. For example, a pap smear for cervical cancer, or cholesterol levels for heart disease. The strength of a screening test is that it is looking for characteristics that research has identified as being correlated with a specific illness.
There are multiple risks from using a derivative test, but the most important are that it must actually serve to identify the target being sought (i.e. it should ask the right question), and it must have a high positive predictive value.
The logical possibilities for a true target state and screening test outcome are:

| Test result | Subject has target | Subject target free | Subtotal |
|-------------|--------------------|---------------------|----------|
| Positive | Correct result | False positive | Total positive test results |
| Negative | False negative | Correct result | Total negative test results |
| Subtotal | Total target subjects | Total target-free subjects | Total subjects |
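Each cell of this table maps to a standard metric. A minimal set of definitions, using the counts from the standard mammography simulation later in this lesson purely as a worked example:

```python
def screening_metrics(tp, fp, fn, tn):
    """Standard screening metrics from the four cells of the 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true targets found
        "specificity": tn / (tn + fp),  # share of target-free correctly cleared
        "ppv": tp / (tp + fp),          # chance a positive result is real
        "npv": tn / (tn + fn),          # chance a negative result is real
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Worked example: the standard mammography counts from later in this lesson
m = screening_metrics(tp=984, fp=15_856, fn=485, tn=82_675)
print(round(m["ppv"], 3), round(m["accuracy"], 3))  # 0.058 0.837
```

Note how a test can be 83.7% accurate overall while fewer than 6% of its positive results are correct.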
Unless absolutely obvious, or there is no alternative, a screening test is benchmarked against a gold standard test (Maxim, Niebo, and Utell 2014). This is a diagnostic test regarded as definitive. It provides 100% positive predictive value under all circumstances; it has 100% sensitivity and 100% specificity.
“As science increases its hold on the practice of medicine we become more aware of the limitations of the clinical method. Unfortunately, we also become more aware of the limitations of various diagnostic tests. Nevertheless, at any given time there may well be a consensus that a given test in a given situation is the best available test. It therefore serves as the gold standard against which newer tests can be compared. When enough data have accumulated to make that gold standard untenable, it can perfectly reasonably be replaced by another. This can then preside until it too is toppled.”(Versi 1992)
The problem is that some questions need answers even in the absence of a perfect gold standard. We may know how to treat an illness, but only if we know that a patient is suffering from it. Stretching the metaphor, imperfect tests are called alloyed gold standards.
There is a constant search for tests that trade off cost against efficacy, so being able to read and evaluate research which presents new approaches, procedures or tests is a critical part of a data scientist’s professional development.
| Bias | Description |
|------|-------------|
| Verification (work-up) bias | Non-random selection for definitive assessment of disease with the gold standard reference test |
| Errors in the reference | True disease status is subject to misclassification because the gold standard is imperfect |
| Spectrum bias | Types of cases and controls included are not representative of the population |
| Test interpretation bias | Information is available that can distort the diagnostic test |
| Unsatisfactory tests | Tests that are uninterpretable or incomplete do not yield a test result |
| Extrapolation bias | The conditions or characteristics of populations in the study are different from those in which the test will be applied |
| Lead time bias | Earlier detection by screening may erroneously appear to indicate beneficial effects on the outcome of a progressive disease |
| Length bias | Slowly progressing disease is over-represented in screened subjects relative to all cases of disease that arise in the population |
| Overdiagnosis bias | Subclinical disease may regress and never become a clinical problem in the absence of screening, but is detected by screening |
It is sometimes easy to imagine that, beyond some inconvenience, a false positive or false negative has little consequence.
Numerous studies (Maxim, Niebo, and Utell 2014) have demonstrated the contrary. The consequences to individuals and to society at large are enormous: from polygraph tests (“lie” detectors) falsely accusing the innocent, to invasive biopsies, with all their consequent costs, performed on healthy people wrongly identified as having cancer.
There are further consequences. Radiographic imaging (CT scans used for mammography, colonography) carry a risk of radiation-induced carcinogenesis. And a too-perfect test can be so sensitive that it identifies disease at such an early stage that it isn’t clear whether it will ever result in real concern or what - if anything - should be done about it.
Your immune system can, and does, combat cancer. Some cancers are never dangerous. Identifying them causes anxiety and potentially unnecessary medical interventions.
Any test we assess must serve to limit these negative consequences.
Which leads us to our two papers, and the screening tests they examine:
Positron emission tomography (PET) with 2-[18F]fluoro-2-deoxy-d-glucose (FDG) (Greco et al. 2001)
Combined digital mammography and breast tomosynthesis (Rafferty et al. 2013)
Summarising the study results against the existing mammography standard looks as follows:
| Test | Sample size | Actual positive | Actual negative | Sensitivity (%) | Specificity (%) | PPV (%) | Sample prevalence (%) |
|------|-------------|-----------------|-----------------|-----------------|-----------------|---------|------------------------|
| Mammography | | | | 65.5 | 84.1 | 42.9 | |
| Tomosynthesis | 622 | 99 | 523 | 76.2 | 89.2 | 56.2 | 15.9 |
| PET | 167 | 72 | 95 | 94.4 | 86.3 | 84.0 | 43.1 |
The sample chosen, and the prevalence of a target within a sample population, have a significant impact on the predictive power identified in a study.
The relationship of sensitivity, specificity, and prevalence to the overall accuracy of a screening or diagnostic test is important. The less prevalent the disease, the greater the weight applied to specificity in calculating overall accuracy; conversely, the more prevalent the disease, the greater the weight applied to sensitivity. Moreover, when disease prevalence is very low or very high, extreme differences in sensitivity and specificity lead to overall accuracy deviating considerably from sensitivity or specificity, respectively (Alberg et al. 2004).
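The weighting Alberg et al. describe follows directly from the identity that overall accuracy is the prevalence-weighted average of sensitivity and specificity (the simulated outputs later in this lesson differ by a fraction of a point because they are random draws):

```python
def overall_accuracy(sensitivity, specificity, prevalence):
    """Overall accuracy as the prevalence-weighted average of
    sensitivity (on the diseased) and specificity (on the healthy)."""
    return sensitivity * prevalence + specificity * (1 - prevalence)

# At 1.5% population prevalence, mammography's accuracy is almost all specificity
print(round(overall_accuracy(0.655, 0.841, 0.015), 3))  # 0.838
# At 43.1% prevalence (the PET study sample), sensitivity carries far more weight
print(round(overall_accuracy(0.944, 0.863, 0.431), 3))  # 0.898
```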
The claimed sensitivity and specificity are themselves shaped by the characteristics of the study population. This means you can’t just look at the results in isolation: even a “good” test needs to be considered in terms of the study design.
Many countries around the world invite women for regular breast-cancer screening tests. Usually women between the ages of 50 and 70 are invited for these tests. That population prevalence for breast cancer looks quite different to the sample above.
Let’s filter our population of 100,000 women for only those between the ages of 50 and 70, and run each of these tests. We start with the current standard, mammography, with sensitivity 65.5% and specificity 84.1%.
Code
# First, filter our synthetic sample population
test_population = [p for p in synthetic_population if (p["age"] >= 50) and (p["age"] <= 70)]
# Convert the filtered list of dicts to a dict of arrays
test_population = {k: np.array([d[k] for d in test_population]) for k in test_population[0]}
# Run the standard mammography
response = get_universal_test_results(test_population["breast_cancer"], sensitivity=0.655, specificity=0.841)
Universal Test Results
----------------------
16,840 positive test results for sample (n = 100,000), with 1.5% sample prevalence.
----------------------
Actual positive = 1,469 | Actual negative = 98,531
True positive = 984 | True negative = 82,675
False positive = 15,856 | False negative = 485
----------------------
Positive predictive value (PPV) = 5.8%
Prevalence threshold = 33.0% | Actual prevalence = 1.5%
Accuracy = 83.7%
Figure 3.8
The actual population prevalence, at 1.5%, is far lower than in either of the studies. The test missed about 500 women, and wrongly identified a further 16,000 women for further testing, which could include the gold standard of an invasive surgical biopsy.
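`get_universal_test_results` is a helper defined earlier in the course, so its internals aren’t shown here. If you want to experiment without it, a minimal self-contained sketch of the same idea, with `simulate_test` as a hypothetical stand-in that returns only the simulated outcomes, might look like this:

```python
import numpy as np

def simulate_test(actual, sensitivity, specificity, rng=None):
    """Simulate a screening test over a boolean array of true statuses.

    A truly-positive subject tests positive with probability `sensitivity`;
    a truly-negative subject tests positive with probability 1 - `specificity`.
    """
    rng = rng or np.random.default_rng(42)
    draws = rng.random(len(actual))
    return np.where(actual, draws < sensitivity, draws < 1 - specificity)

# A perfect test reproduces the true statuses exactly
actual = np.array([True, False, True, False, False])
perfect = simulate_test(actual, sensitivity=1.0, specificity=1.0)
print(perfect.tolist())  # [True, False, True, False, False]
```

Summary statistics such as PPV and accuracy then follow directly from comparing `actual` with the simulated response.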
Do either of the studies do better?
Code
# Combined tomosynthesis and digital mammography
response = get_universal_test_results(test_population["breast_cancer"], sensitivity=0.762, specificity=0.892)
Universal Test Results
----------------------
11,713 positive test results for sample (n = 100,000), with 1.5% sample prevalence.
----------------------
Actual positive = 1,469 | Actual negative = 98,531
True positive = 1,122 | True negative = 87,940
False positive = 10,591 | False negative = 347
----------------------
Positive predictive value (PPV) = 9.6%
Prevalence threshold = 27.4% | Actual prevalence = 1.5%
Accuracy = 89.1%
Figure 3.9
The addition of tomosynthesis increases accuracy by about five percentage points and removes some 5,000 false positives from the results; however, it still misses about 350 women with actual breast cancer. The PET results below use the same routine with the study’s sensitivity of 94.4% and specificity of 86.3%.
Universal Test Results
----------------------
15,052 positive test results for sample (n = 100,000), with 1.5% sample prevalence.
----------------------
Actual positive = 1,469 | Actual negative = 98,531
True positive = 1,399 | True negative = 84,878
False positive = 13,653 | False negative = 70
----------------------
Positive predictive value (PPV) = 9.3%
Prevalence threshold = 27.6% | Actual prevalence = 1.5%
Accuracy = 86.3%
Figure 3.10
The PET test has a far lower false-negative count, missing only 70 patients, but at the cost of more false positives than tomosynthesis. Its false-positive count is still lower than mammography’s, though only by about 2,000.
But we still come back to a fundamental problem. The prevalence of breast cancer in the general screening population is only 1.5%. At that prevalence, any test is vulnerable to even small weaknesses in specificity.
Both tests have a lower prevalence threshold, about 27% compared with mammography’s 33%. Both almost double the positive predictive value, from 6% to about 10% of those tested, which still means that 90% of those receiving a positive result are actually negative. What this requires is that we send women for testing only when there is reasonable suspicion that they need it. If 1 in 4 women sent for testing had some reasonable ground for suspicion, we would achieve far greater accuracy.
What’s interesting in the tomosynthesis study is that its sample (prevalence of 15.9%) reflects a realistic referred population far more closely than the PET study’s (prevalence of 43.1%). The PET study has a similar specificity to mammography, but its sample dramatically overstates actual prevalence and so flatters the test’s overall predictive value.
The reality is that neither test is useful for mass screening (even if better than mammography on its own), and that there is a genuine trade-off between false positives and false negatives which must be considered. If the sample prevalence is closer to 16%, though, then a test with lower false positives is likely to be more useful.
That doesn’t mean the tests aren’t useful. They’re clearly better than mammography alone, but context is everything. If this is the best non-invasive screening test at present, then there should be no mass-screening program. Or the test itself should be treated as a two-step process, in which it is communicated up front that every positive result automatically triggers a second screening test; the first step raises prevalence among those retested to about 10%, substantially improving the second step’s predictive value.
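The arithmetic behind both the referral and two-step arguments can be sketched with Bayes’ theorem. Note the simplifying assumption that a repeat test is independent of the first, which repeat imaging of the same lesion would not be:

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Tomosynthesis at general-population prevalence: PPV under 10%
print(round(ppv(0.762, 0.892, 0.015), 3))    # 0.097
# If 1 in 4 women referred genuinely warrant testing, PPV jumps to ~70%
print(round(ppv(0.762, 0.892, 0.25), 3))     # 0.702
# Two-step screening: retest first-round positives at their new ~10% prevalence
stage_1 = ppv(0.762, 0.892, 0.015)
print(round(ppv(0.762, 0.892, stage_1), 3))  # 0.431
```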
That’s a political decision, rather than technical one, and brings us back to ethics.
3.4 Presentation: communication and visualisation
Apply appropriate plots to communicate probability to a lay public.
There are two groups of people to whom we may wish to communicate our research findings.
The first are fellow research professionals. We can expect them to have a grounding in scientific methodology, mathematics, probability and statistics, and even experience and direct knowledge of the research subject area. Histograms, scatter plots and box plots are familiar, and well-presented tables and charts are aids to rapid review and analysis.
If something is unclear, or our methods and analysis don’t support our conclusions, our peers won’t accept what we say because we’re “data scientists”. They’ll push back, ask questions and - if we’re lucky - help us correct and improve our work.
Members of the lay public are far more diverse. Some have no technical or research skills at all. Some are highly technical, but their experience offers them no context for evaluating our specific research area. Nevertheless, they will have strong opinions, and - more commonly than we would prefer - will take from your research what they will.
Most researchers communicate primarily to a peer audience first, and are left struggling to catch up should their work end up causing some controversy in the public domain.
3.4.1 Positive tests in a low prevalence environment
As in earlier lessons, what follows are a set of building-blocks you can think about and integrate into your work. They are not a set of rules, and are not universally appropriate.
Dimensions provides high-quality vector graphics (SVGs) intended as architectural reference drawings, which we’re going to use to humanise our visualisations.
We’ll start by creating a sample space of 100 individuals which reflects the US population of convicted felons. A 2017 study by Shannon et al. estimates that 8% of the total US adult population, but 33% of the African American adult male population, has a felony conviction. The US is a substantial outlier in convicting and imprisoning its population, jailing about 700 people per 100,000 and holding the largest prison population on Earth.
We will create a synthetic population of 100 people: 15 will be African American, of whom 5 will be convicted felons, and the remaining 85 will be non-African American, of whom 3 will be convicted felons.
This next chart draws only the convicted felons in the population.
Code
from svgpath2mpl import parse_path
import json

# Prepare a chart using our people silhouettes as markers

def parse_svg(s):
    """
    Process a vector graphic into a format that can be used by matplotlib.
    https://stackoverflow.com/a/58552620

    An SVG is a set of vector points, so we can normalise the vector
    (remove random white-space placing the image at an arbitrary point)
    by centering the average at 0. We can similarly reorientate the image
    if it is upside down.

    Args:
        s: SVG path

    Returns:
        Matplotlib marker path
    """
    s = parse_path(s)
    s.vertices -= s.vertices.mean(axis=0)
    # Do this if they're upside down
    # s.vertices[:, 1] *= -1
    return s

def load_json(source):
    """
    Load and return a JSON file, if it exists.

    Args:
        source: the filename & path to open

    Raises:
        JSONDecodeError if not a valid json file
        FileNotFoundError if not a valid source

    Returns:
        dict
    """
    with open(source, "r") as f:
        try:
            return json.load(f)
        except json.decoder.JSONDecodeError as err:
            e = "File at `{}` not valid json.".format(source)
            raise json.decoder.JSONDecodeError(e, err.doc, err.pos)

# First we create arrays for the African-American population
aa_total = 15
aa_fel_total = 5
pop_total = 100
pop_fel_total = 8
aa_pop = [True] * aa_total
aa_fel = [True] * aa_fel_total + [False] * (aa_total - aa_fel_total)
# Then for the rest of the population
na_pop = [False] * (pop_total - aa_total)
na_fel = [True] * (pop_fel_total - aa_fel_total) + [False] * (pop_total - aa_total - pop_fel_total + aa_fel_total)
population = {
    "population": np.array(na_pop + aa_pop),
    "felons": np.array(na_fel + aa_fel),
}
# Convert to a list of dicts and shuffle our population
population = [dict(zip(population, v)) for v in zip(*population.values())]
np.random.shuffle(population)
# And reconvert to a dict of arrays
population = {k: np.array([d[k] for d in population]) for k in population[0]}
mini_people = load_json("media/images/svg-people.json")
mini_men = [parse_svg(m) for m in mini_people["male-silhouette"]]
total_population = 100
cols = 20
rows = 5
# Create a matrix of x and y points in a neat grid
x = np.tile(np.arange(cols), rows)
y = np.repeat(np.arange(rows), cols)
# Assign random silhouette markers to our population
marker = np.random.choice(mini_men, total_population)

fig, ax = plt.subplots(figsize=(12, 6), facecolor="w")
ax.axis("off")
for i in range(cols * rows):
    c = "C1"
    a = 0.1
    if population["felons"][i]:
        a = 0.8
    ax.scatter(x[i], y[i], c=c, alpha=a, marker=marker[i], s=3000, clip_on=False)
plt.margins(x=0.09, y=0.09)
plt.show()
Figure 3.11: Convicted felons in the general population
This chart draws only the African Americans in the population.
Code
fig, ax = plt.subplots(figsize=(12, 6), facecolor="w")
ax.axis("off")
for i in range(cols * rows):
    c = "C1"
    a = 0.1
    if population["population"][i]:
        a = 0.8
    ax.scatter(x[i], y[i], c=c, alpha=a, marker=marker[i], s=3000, clip_on=False)
plt.margins(x=0.09, y=0.09)
plt.show()
Figure 3.12: African Americans in the general population
We will now simulate a typical criminality facial recognition system used by law-enforcement. Let’s assume 95% sensitivity and 95% specificity. Our system can’t identify criminality directly, but it assumes that race is a proxy for criminality in line with common US prejudice.
This chart draws the population at large with blue for those who were identified as potential criminals who are also convicted felons, and those wrongly identified as potential criminals as olive green. Unidentified convicted felons are marked in dark orange. It does not indicate race.
Code
response = get_universal_test_results(population["population"], sensitivity=0.95, specificity=0.95)

fig, ax = plt.subplots(figsize=(12, 6), facecolor="w")
ax.axis("off")
for i in range(cols * rows):
    c = "C1"
    a = 0.1
    if population["felons"][i]:
        # Undetected felon stays dark orange
        a = 0.8
    if response[i]:
        a = 0.8
        # False positive
        c = "C8"
        if population["felons"][i]:
            # True positive
            c = "C0"
    ax.scatter(x[i], y[i], c=c, alpha=a, marker=marker[i], s=3000, clip_on=False)
plt.margins(x=0.09, y=0.09)
plt.show()
Universal Test Results
----------------------
21 positive test results for sample (n = 100), with 15.0% sample prevalence.
----------------------
Actual positive = 15 | Actual negative = 85
True positive = 15 | True negative = 79
False positive = 6 | False negative = 0
----------------------
Positive predictive value (PPV) = 71.4%
Prevalence threshold = 18.7% | Actual prevalence = 15.0%
Accuracy = 94.0%
Figure 3.13: Predictions for potential criminals in the US
Note the problem with this, though. That’s not the answer to the question we asked. The textual output is at odds with the graphical presentation.
Why?
You wanted to know who was a potential criminal, but your system has instead identified nearly every African American. All those people in olive were falsely categorised as untrustworthy and suffer the consequences. Let’s draw that chart in a way that better reflects the question that was actually asked: who are the African Americans?
Code
response = get_universal_test_results(population["population"], sensitivity=0.95, specificity=0.95)

fig, ax = plt.subplots(figsize=(12, 6), facecolor="w")
ax.axis("off")
for i in range(cols * rows):
    c = "C1"
    a = 0.1
    if population["population"][i]:
        # Unidentified African American stays dark orange
        a = 0.8
    if response[i]:
        a = 0.8
        # False positive
        c = "C8"
        if population["population"][i]:
            # True positive
            c = "C0"
    ax.scatter(x[i], y[i], c=c, alpha=a, marker=marker[i], s=3000, clip_on=False)
plt.margins(x=0.09, y=0.09)
plt.show()
Universal Test Results
----------------------
18 positive test results for sample (n = 100), with 15.0% sample prevalence.
----------------------
Actual positive = 15 | Actual negative = 85
True positive = 15 | True negative = 82
False positive = 3 | False negative = 0
----------------------
Positive predictive value (PPV) = 83.3%
Prevalence threshold = 18.7% | Actual prevalence = 15.0%
Accuracy = 97.0%
Figure 3.14: Who are the African Americans?
There are 8 convicted felons in this population, yet the algorithm flagged around 20 people, and any felons it “found” are there only by coincidence of race. In reality, it missed what it was intended to look for almost entirely. This is not a fault of the algorithm, but of a human research-driven decision to treat a test for ethnicity as if it were a test for trustworthiness.
This is a useful illustration of fallacies in action, but even when research is on your side, prevalence can let you down. Let’s demonstrate this with our synthetic population and our screening test for breast cancer. This time green for true positive, red for false positive, and orange for a false negative.
There are 100,000 people in this sample. We’ll draw only the first 400 results from the random population.
Code
mini_women = [parse_svg(m) for m in mini_people["female-silhouette"]]
cols = 40
rows = 10
x = np.tile(np.arange(cols), rows)
y = np.repeat(np.arange(rows), cols)
marker = np.random.choice(mini_women, population_size)
response = get_universal_test_results(test_population["breast_cancer"], sensitivity=0.762, specificity=0.892)

fig, ax = plt.subplots(figsize=(16, 15), facecolor="w")
ax.axis("off")
for i in range(cols * rows):
    c = "C1"
    a = 0.1
    if test_population["breast_cancer"][i]:
        # Undiagnosed cancer stays dark orange
        a = 0.8
    if response[i]:
        a = 0.8
        # False positive
        c = "C3"
        if test_population["breast_cancer"][i]:
            # True positive
            c = "C2"
    ax.scatter(x[i], y[i], c=c, alpha=a, marker=marker[i], s=6000, clip_on=False)
plt.margins(x=0.09, y=0.09)
plt.show()
Universal Test Results
----------------------
11,682 positive test results for sample (n = 100,000), with 1.5% sample prevalence.
----------------------
Actual positive = 1,469 | Actual negative = 98,531
True positive = 1,086 | True negative = 87,935
False positive = 10,596 | False negative = 383
----------------------
Positive predictive value (PPV) = 9.3%
Prevalence threshold = 27.4% | Actual prevalence = 1.5%
Accuracy = 89.0%
Figure 3.15: Who has breast cancer?
3.4.2 Intimidated by randomness
Mastery in research comes not from skill with software or algorithms, but from experience and the ability to navigate the ambiguity of randomness in pursuit of high-probability causation.
The objective of presentation is appropriate communication: thinking carefully about what would best convey the realities of your research, and the prevalence of a target event within a population, to a non-specialist audience.
Probability is especially challenging for a lay public to understand, being the intersection of maths, randomness, and prevalence.
Good communication includes both the words we write to explain what we did and what we found, as well as the charts we use to summarise and explain what we looked at and what we discovered.
Here are some of the things we need to communicate:
Randomness means that experiencing, or not experiencing, a research event can be a lot like winning the lottery - it doesn’t always conform to social conventions of fairness. A low probability does not mean no probability, and a high probability is not certainty.
Prevalence means that some research events are extremely rare, or extremely common, and the way in which experimental sampling reflects actual prevalence may have a huge impact on the results of any experiment.
Small sample sizes can increase the influence of outliers or rare events, making them seem more common than they really are.
Small sample sizes may exclude rare, but real, events entirely, leading to a false sense of security.
High prevalence of targets in a sample relative to actual prevalence in a population risks over-confidence in test accuracy.
There’s no reason to over-simplify or dumb down research findings when talking to those outside your research field. What is required, though, is a greater awareness of the gap between what you know and what they know; some ramp or path into your findings is needed.
The simplest way to do this is to use the same charts as you would normally, but change the structure and emphasis. That does not mean always drawing lots of pictures of little silhouetted people, but a little extra effort can convert a scatter plot of probabilities into something more relatable.
While you can certainly hope for a good-faith response to what you present, you should not assume that any lack of comprehension is due to stupidity. The responsibility lies with the researcher to present their findings clearly and appropriately, in a way that resonates with others. But motivated bad faith exists too.
3.5 Group tutorial
Exercise
It is 4pm on a Thursday afternoon, the day before a long weekend. NHS NICE has asked you to brief an aide to the health minister at 18h30, on his way out of the building before he leaves on his vacation to Cornwall. You have just received one of the following two research papers to review and summarise. You have two hours to consider the value of the paper, and will then have 15 minutes to present to the adviser. You may prepare one slide or visual aid which may, or may not, be used in the meeting.
Tong MJ, Blatt LM, Kao VW. Surveillance for hepatocellular carcinoma in patients with chronic viral hepatitis in the United States of America. J Gastroenterol Hepatol. 2001 May; 16(5):553-9. doi: 10.1046/j.1440-1746.2001.02470.x. PMID: 11350553
Ogawa K, Oida A, Sugimura H, Kaneko N, Nogi N, Hasumi M, Numao T, Nagao I, Mori S. Clinical significance of blood brain natriuretic peptide level measurement in the detection of heart disease in untreated outpatients: comparison of electrocardiography, chest radiography and echocardiography. Circ J. 2002 Feb;66(2):122-6. doi: 10.1253/circj.66.122. PMID: 11999635.
What do you want this person to know about the paper results? What do you want them to do about it?
“The Use of ‘Overall Accuracy’ to Evaluate the Validity of Screening or Diagnostic Tests” (Alberg et al. 2004) also has links to other papers you can review. As your skills develop, you will gain confidence in ever-deeper review but, for now, focus on prevalence, specificity and sensitivity.
You are also welcome to conduct your own review. There are a host of publication repositories which support redistribution of published research. Here are a few and you can explore them to run your own reviews:
Europe PMC: Europe PubMed Central, an open-access repository of biomedical research works
Alberg, Anthony J, Ji Wan Park, Brant W Hager, Malcolm V Brock, and Marie Diener-West. 2004. “The Use of ‘Overall Accuracy’ to Evaluate the Validity of Screening or Diagnostic Tests.” Journal of General Internal Medicine 19 (5 Pt 1): 460–65. https://doi.org/10.1111/j.1525-1497.2004.30091.x.
Anwar, Nadia, and Ela Hunt. 2009. “Francisella Tularensis Novicida Proteomic and Transcriptomic Data Integration and Annotation Based on Semantic Web Technologies.” BMC Bioinformatics 10 (10): S3. https://doi.org/10.1186/1471-2105-10-S10-S3.
———. 2024. “Auditable and Reusable Crosswalks for Fast, Scaled Integration of Scattered Tabular Data.” https://arxiv.org/abs/2409.01517.
Diez, David M, Christopher D Barr, and Mine Çetinkaya-Rundel. 2015. OpenIntro Statistics. Third edition. OpenIntro.org. https://www.openintro.org/.
Greco, Marco, Flavio Crippa, Roberto Agresti, Ettore Seregni, Alberto Gerali, Riccardo Giovanazzi, Andrea Micheli, et al. 2001. “Axillary Lymph Node Staging in Breast Cancer by 2-Fluoro-2-Deoxy-d-Glucose–Positron Emission Tomography: Clinical Evaluation and Alternative Management.” JNCI: Journal of the National Cancer Institute 93 (8): 630–35. https://doi.org/10.1093/jnci/93.8.630.
Jurca, Ales, Jure Žabkar, and Sašo Džeroski. 2019. “Analysis of 1.2 Million Foot Scans from North America, Europe and Asia.” Scientific Reports 9 (1): 19155. https://doi.org/10.1038/s41598-019-55432-z.
Lam, Diana L., Pari V. Pandharipande, Janie M. Lee, Constance D. Lehman, and Christoph I. Lee. 2014. “Imaging-Based Screening: Understanding the Controversies.” AJR. American Journal of Roentgenology 203 (5): 952–56. https://doi.org/10.2214/AJR.14.13049.
Maxim, L. Daniel, Ron Niebo, and Mark J. Utell. 2014. “Screening Tests: A Review with Examples.” Inhalation Toxicology 26 (13): 811–28. https://doi.org/10.3109/08958378.2014.955932.
Pishro-Nik, Hossein. 2014. Introduction to Probability, Statistics, and Random Processes. Kappa Research LLC. https://www.probabilitycourse.com/.
Rafferty, Elizabeth A., Jeong Mi Park, Liane E. Philpotts, Steven P. Poplack, Jules H. Sumkin, Elkan F. Halpern, and Loren T. Niklason. 2013. “Assessing Radiologist Performance Using Combined Digital Mammography and Breast Tomosynthesis Compared with Digital Mammography Alone: Results of a Multicenter, Multireader Trial.” Radiology 266 (1): 104–13. https://doi.org/10.1148/radiol.12120674.
Safra, Lou, Coralie Chevallier, Julie Grèzes, and Nicolas Baumard. 2020. “Tracking Historical Changes in Trustworthiness Using Machine Learning Analyses of Facial Cues in Paintings.” Nature Communications 11 (1): 4728. https://doi.org/10.1038/s41467-020-18566-7.
Vu, Julie, and David Harrington. 2020. Introductory Statistics for the Life and Biomedical Sciences. First edition. OpenIntro.org. https://www.openintro.org/book/biostat/.
Wu, Xiaolin, and Xi Zhang. 2016. “Automated Inference on Criminality Using Face Images.” arXiv:1611.04135 [cs], November. http://arxiv.org/abs/1611.04135.