My main research interests are in probability and statistics, and their application to science and engineering.
January 2017 - RESOLVIt, A Stochastic Model of the DNA Forensics Pipeline.
With my collaborator, the forensic scientist Catherine Grgicak (BU), Kelsey C. Peters and Genevieve Wellner in her lab, and my ex-graduate student at MIT, Neil Gurram, we recently published a stochastic model that produces simulated electropherograms, the core measurement output in DNA forensics. The model encodes the standard lab procedure, capturing sampling effects, allele drop-out, PCR amplification and stutter, and inherent instrument noise. Its construction was informed and parameterized by a huge number of samples created in Catherine's lab. As those samples are costly and time-consuming to create, the model's main boon is the ability to quickly and cheaply create simulated mixtures with known contributors, with which to explore modifications of lab procedure that improve inference.
November 2016 - Clonal Cellular Programmes in the Immune System.
With my long-time immunology collaborator, Phil Hodgkin, members of his lab at the Walter and Eliza Hall Institute of Medical Research, Julia Marchingo, Andrey Kan and Susanne Heinzel, and my student Giulio Prevedello, we have published a paper in Nature Communications. The article introduces a new technique that allowed us to determine the division state of the offspring of individual cells in a relatively high-throughput way that is significantly less labor intensive than microscopy methods, and that has broad applicability.
The experiment design is based on a multiplex of division diluting dyes. In the early 1990s, two Australian scientists, Lyons and Parish, identified a green fluorescent dye, CFSE, that cells can imbibe without impact on their function. When a stained cell divides, its daughters fluoresce with half the intensity of their parent, and their daughters with half again, allowing division numbers to be determined. In the past few years, a new generation of dyes in a range of colours has been created. To enable determination of the division history of individual families, we designed a system in which cells could be stained with a unique combination of colours. This enables observation of the lineages of hundreds of founding cells through multiple generations, as illustrated below with two dyes, CFSE and CTV.
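As a back-of-envelope illustration of the halving principle (a sketch of our own, not the paper's analysis pipeline; the function name and the rounding step are assumptions), a cell's generation can be read off from its dye intensity:

```python
import math

def divisions_from_intensity(initial_intensity, measured_intensity):
    """Fluorescence halves with each division, so the number of divisions
    is the log base 2 of the intensity ratio, rounded to the nearest
    integer to absorb measurement noise."""
    return round(math.log2(initial_intensity / measured_intensity))
```

For example, a cell measured at one eighth of the founder's intensity has undergone three divisions.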
In the paper, we use the approach to consider the initial clonal expansion of T cells in response to challenge and address the question of how individual clones integrate distinct stimuli. The finding, contrary to what would have been our expectation until recently, is that the initial cell is programmed independently by the stimuli to execute a regimented family expansion programme. The significant heterogeneity that had previously been described at the population level comes about as a consequence of heterogeneity in the depths of regular individual trees, rather than from frayed family trees.
In order to address these questions and interrogate the data from a newly developed method, we had to develop some pretty new mathematics and statistics related to the concatenation of directed trees.
June 2016 - The Design of a DNA Coded Randomized Algorithm For In Situ Lineage Tracing.
Tom Weber, our combinatorialist colleague Mark Dukes from University College Dublin, three wet lab collaborators from the Walter and Eliza Hall Institute of Medical Research, Denise Miles, Stefan Glaser and Shalin Naik, and I recently published a paper in BMC Systems Biology describing the optimal design of a randomized algorithm that can be encoded in the DNA of a cell. The programme starts with what can be thought of as a deck of cards, as well as mechanisms for shuffling and picking. Upon activation by the delivery of a drug, a cell executes the randomized algorithm, shuffling and removing cards from the deck until it is left with either one or three cards, whereupon the algorithm terminates.
Conceptually, the end effect is that each cell shuffles and picks either one or three cards from the deck, which are then left in that cell's DNA. As the offspring of that cell get copies of those cards, the cards label them as having a shared ancestor, at least for rare combinations. Starting with a deck of 13 cards, one execution of the algorithm would look as follows:
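A toy in-silico version of such a shuffle-and-discard procedure might look as follows. This is our own sketch of the end effect only; the paper's in-DNA recombination mechanism differs in detail, and the discard rule here is a simplification:

```python
import random

def run_barcode_algorithm(deck_size=13, rng=None):
    """Toy sketch: shuffle a deck, then repeatedly discard a random
    number of cards, stopping once exactly one or three remain. The
    surviving cards are the cell's heritable label."""
    rng = rng or random.Random()
    deck = list(range(deck_size))
    rng.shuffle(deck)
    while len(deck) not in (1, 3):
        # discard between 1 and len(deck) - 1 cards, never emptying the deck
        for _ in range(rng.randint(1, len(deck) - 1)):
            deck.pop()
    return tuple(sorted(deck))
```

Because the deck strictly shrinks and can never empty, the loop always terminates with a one- or three-card label.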
January 2016 - The Design of a DNA Coded Randomized Algorithm For Inferring Tree Depth.
How many cell divisions take place between birth and the construction of a brain? Are there more cell divisions in the development of skin than of gut? Perhaps surprisingly, these are questions that one cannot answer at present. In a paper published in the Journal of Mathematical Biology by Tom Weber, Leïla Perié and myself, we have established a formula, which I personally think is remarkable, that may allow us to answer these questions:
The core of the mathematics is based on newly established properties of the fundamental model of growing trees, the Bellman-Harris process first studied in 1948, which assumes substantial statistical homogeneity. We do, however, also provide a derivation of a weaker form of the formula in broad generality and establish through simulations its accuracy and utility for trees that are far from samples of a Bellman-Harris process, such as the early development of C. elegans, as illustrated here:
December 2015 - Cellular Barcoding and Erythropoiesis.
All blood cells, red and white, originate from Hematopoietic Stem Cells (HSCs), which reside in bone marrow. The reason, for example, that a bone marrow transplant can be needed after extreme chemotherapy is that cancer treatments sometimes, as an unwanted side effect, destroy the blood system, and the transplant reseeds it with HSCs from the donor. Given that significance, how, exactly, the transplanted HSCs reboot the blood system has been a significant topic of research in the medical sciences since the discovery of HSCs in the 1960s.
Before becoming a red or white blood cell, of which there are many types, a series of maturation steps takes place where offspring of HSCs progressively acquire specificities. That is, the cells increasingly commit to produce only restricted classes of cell type. When and where that commitment occurs is an important matter as, for example, it identifies where errors that lead to disease must occur and determines where intervention should take place to change outcomes.
The majority of studies determining those commitment steps have been in vitro (i.e. in test tubes), which is the primary distinction of our work, "The branching point in erythro-myeloid differentiation", published in Cell with my colleagues Leïla Perié, Lianne Kok, Rob J. de Boer and Ton N. Schumacher. Leïla designed and performed, with Lianne's aid, a series of experiments using a novel system called cellular barcoding that was developed within Ton Schumacher's lab at the Netherlands Cancer Institute. That methodology enables us to observe this commitment in vivo (i.e. within the body). The present work focuses on where in this chain of events the commitment to erythropoiesis, i.e. red blood cell production, takes place. Our analysis of the data from those experiments showed that the commitment is earlier than previously thought.
December 2015 - Quantifying the Computational Security of Multi-User Systems.
The security of many systems is predicated on the following logic: a user selects a string, for example a password, from a list; an inquisitor who knows the list can query each string in turn until gaining access by chancing upon the user's chosen string; the resulting system is deemed to be computationally secure so long as the list of strings is large.
Implicit in that definition is the assumption that the inquisitor knows nothing about the likely nature of the selected string. If instead one assumes that the inquisitor knows the probabilities with which strings are selected, then the number of queries required to identify a stochastically selected string, and the quantification of computational security, becomes substantially more involved. In the mid 1990s, James Massey and Erdal Arikan initiated a programme of research on the susceptibility of such systems to brute force interrogation, dubbed Guesswork, to which we have been contributing in recent years with collaborators at MIT and, more recently, Duke University.
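A minimal sketch of the central quantity: when the inquisitor knows the selection probabilities, the optimal strategy queries strings in decreasing order of probability, and the expected number of queries follows directly (the function name is ours):

```python
def expected_guesswork(probs):
    """Expected number of queries when strings are guessed in decreasing
    order of probability, the inquisitor's optimal strategy: the i-th
    most likely string is found on query i."""
    ordered = sorted(probs, reverse=True)
    return sum(i * p for i, p in enumerate(ordered, start=1))
```

For a uniform choice over four strings this gives 2.5 queries on average, while a skewed distribution such as (0.7, 0.1, 0.1, 0.1) needs only 1.6, which is why the distribution, and not the list size alone, governs security.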
With my recently graduated Ph.D. student, Mark Christiansen, my colleague Muriel Médard (MIT) and her recently graduated student, Flávio du Pin Calmon (now IBM Research), our most recent contribution has just been published in IEEE Transactions on Information Theory. In it, we address a natural extension of this investigation of brute force searching: the quantification for multi-user systems. We were motivated both by classical systems, such as brute force entry to a multi-user computer where the inquisitor need only compromise a single account, and by modern distributed storage services where coded data is kept at distinct sites in such a way that, due to coding redundancy, several, but not all, servers need to be compromised to access the content.
The results establish that, from an inquisitor's point of view, there is a law of diminishing returns as the number of potentially hacked accounts increases. From a designer's point of view, and coming full circle to Massey's original observation that Shannon entropy has little quantitative relationship to how hard it is to guess a single string, Shannon entropy turns out to play a fundamental role in the multi-user setting.
November 2015 - Forensic Analysis of DNA.
When cells are taken from a crime scene, a substantial amount of processing is necessary before DNA fingerprints can be determined. This processing results in both noise and artifacts being added to the measured signal. In a paper published in Forensic Science International: Genetics, my colleagues Ullrich Mönich (now TU Munich), Muriel Médard (MIT), Viveck Cadambe (now Penn. State U.), Lauren Alfonse (BU) and Catherine Grgicak (BU) and I characterize the most fundamental element of this process, baseline noise, for circumstances in which there is little initial DNA, as often occurs with crime-scene samples. Historically, this noise has been modeled as Gaussian when present, but the data generated by Lauren and Catherine does not support that hypothesis and instead suggests that a heavy-tailed distribution is more appropriate. This has ramifications for how noise should be treated in continuous models and how filters should be calibrated by crime labs.
June 2015 - Network Coding, Constraint Satisfaction, Guesswork, and Network Infusion.
Three pieces of work on quite distinct topics will be presented at international conferences this month, one of which is a prize-winner.
Ying Cui will present one in London at the IEEE International Conference on Communications, where it has won the best paper award of the Communication Theory Symposium and an overall best paper award. It is a contribution on Network Coding, relating it to constraint satisfaction problems, and was written with collaborators Muriel Médard (MIT), Edmund Yeh (Northeastern) and Doug Leith (Trinity College Dublin).
Ahmad Beirami (Duke & MIT) will present another, written with Robert Calderbank (Duke), myself and Muriel Médard (MIT), at the IEEE International Symposium on Information Theory in Hong Kong. The work is an investigation of computational security subject to source constraints.
The last, but certainly not least, will be presented by Soheil Feizi (MIT) at the International Conference on Computational Social Science in Finland. It is based on work in progress with myself, Manolis Kellis (MIT) and Muriel Médard (MIT). It forms our contribution to the question of how one identifies the source, or sources, of an epidemic, an idea or activity in a social network application. Soheil presented an earlier version of the work at the RECOMB/ISCB Conference on Regulatory and Systems Genomics in San Diego in November 2014.
May 2015 - Information Theory and Cryptography.
In 1949, Claude Shannon published one of his opuses, Communication Theory of Secrecy Systems, in which he introduced a framework for a mathematical theory of cryptography. At a workshop of the 8th International Conference on Information-Theoretic Security (ICITS), Flávio du Pin Calmon presented a piece entitled Revisiting the Shannon Theory approach to cryptography, written with Mayank Varia, Muriel Médard, Mark M. Christiansen, myself, Linda Zeger and Joao Barros, revisiting the topic.
November 2014 - An Algebra for Immune Responses.
Lymphocytes, the key players in an adaptive immune response, have long been known to need multiple signals to mount an effective defense. That discovery led to the two signal theory of T cell activation, the principle being that two independent signals, antigen followed by costimulation, ensure that lymphocyte expansion is only initiated in response to genuine infection. Redundancy in this second signal has long posed a conundrum in the field.
A paper published today in the journal Science brings this theory up to date with a quantitative edge. The work was led by the Walter and Eliza Hall Institute of Medical Research's Philip D. Hodgkin and Susanne Heinzel, driven by Phil's Ph.D. student Julia M. Marchingo, in collaboration with Hodgkin lab members Andrey Kan, Robyn M. Sutherland, Cameron J. Wellard and Mark R. Dowling, two other WEHI lab heads, Gabrielle T. Belz and Andrew M. Lew, and myself. While the signaling of antigen, costimulation and cytokines is complex and involved, clever experimentation in concert with computer modelling and mathematics revealed a simple additive algebra of T cell expansion; one that puts the old conundrum to bed and will, hopefully, provide a quantitative paradigm for therapeutically manipulating immune response strength.
September 2014 - Randomness, Determinism and Immune Responses.
One of the difficult challenges in science is doing justice to other people's hypotheses when framing them. Steven L. Reiner, one of the main instigators behind investigating the role of asymmetric cell division in immunology, and William C. Adams have neatly laid out a deterministic description of adaptive immune responses in a fascinating opinion piece published in Nature Reviews Immunology. My Australian colleagues Philip D. Hodgkin, one of the main proponents of stochastic processes in immunology, Mark R. Dowling and I have written a brief response in defense of randomness in correspondence to the same journal. The community has not yet acquired the data required to answer how deterministic and stochastic processes interleave to build the complete immune response, but with so many different groups designing experiments and doing analysis based on distinct hypotheses, we're all looking forward to the time when that resolution occurs.
July 2014 - Forensic Analysis of DNA.
The basis of a DNA fingerprint is the measurement of the number of repeats of microsatellites, short repeated sequences of DNA, at various locations in the genome. While given combinations of numbers of repeats are (largely) unique to individuals, in forensics applications the number of repeats often cannot be measured precisely. This occurs as the amount of DNA at a crime scene can be small, requiring amplification before the measurement takes place, which can create false signals or missing data, or the sample itself may be a composite from several individuals, in which case one gets a combined, mixed measurement.
My colleagues Catherine Grgicak, a Forensic Scientist at Boston University, Desmond Lun, a Computer Scientist at Rutgers, and Muriel Médard, an Electronic Engineer at MIT, have been working in an interdisciplinary team, which I have recently joined, to carefully investigate this topic. The first piece of work I have been involved in is a paper to be presented at Asilomar in November 2014 entitled A signal model for forensic DNA mixtures by members of Catherine and Muriel's labs: Ullrich J. Mönich, Catherine Grgicak, Viveck Cadambe, Jason Yonglin Wu, Genevieve Wellner, Ken Duffy and Muriel Médard. The paper details fundamental statistics of DNA fingerprints created by standard forensics techniques in controlled circumstances in an attempt to build a descriptive model of noise generated in the process.
February 2014 - Cellular Barcoding and Inferring Lineage Pathways.
There are many instances where one would wish to know the familial relationships of cells, to be able to identify those that came from the same parent. For example, the purpose of a bone-marrow transplant is to place new hematopoietic stem cells into the recipient so that their blood system can be rebuilt. Does each stem cell contribute equally to this rebuilding? Does chance play a role? Are some stem cells specialized? These are all questions that can only be tackled if one can identify cells from the same progenitor.
Cellular barcoding, which Ton N. Schumacher's lab has been at the forefront of developing, is an experimental method in which small pieces of non-functional DNA are inserted into otherwise identical cells. Each small piece, a barcode, can then be found in the progeny of that cell, and so those in a common family can be identified. This technique has led to many high-profile results in the past year in fields from immunology and cancer to blood system development, including several from Ton's lab.
In a data analysis contribution to the Cellular Barcoding technique, my colleagues Leïla Perié, Philip D. Hodgkin, Shalin H. Naik, Ton N. Schumacher, Rob J. de Boer, and I have published a paper in Cell Reports that develops a mathematical framework for the analysis of data that comes from cellular barcoding experiments. The framework is designed to draw inferences about the compatibility of potential lineage pathways with the data. As an exemplar, we study data from the blood system taken from one of these high-profile results, published in Nature by S. H. Naik, L. Perié, E. Swart, C. Gerlach, N. van Rooij, R. J. de Boer and Ton N. Schumacher. The analysis suggests the classical model of hematopoiesis is not consistent with the data and, inspired by in vitro deductions from the 80s, we propose an alternate lineage pathway that is consistent.
January 2014 - Integrated Random Walks and Large Busy Periods.
Ever since an elegant paper by Venkat Anantharam in 1989, it has been known that the way a random walk becomes large, or how a big queue builds up, is by a piecewise linear path. The question of how a large integrated random walk, or a large volume of work done during a busy period of a queue, occurs was first tackled by A. A. Borovkov, O. J. Boxma and Z. Palmowski in a paper in 2003.
In a paper to appear in Stochastic Systems, Sean Meyn and I establish that the most likely path to a large busy period is concave in general and, remarkably, has a simple form, following a rescaled, upside-down version of the scaled cumulant generating function. Moreover, the path starts and ends in the same fashion as the most likely path to a long queue as identified by Ayalvadi J. Ganesh and Neil O'Connell in 2002.
For 10 billion simulated paths with i.i.d. Gaussian increments, below is a graph of the random walk that hits the highest height and the prediction from Ganesh and O'Connell conditioned on the height. Also plotted is the largest integrated random walk and our prediction conditioned on the area under the curve. The predicted paths start and end with the same slope, but diverge in between.
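A much smaller version of that experiment can be sketched as follows (our own illustration; the figure's conditioning on a given height or area is omitted, and the function names are ours):

```python
import random

def walk_extremes(n_steps, rng):
    """Simulate one random walk with i.i.d. standard Gaussian increments;
    return the walk's maximum height and the maximum of its integrated
    path (the running sum of the walk's partial sums)."""
    position, area = 0.0, 0.0
    max_height, max_area = 0.0, 0.0
    for _ in range(n_steps):
        position += rng.gauss(0.0, 1.0)
        area += position
        max_height = max(max_height, position)
        max_area = max(max_area, area)
    return max_height, max_area

def scan_paths(n_paths, n_steps, seed=0):
    """Scan independent paths for the largest height and largest
    integrated area seen, as in the figure's experiment (with far
    fewer than 10 billion paths)."""
    rng = random.Random(seed)
    results = [walk_extremes(n_steps, rng) for _ in range(n_paths)]
    return max(h for h, _ in results), max(a for _, a in results)
```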
November 2013 - Guesswork, Wiretap and Erasure Channels.
Devices such as door-card readers transmit passwords over wireless channels. Unintended receivers, eavesdroppers, are typically distant and observe the communication over a noisy channel that distorts it. How hard is it for the eavesdropper to fill in the erased gaps, and how does that difficulty depend on the properties of the noisy channel?
Complementing earlier work, with a distinct interpretation of guesswork and wiretap channels, by Neri Merhav and Erdal Arikan, as well as by Manjesh Kumar Hanawal and Rajesh Sundaresan, this is a subject on which, with our collaborators Flávio du Pin Calmon and Muriel Médard at MIT, Mark Christiansen and I had a paper at this year's Asilomar Conference on Signals, Systems & Computers. The main observation is that the average noise on the channel is not the determining factor in how difficult the task is for the eavesdropper; instead it is another average of the noise, a moment, in which Rényi entropy again appears.
October 2013 - Limits of Statistical Inference.
A paper, which was driven by our collaborators Flávio du Pin Calmon at MIT and Mayank Varia at Lincoln Laboratory, was presented by Muriel Médard at this year's Allerton Conference on Communication, Control, and Computing. The work establishes bounds on how much can be inferred about a function of a random variable's value if only a noisy observation of the variable is available.
September 2013 - Constraint Satisfaction and Rate Control.
IEEE/ACM Transactions on Networking papers are like buses: you wait a long time for one and then two come at once. The work on Decentralized Constraint Satisfaction that is mentioned below has appeared in the August issue of the journal. The final paper from my ex-student Kaidi Huang's doctoral thesis also appears in that issue. The work, performed in collaboration with my Hamilton Institute colleague David Malone, addresses a practical problem in wireless local area networks: rate control. When transmitting packets, WiFi cards must select a physical-layer rate at which to transmit. In principle, the faster the rate, the less robust the transmission is to noise. For the poorly engineered rates of standard WiFi, this is not strictly true, as Kaidi, David and I demonstrated in a paper published in IEEE Transactions on Wireless Communications in 2011. In the present work, we proposed a rate adaptation scheme based on opportunity cost and Bayesian decision making that was, thanks to the hard work of Kaidi and David, demonstrably implementable on standard hardware and that outperforms the standard algorithms.
July 2013 - Information Theory, Guesswork & Computational Security.
The late Kai Lai Chung purportedly said that there is only one theorem in Information Theory: the Asymptotic Equipartition Property. At the 2013 IEEE International Symposium on Information Theory, Mark Christiansen presented a preliminary version of a surprising result that we established in collaboration with our friends from MIT, Flávio du Pin Calmon and Muriel Médard. Namely, despite the fact that everything within a Typical Set has, by construction, approximately equal probability, it is exponentially easier as a function of word length to guess a word chosen from a Typical Set than a naive application of the AEP would have you deduce. This has ramifications for physical layer security.
January 2013 - Guesswork & Information Theory.
How hard is it to guess someone's password? More importantly, how do you measure how hard it is to guess someone's password? It is a question that has received less attention than one would expect. The recently deceased information theorist James Massey published a fascinating one-page paper at ISIT on this question in 1994, and it is the topic of a paper that my student, Mark Christiansen, and I have had published in IEEE Transactions on Information Theory. Building on beautiful work of others, which began with Erdal Arikan and included a contribution from my Hamilton Institute colleague David Malone, that identified Rényi entropy as a key player in the quantification of guesswork, in the article we provide a direct estimate of the likelihood that it takes a given number of guesses to identify a randomly selected object.
January 2013 - Networking & Medium Access Control.
There are two fundamental paradigms for sharing a resource such as a wireless medium. One can empower a centralized controller who gathers everyone's requirements and sets out a schedule of who gets access to the resource when. This system is used, for example, for speakers at every conference and in cell phone networks. The alternative is not to be prescriptive, but to listen before speaking and, when the medium becomes silent, randomly interject. This is used, for example, in human conversation, as well as forming the basis for the standard access method, IEEE 802.11, for WiFi networks.
Both of these systems have advantages and disadvantages. If one knows exactly who needs what resources and can agree on a central party to adjudicate, the former is most efficient. If one is uncertain about the number of users and their requirements, the latter is more robust to this uncertainty, but suffers from collisions - people talking over one another and thus wasting resources.
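To make the collision trade-off concrete, here is a toy sketch of a slotted random-access channel (our own illustration, not the model in the paper; the function name and parameters are assumptions):

```python
import random

def slotted_collision_rate(n_users, p, n_slots, seed=0):
    """In each slot, each of n_users independently transmits with
    probability p. Return the fraction of slots with a success (exactly
    one transmitter) and with a collision (two or more transmitters)."""
    rng = random.Random(seed)
    successes = collisions = 0
    for _ in range(n_slots):
        transmitters = sum(rng.random() < p for _ in range(n_users))
        if transmitters == 1:
            successes += 1
        elif transmitters > 1:
            collisions += 1
    return successes / n_slots, collisions / n_slots
```

With ten users each transmitting in a tenth of slots, in expectation only about 39% of slots carry a successful transmission while roughly 26% are wasted in collisions, which is the inefficiency a collision-free schedule eliminates.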
In a paper that has been published in the journal Wireless Networks, written with my ex-student Minyu Fang, and colleagues at the Hamilton Institute David Malone and Douglas J. Leith, we investigate a system that has the best of both worlds: a decentralized stochastic algorithm is used to obtain a collision-free schedule.
October 2012 - Constraint Satisfaction Problems.
Is it possible to find a global solution to a problem, with everyone only having local information and being unable to see the global picture? That is the topic of a paper, written with my colleagues the French mathematician Charles Bordenave and Douglas J. Leith, my Scottish engineering colleague from the Hamilton Institute, that has been accepted for publication in IEEE/ACM Transactions on Networking. Two undergraduates, TCD's John Roche and National University of Ireland Maynooth's Tony Poole, created Java applets that illustrate how the proposed approach, which provably works, solves problems. They also illustrate that, in this setting, it is easier to agree to disagree than it is to find a consensus.
September 2012 - Immunology & Cell Biology
My Australian immunology collaborator Philip Hodgkin and I had a paper published in Trends in Cell Biology. It expands on the hypothesis we employed in our earlier 2012 paper, published in Science, to explain cell fate diversity. Artwork based on the paper, illustrated by WEHI's Jie H. S. Zhou, was used for the edition's cover:
January 2012 - Immunology & Cellular Biology.
With colleagues from the Walter and Eliza Hall Institute of Medical Research led by Philip D. Hodgkin, I had a paper published in Science. The piece analyses data from an experiment that took my Australian colleagues, particularly Mark Dowling and John Markham, four years to perfect. It enabled us to directly observe the times at which an important class of cells in the immune system, the antibody secreting B lymphocytes, make decisions on when and how to combat a pathogen. Despite the data looking immensely complex, it is consistent with a remarkably simple, holistic hypothesis.
The group on sabbatical at MIT in 2016, where we were generously hosted by Muriel Médard (L-R: Gianfelice, Harry, Alex, Giulio, Neil, myself and Tom):