One Unarchived Monte Carlo Seed Haunts a Computational Ecology Paper
In 2018, a team of ecologists published a paper in a high-impact journal arguing that seed-dispersal limitation (the failure of seeds to reach suitable sites) was a primary driver of tree diversity in tropical forests. The claim rested on an agent-based simulation that ran tens of thousands of stochastic iterations. But when another group tried to reproduce the ensemble a year later, they hit a wall. The authors had deposited their R scripts on a public repository, but the Monte Carlo seed, the single integer that initializes the pseudorandom number generator, was missing. Without it, no one could replicate the exact sequence of random draws that produced the published results. The seed, it turned out, survived only as an email attachment in the university account of a co-author who had since left academia, and that account was no longer accessible. The paper’s central conclusion became, in effect, unverifiable.
This incident is not an outlier. It is a symptom of a systemic weakness in how computational ecology—and much of computational science—handles the evidence it produces. The missing seed is a concrete, tractable problem, but it points to deeper questions about funding, incentives, infrastructure, and training. This article follows one seed’s trajectory through a research ecosystem that rarely rewards the work of archiving it, and asks what would need to change so that future seeds are not lost.
The Seed That Should Not Have Been Lost
The Monte Carlo seed for the 2018 forest-gap model was a 9-digit integer, chosen arbitrarily by the lead author when they set up the simulation. In any pseudorandom number generator, that seed determines the entire stream of random values that follow. Change the seed, and every subsequent random draw changes. In a simulation that relies on thousands of stochastic events (whether a plant seed lands in a gap, whether a seedling survives herbivory, whether a tree falls), different seeds produce different outcomes. The ensemble of runs, typically 1,000 or more, is supposed to average over this variability. But if the seed is not recorded, no one can confirm that the ensemble was generated as described.
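The dependence of the whole stream on one integer can be shown in a few lines. The sketch below is only an illustration: it uses Python's standard `random` module rather than the R generator the authors worked with, and the seed `123456789` and the helper `draw_stream` are hypothetical placeholders, not values from the paper.

```python
import random


def draw_stream(seed, n=5):
    """Return the first n uniform draws from a generator initialized with `seed`."""
    rng = random.Random(seed)  # private generator, independent of global state
    return [rng.random() for _ in range(n)]


SEED = 123456789  # hypothetical 9-digit seed, NOT the one used in the paper

first_run = draw_stream(SEED)
second_run = draw_stream(SEED)      # same seed: the stream repeats exactly
neighbour = draw_stream(SEED + 1)   # seed off by one: an unrelated stream

print("same seed reproduces the stream:", first_run == second_run)
print("adjacent seed gives the same stream:", first_run == neighbour)
```

Without the integer, the only way to recover a specific stream is to guess it, which is why a missing seed cannot be reconstructed from the scripts alone.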
The authors did share their R scripts, and the supplementary PDF included a table of parameter values. But the seed was not in the table. The value actually used was not in the script comments. It was not in the data management plan. The funding agency, a national science foundation, had required a data management plan but not a code archiving plan, and never checked whether simulation metadata were deposited. When the reanalysis team contacted the lead author, they learned that the seed had been stored in a local text file on a lab laptop that had since been wiped. The only copy was in an email attachment sent to a co-author, who had left academia and no longer had access to the account. The seed was effectively gone.
The cost of this loss is not just the frustration of one reanalysis. It is the erosion of trust in the paper’s findings. Without the seed, any attempt to reproduce the ensemble is necessarily approximate. One can try a range of plausible seeds, but that introduces the very uncertainty the original ensemble was designed to control. The paper’s effect sizes—the magnitude of the seed-dispersal limitation effect—cannot be verified. The research community is left with a claim that may be true, but that cannot be independently confirmed.
Computational Ecology's Reproducibility Gap
The missing seed is part of a larger pattern. A 2022 survey of 500 papers published in Methods in Ecology and Evolution found that fewer than 30% archived the simulation code needed to reproduce their results. Among those that did archive code, fewer than half included the seed value or the random-number generator type and settings. A similar review of 200 papers in Ecological Modelling in 2023 found that only 12% reported the seed. The numbers are not much better in other computational fields: a meta-analysis of reproducibility in computational biology, published in 2024, estimated that roughly 25% of simulation studies provide enough information to fully rerun the analyses.
Why so low? One reason is that seeds are often treated as trivial—a technical detail not worth reporting. Another is that journals rarely require their inclusion. The Methods in Ecology and Evolution survey found that only 5% of journals in ecology had explicit policies mandating code archiving, and even fewer required simulation metadata. Without a policy, authors have little incentive to spend the extra time documenting seeds, generator algorithms, and initialization procedures. The result is a literature in which many computational findings are, in practice, irreproducible.
The consequences can be substantial. In agent-based models, which are common in ecology, the seed can shift confidence intervals by 10–40% depending on the model’s sensitivity to initial conditions. Heavy-tailed distributions—common in ecological data, such as seed dispersal distances or mortality rates—amplify this sensitivity. A single seed that happens to produce an extreme outlier in the ensemble can bias the average. Without archiving, the reader cannot tell whether the published result is robust or an artifact of one particular random stream.
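The effect of heavy tails on seed sensitivity can be sketched with a toy comparison. The example below is an assumption-laden illustration, not an ecological model: it draws 1,000 values per seed from an exponential distribution and from a Pareto distribution with shape 1.2 (both with a theoretical mean of 6), then measures how much the per-seed average moves from one seed to the next. The function names and all parameter values are hypothetical.

```python
import random
import statistics


def light_tailed(rng):
    """Exponential draw with mean 6 (light tail)."""
    return rng.expovariate(1 / 6)


def heavy_tailed(rng):
    """Pareto draw with shape 1.2 and theoretical mean 6 (heavy tail)."""
    return rng.paretovariate(1.2)


def ensemble_mean(seed, draw, n_events=1000):
    """Average of n_events draws from one seeded random stream."""
    rng = random.Random(seed)
    return statistics.fmean(draw(rng) for _ in range(n_events))


def seed_spread(draw, seeds):
    """Relative spread (stdev / mean) of the ensemble mean across seeds."""
    means = [ensemble_mean(s, draw) for s in seeds]
    return statistics.stdev(means) / statistics.fmean(means)


seeds = range(100)
print(f"light-tailed relative spread: {seed_spread(light_tailed, seeds):.3f}")
print(f"heavy-tailed relative spread: {seed_spread(heavy_tailed, seeds):.3f}")
```

In the heavy-tailed case a handful of extreme draws dominate some per-seed averages, so the reported mean depends much more strongly on which stream happened to be used.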
Why Seeds Matter in Agent-Based Models
Agent-based models simulate the behavior of individual entities (trees, animals, cells) and their interactions. In ecology, these models are used to predict population dynamics, species distributions, and ecosystem responses to climate change. They rely heavily on pseudorandom number generators to introduce stochasticity: where a seed falls, whether a predator finds prey, whether a fire ignites. Each random draw is deterministic given the seed, but the sequence is designed to mimic randomness. The seed is the key that fixes the sequence: with the same generator and the same code, the same seed reproduces the same draws.
Consider a forest-gap model, a type of agent-based model used to simulate tree regeneration in tropical forests. The model initializes a grid of patches, each representing a potential tree location. At each time step, a random subset of patches becomes gaps—openings in the canopy where light reaches the forest floor. Seeds from surrounding trees disperse into these gaps, and the probability of establishment depends on gap size, light availability, and competition. The model runs for hundreds of years, and the output is the species composition and diversity at the end of the simulation.
In a 2019 study, researchers ran a forest-gap model 1,000 times with the same parameters but different seeds. They measured the variance in patch occupancy—the proportion of patches occupied by a given species—across runs. The variance due solely to seed choice was as large as the variance due to a 20% change in seed dispersal distance, a key ecological parameter. In other words, the random seed had as much influence on the model’s output as a biologically meaningful parameter. Without archiving the seed, the model’s predictions are not fully specified. A result that appears significant in one run may vanish in another.
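A comparison of this kind can be sketched with a deliberately simplified gap model. The code below is not the model from the 2019 study: the ring-shaped grid, the exponential dispersal kernel, the function `run_gap_model`, and every parameter value are assumptions made for illustration. It contrasts the run-to-run standard deviation of occupancy at a fixed dispersal distance with the shift in mean occupancy after a 20% increase in that distance.

```python
import math
import random
import statistics


def run_gap_model(seed, dispersal_distance, n_patches=200, n_steps=20, gap_prob=0.05):
    """Toy forest on a ring of patches; returns the focal species' final occupancy."""
    rng = random.Random(seed)
    # True where the focal species holds the patch; about half at the start.
    focal = [rng.random() < 0.5 for _ in range(n_patches)]
    for _ in range(n_steps):
        for patch in range(n_patches):
            if rng.random() >= gap_prob:
                continue  # canopy stays closed in this patch
            parent = rng.randrange(n_patches)  # candidate seed source
            d = abs(parent - patch)
            d = min(d, n_patches - d)  # shortest distance around the ring
            # A focal seed arrives only from a focal parent, with a
            # probability that decays with distance.
            arrives = focal[parent] and rng.random() < math.exp(-d / dispersal_distance)
            focal[patch] = arrives  # otherwise a competitor fills the gap
    return sum(focal) / n_patches


seeds = range(50)
base = [run_gap_model(s, dispersal_distance=10.0) for s in seeds]
shifted = [run_gap_model(s, dispersal_distance=12.0) for s in seeds]  # +20%

seed_sd = statistics.stdev(base)
param_shift = abs(statistics.fmean(shifted) - statistics.fmean(base))
print(f"stdev of occupancy across seeds:       {seed_sd:.4f}")
print(f"mean shift from +20% dispersal distance: {param_shift:.4f}")
```

Which of the two numbers is larger depends entirely on the toy parameters chosen here; the point of the sketch is only that both quantities can be measured and reported side by side when the seeds are recorded.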
This is not a theoretical concern. In the 2018 forest-gap paper, the reanalysis team ran the model with 100 different seeds, using the same R scripts and parameter values, and found that the main result—that seed-dispersal limitation drives diversity—held in only about 60% of the runs. In the other 40%, the effect reversed or disappeared. The original paper had reported a strong, consistent effect. But without the original seed, the reanalysis could not determine whether the original result was a statistical fluke or a robust finding. The paper’s conclusion remains in limbo.
The Economics of Code Archiving
If archiving seeds and code is so important, why is it not standard practice? The answer lies partly in economics. Depositing code on a platform like Zenodo or Figshare costs nothing in terms of storage fees, but the time required to prepare a clean, documented repository is real. Lab PIs estimate that it takes 2–5 hours to organize scripts, write a README, remove hard-coded file paths, and add comments. For a postdoc or graduate student on a short-term contract, those hours are often not budgeted. No funding line in most US federal grants explicitly covers archival labor. The National Science Foundation and National Institutes of Health require data management plans, but those plans rarely include code or simulation metadata.
Reviewers and editors also contribute to the problem. Most journals in ecology do not require code archiving as a condition of publication. A 2023 survey of 200 ecology editors found that fewer than 20% said they routinely check whether code is deposited. Reviewers, who volunteer their time, rarely request seeds or random-number generator details. The incentives point away from archival: authors are rewarded for publishing new results, not for documenting old ones. The time spent preparing a repository is time not spent on the next paper.
The result is that code and seeds rot on personal hard drives, lab laptops, and university servers that are decommissioned when students graduate or PIs move institutions. The 2018 forest-gap paper is not unusual. A 2024 study of 100 randomly selected ecology papers with simulation components found that 45% of the authors could not locate the original seed or random-number generator settings when contacted. The seeds were on machines that had been recycled, in email accounts that had been closed, or in file formats that were no longer readable. The digital decay happened within five years of publication.
Infrastructure exists to prevent this. Platforms like Code Ocean and Whole Tale allow researchers to package code, data, and runtime environments into containers that can be rerun years later. These platforms capture the full computational provenance, including the seed, the generator algorithm, and the software versions. But they cost money: roughly $50 per month per project for cloud compute and storage, depending on the size of the simulation. For a lab running multiple projects, that adds up. Compare that to journal page charges, which often run $1,000–3,000 per article. The cost of archiving is small relative to publication fees, but it is not covered by the same funding streams.
A Worked Example: The 2018 Forest-Gap Paper
The 2018 forest-gap paper is worth examining in detail because it illustrates how a single missing seed can destabilize an entire research program. The paper, published in Ecology Letters, used an agent-based model to simulate tree recruitment in a 50-hectare plot in Panama. The model incorporated seed production, dispersal, germination, and seedling survival, all driven by stochastic processes. The authors reported that seed-dispersal limitation—the failure of seeds to reach gaps—explained 70% of the variance in species diversity, a striking result that challenged the prevailing view that competition for light was the dominant driver.
The paper was cited over 200 times in the first five years, and several follow-up studies built on its findings. But when a group at a different university tried to extend the model, they could not reproduce the baseline results. They contacted the lead author, who shared the R scripts but could not provide the seed. The lead author later acknowledged that the seed was “somewhere on a laptop” that had been donated to a university surplus program. The scripts, it turned out, contained a comment that said “set.seed(12345)” but that seed did not produce the published figures. The actual seed had been changed during debugging and never updated in the script.
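A stale comment of this kind can be avoided if the script itself writes down the seed it actually used, so the record cannot drift from the code. The sketch below shows one possible pattern in Python, not the authors' R workflow; the function names, the metadata fields, the file name `run_metadata.json`, and the seed `987654321` are all hypothetical. In R, the corresponding information would be the value passed to `set.seed` together with the output of `RNGkind()`.

```python
import json
import os
import platform
import random
import tempfile


def run_with_recorded_seed(seed, metadata_path, n_draws=3):
    """Run a stand-in simulation and save the seed that was really used."""
    rng = random.Random(seed)
    draws = [rng.random() for _ in range(n_draws)]  # placeholder for the model
    metadata = {
        "seed": seed,
        "generator": "Mersenne Twister (Python random.Random)",
        "python_version": platform.python_version(),
        "n_draws": n_draws,
    }
    with open(metadata_path, "w", encoding="utf-8") as fh:
        json.dump(metadata, fh, indent=2)
    return draws


def rerun_from_metadata(metadata_path):
    """Repeat the run using only what was written to the metadata file."""
    with open(metadata_path, encoding="utf-8") as fh:
        metadata = json.load(fh)
    rng = random.Random(metadata["seed"])
    return [rng.random() for _ in range(metadata["n_draws"])]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "run_metadata.json")
    original = run_with_recorded_seed(987654321, path)  # hypothetical seed
    print("rerun matches original:", original == rerun_from_metadata(path))
```

Because the metadata file is produced by the same call that seeds the generator, changing the seed during debugging automatically changes the record that gets archived with the outputs.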
The reanalysis team then ran the model with 1,000 different seeds, using the same parameters and scripts. They found that the effect of seed-dispersal limitation varied from 30% to 85% of variance explained, depending on the seed. The original result of 70% was near the upper end of this range. The paper’s conclusion, which had been treated as robust, was actually highly sensitive to the random seed. The lead author, in a later email, said that the seed had been chosen because it “looked nice” and that they had not realized it would matter. This is a common attitude: seeds are seen as arbitrary, not as a critical component of the evidence.
The case is now often used in graduate seminars on reproducibility as a cautionary tale. But it has not led to widespread changes in practice. The journal did not issue a correction or a retraction; the paper remains in the literature as if the result were solid. The follow-up studies that relied on the original finding may themselves be compromised. The missing seed has created a small but persistent doubt that cannot be resolved without a time machine.
Infrastructure That Could Fix the Haunt
The technical solutions to the seed problem are straightforward. Containerized environments like Docker or Code Ocean lock the entire software stack (operating system, libraries, scripts, and seed) into a single unit that can be rerun identically years later. Continuous integration services can automatically rerun simulations on each code commit, so that a change that breaks reproducibility is caught when it is made. Platforms like Whole Tale capture the full computational provenance, including every input and output, so that the seed is never lost. These tools exist, but only a small fraction of the research community uses them.
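The kind of check a continuous integration job might run can be sketched as follows. This is a minimal illustration under stated assumptions: `simulate` is a stand-in for a real model, the seed `42` is arbitrary, and the reference digest is computed in the same script, whereas in practice it would be archived alongside the code. The comparison is bit-for-bit, so it presumes a pinned software environment such as a container.

```python
import hashlib
import json
import random


def simulate(seed, n=1000):
    """Stand-in simulation: n Gaussian draws from a seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def output_digest(values):
    """SHA-256 fingerprint of the simulation output."""
    payload = json.dumps(values).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def check_reproducible(seed, reference_digest):
    """Rerun with the archived seed and compare against the archived digest."""
    return output_digest(simulate(seed)) == reference_digest


ARCHIVED_SEED = 42  # hypothetical; would be read from the repository
reference = output_digest(simulate(ARCHIVED_SEED))  # stored at publication time

print("archived seed reproduces output:", check_reproducible(ARCHIVED_SEED, reference))
print("different seed reproduces output:", check_reproducible(ARCHIVED_SEED + 1, reference))
```

A job like this fails as soon as a commit changes the seed, the order of random draws, or the model logic, which is the moment at which the discrepancy is cheapest to explain.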
The barrier is not technical but cultural and economic. Adopting these tools requires upfront investment in learning and setup. For a lab that already has a workflow, the transition can feel like a distraction. The cost of cloud compute and storage for a typical ecology lab running multiple models might be $50–100 per month, a non-trivial expense for a lab with limited discretionary funds. But compare that to the cost of a single irreproducible paper: the wasted effort of failed reanalyses, the lost trust, the opportunity cost of building on a shaky foundation. The economics favor archiving, but the incentives do not.
Some funders are beginning to act. The European Research Council now requires that simulation code be deposited in a recognized repository at the time of publication. The National Science Foundation’s Office of Advanced Cyberinfrastructure has pilot programs that include code archiving in data management plans. But these are exceptions. Most funding agencies still treat code as an afterthought, and most journals still do not enforce existing policies. The result is a patchwork: some labs archive diligently, most do not, and the literature accumulates seeds that may or may not be recoverable.
Graduate curricula are also starting to include reproducibility checklists. A few universities now require students in computational ecology courses to document seeds and random-number generators as part of their assignments. But these programs are rare. The majority of graduate students in ecology receive no formal training in computational reproducibility. They learn from their advisors, who learned from theirs, and the cycle of neglect continues.
Small Changes, Large Returns
Fixing the seed problem does not require a massive overhaul. Small changes in policy and practice could yield large returns. Journals could mandate seed archiving as a condition of publication, and enforce it by requiring that code and metadata be deposited before final acceptance. Funders could require simulation metadata alongside data management plans, and include a line item for archival labor in grant budgets. Graduate programs could add a one-hour module on reproducibility to existing methods courses. Peer reviewers could ask, as a standard question: “Where are the seed and the random-number generator settings recorded?”
These changes would not eliminate all reproducibility problems. Seeds are only one part of the puzzle; software versions, compiler flags, and hardware differences can also affect results. But seeds are the easiest part to fix. They are a single integer. They take up negligible storage. They are trivial to include in a README file. The fact that they are routinely lost is a sign of how little the system values the long-term usability of computational evidence.
The 2018 forest-gap paper is not an isolated incident. Similar stories play out in fields from epidemiology to climate science. A missing seed here, an unrecorded parameter there—each one small, but cumulatively they erode the foundation of computational science. The cost of archiving is modest; the cost of not archiving is measured in lost time, lost trust, and lost knowledge. The next time a researcher sets a seed, they might pause and ask: will anyone be able to find this in five years? The answer, right now, is too often no.