Food Modelling Journal :
FSKX (Food Safety Knowledge)
Corresponding author: Esther M. Sundermann (esther-maria.sundermann@bfr.bund.de), Guido Correia Carreira (guido.correia-carreira@bfr.bund.de), Annemarie Käsbohrer (annemarie.kaesbohrer@bfr.bund.de)
Academic editor: Matthias Filter
Received: 11 Jun 2021 | Accepted: 31 Aug 2021 | Published: 03 Nov 2021
© 2021 Esther M. Sundermann, Guido Correia Carreira, Annemarie Käsbohrer
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Sundermann EM, Correia Carreira G, Käsbohrer A (2021) An FSKX compliant source attribution model for salmonellosis and a look at its major hidden pitfalls. Food Modelling Journal 2: e70008. https://doi.org/10.3897/fmj.2.70008
To reduce the burden that zoonotic diseases place on human society, it is important to attribute human illnesses to their sources. Mathematical modelling is a powerful approach for supporting intervention decisions. This paper presents a source attribution model for salmonellosis which considers four animal sources (broilers, laying hens, pigs, turkeys) and uses two datasets from Germany collected over two time periods: one from 2004 to 2007 and one from 2010 to 2011. The model uses a Bayesian modelling approach derived from the so-called Hald model and is based on microbial subtyping. In this case, Salmonella isolates from humans and animals were subtyped with respect to serovar and phage type. Based on that typing, the model estimates how many human salmonellosis cases can be attributed to each of the considered sources. A reference description of the model is available under DOI: 10.1111/zph.12645. Here, we present this model as a ready-to-use resource in the Food Safety Knowledge Exchange (FSKX) format. This open information exchange format allows the model to be re-used, modified, and further developed, and it uses model metadata and controlled vocabulary to harmonise the annotation. In addition to the model, we discuss some technical pitfalls that might occur when running this Bayesian model based on Markov chain Monte Carlo calculations. As source attribution of zoonotic diseases is a useful tool for the One Health approach, our work facilitates the exchange, adjustment, and re-use of this source attribution model by the international and multi-sectoral community.
Salmonella, R programming language, mathematical modelling, Bayesian model, Markov chain Monte Carlo method, Food Safety Knowledge Exchange (FSKX) format
Zoonotic diseases are a major burden for human society. The burden relates to two categories: 1) human health burden in the form of mortality and morbidity (
Salmonellosis is the second most common zoonotic disease in Europe with a stable number of salmonellosis cases during 2014–2018 (
To reduce the human cases of zoonoses, it is important to understand the relationship of potential sources and human illness (
One modelling approach for source attribution that is based on microbial subtyping is the Bayesian model. In the context of food safety, the models developed by
The two datasets and the mathematical model by
The model metadata are part of the FSKX-file (see Suppl. material
Source: PUBLISHED SCIENTIFIC STUDIES
Identifier: SourceAttributionBfRBayesDB
Rights: Creative Commons Attribution 4.0 (CC BY 4.0)
Availability: Open access
Language: English
Software: FSK-Lab 1.9.0
Language Written In: R 3
Objective: The model attributes human cases of salmonellosis caused by various serovars of Salmonella to various sources (namely, broilers, laying hens, pigs, turkeys, and unknown). The model is parameterized using data from Germany.
Name: Broilers
Description: Tons of broiler meat consumed
Unit: Tons
Origin Country: Germany
Name: Laying hens
Description: Number of eggs consumed
Unit: Number of eggs
Origin Country: Germany
Name: Pigs
Description: Tons of pork consumed
Unit: Tons
Origin Country: Germany
Name: Turkeys
Description: Tons of turkey meat consumed
Unit: Tons
Origin Country: Germany
Name: People in Germany
Target Population: People in Germany that were identified by medical professionals to be suffering from salmonellosis
The model attributes human cases of the zoonotic disease salmonellosis to a certain source (namely, broilers, laying hens, pigs, turkeys, and unknown). It is based on a Bayesian microbial subtyping approach described by
Temporal Information: In
Study Title: The role of parameterization in comparing source attribution models based on microbial subtyping for salmonellosis
Study Description: Two datasets from active monitoring in Germany were available. The data comprise four potential animal sources: broilers, laying hens, pigs, and turkeys. For each considered salmonellosis case, the serotype was determined. For cases caused by Salmonella Enteritidis or Salmonella Typhimurium additionally the phage type was determined (see Tables 2 and 3 in
Datasets covering studies on Salmonella in different sources for two time periods were compiled and used for this analysis. For both time periods, reliable data from active monitoring on four potential animal sources were available: broilers, laying hens, pigs, and turkeys. Cattle were not included in any study or program and were therefore not included in this analysis. The datasets, which cover the years 2004–2007 and 2010/2011, are called baseline data and monitoring data, respectively. In addition, data on human salmonellosis cases were considered.
The first dataset on Salmonella in sources was generated by four baseline studies conducted between 2004 and 2007 in Germany (
The second dataset on Salmonella in sources was compiled from monitoring programs during 2010 and 2011 in Germany (
To summarize, the baseline and the monitoring data are comparable: the data were compiled in a similar way and the intervention measures in place during those years were the same. Thus, no significant difference between the datasets is expected.
Data on human Salmonella cases came from the Robert Koch Institute (RKI). The serotype distribution was obtained via their online database SurvStat@RKI (https://survstat.rki.de/, data access: 07.02.2012). In addition, phage type information for S. Enteritidis and S. Typhimurium strains was provided via personal communication by Wolfgang Rabsch (RKI). Since only a subset of all strains isolated from humans had been phage typed, we assumed that the phage type distribution among the typed strains is also representative of the untyped strains. To account for the four-year time period of the baseline studies (from 2004 to 2007), we summed up all the corresponding sero- and phage types associated with human salmonellosis cases over that time period (see
The presented Bayes data-based (DB) model is a source attribution model that is based on microbial subtyping (
A note about terminology: the terms "subtype" and "type" are used interchangeably.
The so-called Hald model (
\(o_i \sim Poisson \left (\sum\limits_{j=1}^{J} \lambda_{ij} \right) \hspace{2cm} (1)\)
where \(o_{i}\) is the number of observed cases for Salmonella of subtype \(i\). The subtype index runs over \(i=1, 2, ...,I\), where \(I\) is the total number of Salmonella subtypes present in the data. The number of sources in the data, and thus considered in the model, is \(J\). Here, \(\lambda_{ij}\) is the number of expected cases caused by Salmonella subtype \(i\) in source \(j\) (with \(j\) running over \(j=1,2,...,J\)). The Hald model defines \(\lambda_{ij}\) as follows:
\(\lambda_{ij} = M_{j} \hspace{0.5mm}\cdot p_{ij} \hspace{0.5mm}\cdot q_{i} \hspace{0.5mm}\cdot a_{j} \hspace{2cm} (2)\)
where \(M_j\) is the amount of source \(j\) consumed (in tons, except for laying hens, where it is the number of eggs). The value \(p_{ij}\) (in %) is the prevalence of Salmonella subtype \(i\) in source \(j\). The parameter \(q_i\) is a subtype-dependent factor which describes the ability of the Salmonella subtype \(i\) to cause illness. The parameter \(a_j\) is a source-dependent factor describing the ability of source \(j\) to serve as a vector for Salmonella. Equation 2 represents the multiparameter prior of the model with the two parameters \(a_j\) and \(q_i\) of unknown value. For the parameters \(a_j\) and \(q_i\), uniform distributions were defined as prior distributions.
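To make Equations 1 and 2 concrete, the following minimal Python sketch computes \(\lambda_{ij}\), the Poisson mean for each subtype, and the expected cases per source. It is only an illustration: the published model is implemented in R and OpenBUGS, all numbers below are invented and do not come from the baseline or monitoring data, and the prevalence is entered as a fraction for simplicity.

```python
# Illustrative values only; none of these numbers are taken from the paper's data.
M = [100.0, 200.0]        # M_j: amount of source j consumed
p = [[0.10, 0.00],        # p[i][j]: prevalence of subtype i in source j (as a fraction)
     [0.05, 0.20],
     [0.00, 0.02]]
q = [0.5, 1.0, 0.25]      # q_i: subtype-dependent factor
a = [0.1, 0.05]           # a_j: source-dependent factor


def expected_cases(M, p, q, a):
    """Equation 2: lambda_ij = M_j * p_ij * q_i * a_j for every subtype i and source j."""
    return [[M[j] * p[i][j] * q[i] * a[j] for j in range(len(M))]
            for i in range(len(q))]


lam = expected_cases(M, p, q, a)

# Equation 1: the Poisson mean for subtype i is the sum of lambda_ij over all sources j.
poisson_mean = [sum(row) for row in lam]

# Expected cases attributed to source j: the sum of lambda_ij over all subtypes i.
cases_per_source = [sum(lam[i][j] for i in range(len(q))) for j in range(len(M))]

print(poisson_mean)
print(cases_per_source)
```

With these invented inputs, the Poisson means are approximately 0.5, 2.5, and 0.05, and the two sources account for approximately 1.0 and 2.05 expected cases.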
Some authors describe difficulties with the convergence of the Hald model (
\(q_{ut,j} = \frac{o_{ut}}{\sum\limits_{i}o_i}\cdot\frac{1}{p_{ut,j}} \hspace{2cm} (3)\)
This reparameterization can only be done if all serotypes are phage typed. As not all the data of serotypes Enteritidis and Typhimurium considered by
Following the idea of
1. Parameterization of the subtype-dependent parameter \(q_{i}\)
2. Parameterization of the source-dependent parameter \(a_j\)
\(a_{nut}=\frac{\sum\limits_{i}o_i}{M_{nut}} \hspace{2cm} (4)\)
3. Parameterization of the consumption data \(M_j\)
To estimate unknown parameters, uniform distributions are assumed as prior distributions for \(a_j\) and \(q_i\) (see Expressions 5 and 6, respectively). The unknown parameters are: 1) all \(q_i\) which belong to non-unique types, 2) unique \(q_i\) which have not been chosen according to the first step of the parameterization setup, and 3) all \(a_j\) which correspond to sources \(j\) that have at least one unique type. Note that \(M_j\) is always set to a fixed value. Consequently, if there are no unique types, all \(a_j\) are parameterized according to Equation 4 and all \(q_i\) according to Expression 6.
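The fixed values in Equations 3 and 4 are simple ratios, which the following Python sketch evaluates for invented numbers. It assumes that the subscript "ut" denotes a unique type and "nut" a source without any unique type, which is our reading of the notation; the function names and all input values are hypothetical.

```python
def fixed_q_unique_type(o_ut, o_total, p_ut_j):
    """Equation 3: q_{ut,j} = (o_ut / sum_i o_i) * (1 / p_{ut,j})."""
    if p_ut_j <= 0:
        raise ValueError("prevalence of the unique type in its source must be positive")
    return (o_ut / o_total) / p_ut_j


def fixed_a_no_unique_type(o_total, M_nut):
    """Equation 4: a_nut = (sum_i o_i) / M_nut."""
    if M_nut <= 0:
        raise ValueError("consumption amount must be positive")
    return o_total / M_nut


# Invented example data: observed human cases o_i for three subtypes.
observed = [120, 60, 20]
o_total = sum(observed)

# Suppose the third subtype is a unique type with prevalence 0.5 in its only source.
q_fixed = fixed_q_unique_type(observed[2], o_total, p_ut_j=0.5)

# Suppose a source without unique types has a consumption amount of 1000 units.
a_fixed = fixed_a_no_unique_type(o_total, M_nut=1000.0)

print(q_fixed, a_fixed)
```

All remaining \(a_j\) and \(q_i\) would then be estimated from their uniform priors rather than fixed in this way.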
In the model presented in this paper the following prior distributions were assumed:
\(a_{j} \sim uniform(0,0.2) \hspace{1cm} (j=1,2,...,J) \hspace{2cm}(5)\)
\(q_{i} \sim uniform(0,1) \hspace{1cm} (i=1,2,...,I). \hspace{2.5cm} (6)\)
The limits of the prior distributions were chosen such that they produce complete posterior distributions for both datasets (baseline and monitoring data). Depending on the data, one might have to adjust the limits of the distributions (see Section "The effect of prior distributions on completeness of posterior distributions" for details).
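Whether a posterior distribution is complete is judged visually in this work. As a rough numeric companion to that visual check, one could measure how much of the posterior sample lies close to the upper prior limit; a large share suggests that the posterior is cut off by the prior. The sketch below is an assumption of ours, not part of the published model, and the 5% edge width and the sample values are arbitrary.

```python
def mass_near_upper_limit(samples, lower, upper, edge_fraction=0.05):
    """Share of posterior draws that fall in the top `edge_fraction` of the prior range."""
    cutoff = upper - edge_fraction * (upper - lower)
    return sum(1 for s in samples if s >= cutoff) / len(samples)


# Invented posterior draws for a parameter with prior uniform(0, 0.2).
complete_draws = [0.02, 0.03, 0.05, 0.04, 0.06, 0.03, 0.05, 0.04]
cut_off_draws = [0.15, 0.199, 0.195, 0.18, 0.198, 0.2, 0.192, 0.197]

share_complete = mass_near_upper_limit(complete_draws, 0.0, 0.2)
share_cut_off = mass_near_upper_limit(cut_off_draws, 0.0, 0.2)

print(share_complete, share_cut_off)
```

A share near zero is consistent with a complete posterior, whereas a large share is a hint to widen the prior and to inspect the posterior plot again.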
In the next section, we describe how to parameterize the model and run model simulations using FSKX format.
All model parameters and their descriptions are presented in Table
Description of the model parameters of the source attribution model. In the row that specifies the source, article always refers to the reference description of
Id | list_sources |
Classification | INPUT |
Name | list_sources |
Description | List all possible sources |
Unit | [] |
Data Type | INTEGER |
Source | Article |
Value | c('Broilers', 'Laying hens', 'Pigs', 'Turkeys') |
Id | qfix_ind |
Classification | INPUT |
Name | qfix_ind |
Description | Indices of the subtype‐dependent factors for subtype i (qi) which will be set to fixed values. These are the four values for the human cases concerning the "unique types": S. Virchow, S.E. PT 1, S.T. DT 193, and S. Saintpaul |
Unit | [] |
Data Type | VECTOROFNUMBERS |
Source | Data |
Value | c(63,64,65,66) |
Min Value | 1 |
Max Value | Number of considered subtypes |
Id | input_FileName |
Classification | INPUT |
Name | input_FileName |
Description | Name of the file that contains the analysed data |
Unit | [] |
Data Type | STRING |
Source | Article |
Value | "Table2.csv" |
Id | OpenBUGS_parameter |
Classification | INPUT |
Name | OpenBUGS_parameter |
Description | The values that should be logged while running the OpenBUGS-model |
Unit | [] |
Data Type | STRING |
Source | Article |
Value | c("source", "unknown", "a", "q", "lambdaexp") |
Id | OpenBUGS_niter |
Classification | INPUT |
Name | OpenBUGS_niter |
Description | Number of total iterations per chain used in the OpenBUGS-model |
Unit | [] |
Data Type | INTEGER |
Source | Article |
Value | 30000 |
Min Value | OpenBUGS_nburnin+1 |
Id | OpenBUGS_nburnin |
Classification | INPUT |
Name | OpenBUGS_nburnin |
Description | Length of burn in, i.e. number of iterations to discard at the beginning. |
Unit | [] |
Data Type | INTEGER |
Source | Article |
Value | 10000 |
Min Value | 1 |
Id | aValue |
Classification | INPUT |
Name | aValue |
Description | Values for the source-dependent factors (aj) that are used to determine initial values for the OpenBUGS model |
Unit | [] |
Data Type | VECTOROFNUMBERS |
Source | Data |
Value | c(0.002,0.001,0.19, 0.18, 0.178) |
Min Value | 0 |
Id | qValue |
Classification | INPUT |
Name | qValue |
Description | Values for the subtype-dependent factors (qi) that are used to determine initial values for the OpenBUGS model |
Unit | [] |
Data Type | VECTOROFNUMBERS |
Source | Data |
Value | c(0.001,0.002, 0.199, 0.18, 0.175) |
Min Value | 0 |
Id | OpenBUGS_model |
Classification | INPUT |
Name | OpenBUGS_model |
Description | The filename of the txt-file that contains the OpenBUGS-model |
Unit | [] |
Data Type | STRING |
Source | The filename is freely chosen. The BUGS-model is described in the reference article. |
Value | "BugsModel.txt" |
Id | mean_res |
Classification | OUTPUT |
Name | mean_res |
Description | Mean number of estimated human salmonellosis cases attributed to the potential sources |
Unit | Cases |
Data Type | VECTOROFNUMBERS |
Min Value | 0 |
Max Value | 1 |
Id | quantil_95 |
Classification | OUTPUT |
Name | quantil_95 |
Description | 95%-quantile of estimated human salmonellosis cases attributed to the potential sources |
Unit | Cases |
Data Type | VECTOROFNUMBERS |
Min Value | 0 |
Max Value | 1 |
Id | quantil_05 |
Classification | OUTPUT |
Name | quantil_05 |
Description | 5%-quantile of estimated human salmonellosis cases attributed to the potential sources |
Unit | Cases |
Data Type | VECTOROFNUMBERS |
Min Value | 0 |
Max Value | 1 |
The simulation settings for the source attribution model. The settings specify the parameter names and the values (see Table
defaultSimulation | |
list_sources | c('Broilers', 'Laying hens', 'Pigs', 'Turkeys') |
qfix_ind | c(63,64,65,66) |
input_FileName | "Table2.csv" |
OpenBUGS_parameter | c("source", "unknown", "a", "q", "lambdaexp") |
OpenBUGS_niter | 30000 |
OpenBUGS_nburnin | 10000 |
aValue | c(0.002,0.001,0.19, 0.18, 0.178) |
qValue | c(0.001,0.002, 0.199, 0.18, 0.175) |
OpenBUGS_model | "BugsModel.txt" |
SimulationTable3 | |
list_sources | c('Broilers', 'Laying hens', 'Pigs', 'Turkeys') |
qfix_ind | c(30,31,32,33) |
input_FileName | "Table3.csv" |
OpenBUGS_parameter | c("source", "unknown", "a", "q", "lambdaexp") |
OpenBUGS_niter | 30000 |
OpenBUGS_nburnin | 10000 |
aValue | c(0.01, 0.015, 0.099, 0.08,0.02) |
qValue | c(0.001, 0.002, 0.9,0.85, 0.99) |
OpenBUGS_model | "BugsModel.txt" |
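Before running a personalized scenario, it can help to check the settings against the constraints stated in the parameter table (minimum burn-in of 1, more total iterations than burn-in iterations, indices within the number of subtypes, non-negative initial values). The Python sketch below illustrates such a check. The function `check_settings` is hypothetical and not part of FSK-Lab, and the number of subtypes (66) is assumed only for this example.

```python
def check_settings(settings, n_subtypes):
    """Return a list of violated constraints; an empty list means the settings pass."""
    problems = []
    if settings["OpenBUGS_nburnin"] < 1:
        problems.append("OpenBUGS_nburnin must be at least 1")
    if settings["OpenBUGS_niter"] < settings["OpenBUGS_nburnin"] + 1:
        problems.append("OpenBUGS_niter must be at least OpenBUGS_nburnin + 1")
    if any(i < 1 or i > n_subtypes for i in settings["qfix_ind"]):
        problems.append("qfix_ind must lie between 1 and the number of subtypes")
    if any(v < 0 for v in settings["aValue"] + settings["qValue"]):
        problems.append("aValue and qValue must not be negative")
    return problems


# Values of the defaultSimulation scenario, written as Python lists.
default_simulation = {
    "qfix_ind": [63, 64, 65, 66],
    "OpenBUGS_niter": 30000,
    "OpenBUGS_nburnin": 10000,
    "aValue": [0.002, 0.001, 0.19, 0.18, 0.178],
    "qValue": [0.001, 0.002, 0.199, 0.18, 0.175],
}

# n_subtypes=66 is an assumption made for this example only.
ok = check_settings(default_simulation, n_subtypes=66)

# A deliberately broken scenario: fewer total iterations than burn-in iterations.
broken = check_settings({**default_simulation, "OpenBUGS_niter": 5000}, n_subtypes=66)

print(ok)
print(broken)
```

The default scenario passes without complaints, while the broken scenario reports the iteration constraint.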
The Bayes DB model is implemented in the programming language R (
The fskx-model can be executed, developed further, and easily adapted to new data on the local computer, e.g., using the KNIME extension FSK-Lab (see https://foodrisklabs.bfr.bund.de/fsk-lab/ and
In order to execute the model, please register at the virtual research environment "FMJ_Lab".
Execute with default simulation parameters: execute
The default simulation runs for 2 minutes 11 seconds on the virtual research environment.
Execute another simulation scenario or create a personalized scenario: execute
The main result is that the existing source attribution model previously published in
To use the model successfully, it is important to know how to set up and run the model and how to assess the appropriateness of the results. Because this is a purely technical paper, we present these practical issues here in some detail.
When running our Bayesian model using Markov chain Monte Carlo (MCMC) methods, we studied three important aspects of model diagnostics. To ensure a high-quality estimation of unknown parameters, we checked the following aspects of the MCMC method: the convergence behaviour of the Markov chains, the completeness of posterior distributions, and the consistency of results.
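Convergence is assessed visually with trace plots in this work. A common numeric complement, which is not part of the published model, is the Gelman-Rubin potential scale reduction factor. The Python sketch below computes it for a single parameter from several chains of equal length; the chain values are invented, and the threshold of 1.1 is only a widely used rule of thumb.

```python
from statistics import mean, variance


def gelman_rubin(chains):
    """Potential scale reduction factor for one parameter from equally long chains."""
    n = len(chains[0])                               # draws per chain
    chain_means = [mean(c) for c in chains]
    within = mean([variance(c) for c in chains])     # W: mean within-chain variance
    between = n * variance(chain_means)              # B: between-chain variance
    pooled = (n - 1) / n * within + between / n      # pooled variance estimate
    return (pooled / within) ** 0.5


# Invented chains that explore the same region of parameter space.
mixed_chains = [[1.0, 2.0, 3.0, 4.0],
                [2.0, 3.0, 4.0, 1.0],
                [3.0, 4.0, 1.0, 2.0]]

# Invented chains that stay near their different starting points.
stuck_chains = [[0.0, 0.1, 0.0, 0.1],
                [10.0, 10.1, 10.0, 10.1]]

rhat_mixed = gelman_rubin(mixed_chains)
rhat_stuck = gelman_rubin(stuck_chains)

print(rhat_mixed, rhat_stuck)
```

Values close to 1 indicate that the chains agree with each other; values well above 1.1 suggest running more iterations or choosing different starting values.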
The limits for the uniform distribution have a strong influence on the completeness of the posterior distributions. The limits are incorporated into the OpenBUGS code of the model (see file "BugsModel.txt" in the fskx-model). In the Bayes DB model, the lower limit is 0 and the upper limits are 0.2 for \(a_j\) and 1 for \(q_i\) for both datasets (see Section "Bayes data-based (DB) model—a variation of the David model" and Expressions 5 and 6). The upper limits were chosen such that the model provides complete posterior distributions. This was assured by examining visually the plots of the posterior distributions of \(a_j\) and \(q_i\) (
The posterior distributions for the fifth entry in the list of Salmonella subtypes (q5), which is S. enterica serotype Enteritidis PT 21, as a function of the possible values of q5. The shown posterior distributions are calculated by the Gibbs-Sampler software OpenBUGS using the Bayes DB model presented in
The choice of the starting values of the Markov chains (also known as initial values) has an impact on the convergence and uncertainty estimates of the model calculation. The model runs with five Markov chains. The default starting values for the five chains are listed in Table
Starting points for the Markov chains of Parameter scenario 1 and their effects on the convergence behaviour and the model predictions. The starting points are evenly spaced in the lower fifth of the space of possible starting points (see the points in the scatter plot in the upper right corner). With this set of starting points, the Markov chains converge quickly, as can be seen in the four trace plots on the left-hand side, which show how the parameter values estimated by the model change over the iteration steps of the model calculations. Each of the four trace plots corresponds to one model parameter (\(a_1\), \(a_2\), \(q_2\) and \(q_3\), where types 1, 2 and 3 correspond to S. enterica serotype Enteritidis PT 11, PT 14b, and PT 19, respectively). In each trace plot there are five traces, one for each Markov chain. Each Markov chain has its own colour. The predicted source attribution shows small error bars (see the bar plot).
Starting points for the Markov chains of Parameter scenario 2 and their effects on the convergence behaviour and the model predictions. The starting points are concentrated near the points (0, 0) and (0.18, 0.18) (see the points in the scatter plot in the upper right corner). With this set of starting points, the Markov chains converge slowly, as can be seen in the four trace plots on the left-hand side, which show how the parameter values estimated by the model change over the iteration steps of the model calculations. Each of the four trace plots corresponds to one model parameter (\(a_1\), \(a_2\), \(q_2\) and \(q_3\), where types 1, 2 and 3 correspond to S. enterica serotype Enteritidis PT 11, PT 14b, and PT 19, respectively). In each trace plot there are five traces, one for each Markov chain. Each Markov chain has its own colour. The predicted source attribution shows small error bars (see the bar plot).
Starting points for the Markov chains of Parameter scenario 3 and their effects on the convergence behaviour and the model predictions. The starting points are concentrated near the points (0, 0.18) and (0.18, 0.18) (see the points in the scatter plot in the upper right corner). With this set of starting points, the Markov chains do not converge within 30,000 iterations for the parameters \(a_2\) or \(q_3\), as can be seen in the four trace plots on the left-hand side, which show how the parameter values estimated by the model change over the iteration steps of the model calculations. Each of the four trace plots corresponds to one model parameter (\(a_2\), \(a_3\), \(q_2\) and \(q_3\), where types 2 and 3 correspond to S. enterica serotype Enteritidis PT 14b and PT 19, respectively). In each trace plot there are five traces, one for each Markov chain. Each Markov chain has its own colour. The error bars of the predicted source attribution are large (see the bar plot).
In Parameter scenario 1, the starting points are evenly spaced in the lower fifth of the plane of possible starting values (see Fig.
In Parameter scenario 2, the starting points are concentrated near two points: one point is (0, 0), the other (0.18, 0.18) (see Fig.
Finally, in Parameter scenario 3, the starting points cluster near two points: one point is (0, 0.18), the other (0.18, 0.18) (see Fig.
Some authors pointed out that the parameter for consumption data, \(M_j\), is not essential for the approach (
Simplifying the Bayes DB model for the baseline data by setting all \(M_j\) to 1 and keeping the prior distributions as they were defined in Expression 5 and Expression 6 caused problems. OpenBUGS was not able to execute the model successfully due to numerical problems (OpenBUGS reports a "conjugate gamma updater error" for one of the \(q_i\)). This problem disappeared when the prior distributions for \(q_i\) were changed to \(q_i \sim uniform(0,2600000)\), but the model results remained inconsistent. Changing the prior distribution for \(a_j\) to \(a_j \sim uniform(0,30000)\) led to consistent results.
For the monitoring data, setting all \(M_j\) to 1 and using prior distributions as defined in Expressions 5 and 6 led to inconsistent results. The model worked properly when the prior distributions were set to \(a_j \sim uniform(0, 30000)\) while the priors for \(q_i\) remained the same as in Expression 6 (cf. Fig.
Model-data fit when setting \(M_j\)=1 and using different parameterizations. Each point in the figure corresponds to one bacterial subtype. Subfigure A shows a consistent model fit due to using the prior \(a_j \sim uniform(0,30000)\): the logarithm of the number of cases found in the data corresponds well to the logarithm of the predicted number of cases. Subfigure B shows an inconsistent model fit due to using the prior \(a_j \sim uniform(0,20)\). Here, the model systematically underestimates the number of cases for the subtypes, as the points gather well below the identity line.
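The consistency check in this figure is visual. A simple numeric counterpart, offered here only as a sketch and not as part of the published model, is the mean difference between the logarithms of predicted and observed cases: a value near zero corresponds to points scattered around the identity line, while a clearly negative value corresponds to systematic underestimation. The case numbers below are invented.

```python
import math


def mean_log_residual(observed, predicted):
    """Mean of log(predicted) - log(observed) over subtypes with positive counts."""
    pairs = [(o, p) for o, p in zip(observed, predicted) if o > 0 and p > 0]
    return sum(math.log(p) - math.log(o) for o, p in pairs) / len(pairs)


# Invented observed case numbers per subtype.
observed_cases = [100, 50, 10]

# Invented predictions: one set on the identity line, one ten times too small.
consistent_prediction = [100, 50, 10]
low_prediction = [10, 5, 1]

residual_consistent = mean_log_residual(observed_cases, consistent_prediction)
residual_low = mean_log_residual(observed_cases, low_prediction)

print(residual_consistent, residual_low)
```

In the second case the residual equals the negative logarithm of ten, which mirrors points lying one order of magnitude below the identity line.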
One way to interpret the need for enlarging the priors for \(a_j\) and \(q_i\) is that the parameters \(a_j\) and \(q_i\) must compensate for the restrictions applied to \(M_j\). One may consider \(a_j\) and \(q_i\) as complex prior distributions that combine estimates of the potential of the Salmonella of type \(i\) and the source \(j\) to cause salmonellosis. In summary, if \(M_j\) is simplified, the prior distributions may need to be adjusted.
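The compensation argument can be illustrated with Equation 2: if \(M_j\) is replaced by 1, the same \(\lambda_{ij}\) is obtained only when \(a_j\) is multiplied by the original \(M_j\), so a value that fits inside the original prior range may no longer do so. The Python sketch below shows this for invented numbers; the consumption amount and parameter values are hypothetical and not taken from the datasets.

```python
def expected_cases(M_j, p_ij, q_i, a_j):
    """Equation 2 for a single subtype and source."""
    return M_j * p_ij * q_i * a_j


# Invented values for one subtype and source.
M_j = 1000.0      # hypothetical consumption amount
p_ij = 0.05       # prevalence as a fraction
q_i = 0.5
a_j = 0.1         # lies inside the prior range uniform(0, 0.2)

original = expected_cases(M_j, p_ij, q_i, a_j)

# With M_j set to 1, a_j has to absorb the consumption amount to keep lambda unchanged.
a_rescaled = M_j * a_j
simplified = expected_cases(1.0, p_ij, q_i, a_rescaled)

upper_limit = 0.2  # upper limit of the prior in Expression 5
print(original, simplified, a_rescaled > upper_limit)
```

In this example the rescaled value exceeds the upper limit of 0.2 by several orders of magnitude, so the sampler could not reach it unless the prior is widened.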
Source attribution methods aim to identify and quantify the contribution of different sources to disease burdens like salmonellosis (
Bar plot of number of human cases of Salmonella infection attributed to different sources. Subfigure A shows the result for the baseline data 2004–2007 (the so-called defaultSimulation in Table
The presented results make it possible to analyse the share of the burden assignable to each source and to compare different datasets, even though the baseline and the monitoring data are comparable and no significant difference between them is expected (see Section "Data" for details). There is much more to say about the model and its results, but we focus here on the technical aspects of making the model FSKX compliant and on some of the model mechanics. For a more detailed discussion of the model and its results see
Modelling of source attribution is a powerful approach that can contribute to the reduction of human zoonotic cases, in particular salmonellosis. However, model results are highly sensitive to changes of multiple parameters that can differ for each model. In the presented model, these parameters include the initial values for observed Salmonella cases and the assumption about the consumption data. If someone aims to reproduce the model results, this is only possible if the parameter settings are identical to the original settings. In other words, slight changes in a model parameter might result in a big change in the model prediction and thus, the results presented in an article or report cannot be reproduced. The issue of reproducible results is a general challenge in science (
Implementing a model in a standardized and annotated exchange format like the FSKX format is one way to support the long-term usability and understandability of the model. The community as well as the creators would benefit from such an approach. One example where a creator developed a model with an FSKX-compliant end product in mind is the work of
Much time-consuming and/or error-prone work can be saved in the future if model development is done with a mind-set of long-term usability, reproducibility, and understandability. The FSKX format enables sharing model code reliably and reproducibly and thus paves the way for successful collaboration and further development of models.
In this work, we demonstrated that it is straightforward to take a Bayesian source attribution model running under R and OpenBUGS originally published in
We would like to acknowledge the original work of Hannah Jabin and Lars Valentin and furthermore thank the colleagues from the Robert Koch Institute (W. Rabsch) and the German Federal Institute for Risk Assessment (C. Dorn, A. Schroeter, R. Helmuth, A. Friedrich and I. Szabo) for providing the data on Salmonella isolates from humans and animals, respectively.
EMS is funded by the JIP MATRIX within the One Health EJP. One Health EJP has received funding from the European Union Horizon 2020 research and innovation program under grant agreement No 773830. Gathering the data and analyzing it with source attribution models was initially done as a part of the project RESET which was financially supported by the German Federal Ministry of Education and Research (BMBF) through the German Aerospace Center grant number 01Kl1013A‐H.
Esther M. Sundermann: Conceptualization, Data Curation, Project administration, Software, Visualization, Writing - Original Draft, Writing - Review & Editing. Guido Correia Carreira: Conceptualization, Formal analysis, Writing - Original Draft, Visualization, Writing - Review & Editing. Annemarie Käsbohrer: Data Curation, Writing - Review & Editing. The author contributions are taken from https://www.elsevier.com/authors/policies-and-guidelines/credit-author-statement