Measuring Cancer Prevalence in Europe: the EUROPREVAL Project

Cancer prevalence is the proportion of individuals in a population who at some stage during their lifetime have been diagnosed with cancer, irrespective of the date of diagnosis. Cancer prevalence statistics have generally been provided by a limited number of well-established cancer registries that have been in existence for several decades. The advent of systematic follow-up of the life status of incident cases and the availability of new statistical methodologies now make it possible for registries established during the 1970s or 1980s to provide prevalence data. The main problems encountered in the estimation of prevalence are: (i) the inclusion of cases lost to follow-up; (ii) the inclusion of cases known only from their death certificate; (iii) the inclusion of cases diagnosed before the start of registration; and (iv) the treatment of multiple tumours and migrations. The main aim of this paper was to review these problems and discuss, through the experience gained with EUROPREVAL, how they can be overcome. A method is presented for the calculation of prevalence of all cancers combined in the populations covered by the 45 cancer registries participating in EUROPREVAL. Prevalence of cancer is estimated to be 2% on average, with the highest values (3%) in Sweden and the lowest in Eastern Europe, with a minimum of approximately 1% in Poland.


Introduction
Cancer prevalence is the proportion of individuals in a population diagnosed with cancer during their lives, irrespective of the date of diagnosis. This definition assumes that cancer is an irreversible disease and that diagnosed individuals remain cancer cases until death. Such people make greater demands on the health system than the general population. They require treatment, follow-up for cancer recurrence, and screening for independent second cancers, and may be permanently impaired or disabled as a result of their cancer. However, prevalent cancer cases are a highly heterogeneous group in terms of health status, as they include patients undergoing clinical treatment and those diagnosed many years previously who may be considered cured of their cancer and require few if any additional health care resources. Time from diagnosis is therefore an essential qualifier of cancer prevalence data.
Unlike cancer incidence or mortality, prevalence has not been a major focus of epidemiological statistics; nevertheless, several methods have been developed to provide estimates of prevalence, mainly as a by-product of other activity. For example, health surveys of samples of the general population [1] can provide prevalence estimates from persons reporting that they have been diagnosed with cancer. However, health surveys are expensive, require very large samples to obtain data on rarer cancers, and are prone to bias, as the compliance of seriously ill persons is expected to differ from that of healthy people. Direct methods [2][3][4] employ the incidence and follow-up data collected by population-based cancer registries (CRs). In essence, they calculate prevalence by counting how many incident cases are still alive at a given index date; however, cancer registration must have continued long enough for essentially all surviving cancer cases to be registered, otherwise the prevalence estimate will be too low. Indirect methods estimate prevalence by modelling the mathematical relationships between incidence, prevalence, survival and mortality. Depending on what data are available, incidence and survival [5], incidence and mortality [6], or mortality and survival [7] data can be used to obtain prevalence functions. Mortality data are usually obtained from national statistics, while incidence and survival data are provided by CRs.
Cancer prevalence data are not systematically available at the level of national populations. Some prevalence data pertaining to a few US states and some European countries that have been covered by cancer registration for many decades have been published [2][3][4][8][9][10][11][12][13]. EUROPREVAL is the first Europe-wide project to estimate the prevalence of the most important cancers in the participating European countries. The first objective of the project was to use the direct method to calculate prevalence data for European populations covered by cancer registration. Data were available from the EUROCARE-2 Study [14], which collected data on cancer patients diagnosed between 1978 and 1992 from 56 participating population-based CRs in 17 European countries. To ensure that the cancer prevalence estimates from the numerous participating CRs were comparable, it was necessary to solve some important methodological problems pertaining to data registration, consistency and comparability. The aim of this paper is to review these problems and discuss, through the experience gained with EUROPREVAL, how they can be overcome. We then apply these solutions to the basic prevalence calculations for all cancers combined in the populations covered by the 45 cancer registries participating in EUROPREVAL. The results of the study are presented more extensively in a companion paper appearing in this same issue [15].

Problems arising from the direct method of calculating prevalence
The direct method of calculating cancer prevalence in a population covered by cancer registration is simply to count all incident cases of cancer that are still alive at a given date (the index date). The procedure basically consists of allocating all incident cases still alive at the index date to cells in a two-dimensional matrix according to their age at the index date and the number of years since the cancer was diagnosed. When divided by the corresponding population sizes, these counts provide prevalence figures as proportions that are specific for age and disease duration. However, the figures thus produced are likely to be incomplete or inaccurate due to loss of cases during follow-up, cases known from death certificate only (DCO), problems arising from the treatment of migration and multiple cancers, and lack of completeness due to surviving cases diagnosed before the registry came into existence. These problems are discussed below.
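The counting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (birth, diagnosis and death dates), the function names, the example records and the convention that a death on the index date is not counted as prevalent are all hypothetical and are not taken from the EUROPREVAL software [25].

```python
from collections import Counter
from datetime import date


def whole_years(start, end):
    """Completed years between two dates."""
    years = end.year - start.year
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return years


def direct_prevalence_counts(cases, index_date):
    """Allocate cases alive at index_date to (age, years since diagnosis) cells.

    cases: iterable of (birth_date, diagnosis_date, death_date or None).
    """
    cells = Counter()
    for birth, diagnosis, death in cases:
        if diagnosis > index_date:
            continue  # diagnosed after the index date: not yet prevalent
        if death is not None and death <= index_date:
            continue  # assumed convention: a death on the index date is not prevalent
        age = whole_years(birth, index_date)
        duration = whole_years(diagnosis, index_date)
        cells[(age, duration)] += 1
    return cells


# Hypothetical records, for illustration only.
index_date = date(1992, 12, 31)
cases = [
    (date(1930, 5, 10), date(1985, 3, 1), None),
    (date(1940, 1, 15), date(1990, 6, 30), date(1991, 2, 1)),   # died before index date
    (date(1950, 12, 31), date(1992, 12, 31), None),
    (date(1925, 7, 4), date(1980, 11, 20), date(1993, 3, 15)),  # died after index date
]
cells = direct_prevalence_counts(cases, index_date)
total_alive = sum(cells.values())
```

Dividing each cell by the population count of the corresponding age class would then give age- and duration-specific prevalence proportions.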

Cases lost to follow-up
Some cases are inevitably lost to follow-up, and their vital status at the index date is therefore uncertain. For most European registries, the percentage of cases lost is under 1% [16]. Higher percentages are the norm in US registries [13], because follow-up procedures differ and there is a relatively high rate of state-to-state migration. Cases lost to follow-up are considered in the analysis by attributing to them an estimated survival probability. The approach of Feldman et al. [4] was to derive this probability from a single life table calculated for the whole set of patients not lost to follow-up, irrespective of age and period of diagnosis. A more recently published approach [12] estimates the survival probability of each lost case from the subset of followed patients belonging to the same age group and period of diagnosis. The Feldman et al. approach gives more stable estimates; however, it can be erroneous for patients considerably younger or older than the average. The second approach [12] in theory provides more accurate probability estimates, but is subject to random variability when the total number of patients in the age group considered is small. Whichever method is used, the number of lost cases estimated alive at the index date is added to the number of prevalent cases determined by the direct method.
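The trade-off between the two approaches can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the survival probability of a lost case is taken as the plain proportion of followed cases alive at the index date (a stand-in for a proper life table), and the names `expected_lost_alive` and `min_stratum_size`, as well as the example strata, are hypothetical. Strata with too few followed cases fall back to the pooled proportion, in the spirit of the single-table approach.

```python
from collections import defaultdict


def expected_lost_alive(followed, lost, min_stratum_size=1):
    """Expected number of lost cases alive at the index date.

    followed: list of (stratum, alive_at_index) for cases with known vital status.
    lost: list of strata, one entry per case lost to follow-up.
    A stratum is any hashable key, e.g. (age class, period of diagnosis).
    """
    alive = defaultdict(int)
    total = defaultdict(int)
    for stratum, is_alive in followed:
        total[stratum] += 1
        alive[stratum] += int(is_alive)
    # Pooled proportion alive, irrespective of stratum (single-table idea).
    pooled = sum(alive.values()) / sum(total.values())
    threshold = max(1, min_stratum_size)  # avoid dividing by an empty stratum
    expected = 0.0
    for stratum in lost:
        if total[stratum] >= threshold:
            expected += alive[stratum] / total[stratum]  # stratum-specific estimate
        else:
            expected += pooled  # too few followed cases: use the pooled estimate
    return expected


# Hypothetical data, for illustration only.
old = ("60-69", "1985-89")
young = ("40-49", "1985-89")
followed = [(old, True), (old, True), (old, False), (old, False),
            (young, True), (young, True), (young, True), (young, False)]
lost = [old, young, ("80+", "1985-89")]  # the last stratum has no followed cases
n_lost_alive = expected_lost_alive(followed, lost)
```

In this example the three lost cases contribute 0.5, 0.75 and 0.625 expected survivors respectively, and their sum is added to the direct count.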

Cases known from death certificate only (DCO)
Some cases are notified to registries only when they die and the death certificate reports cancer as the underlying cause. How and whether DCO cases should be included in prevalence counts are unresolved questions. It can be argued that such cancers are diagnosed very close to death, and that the patients were not actually treated as cancer patients and therefore contribute to the prevalent population for a negligible time. On the other hand, some patients are DCO not because they were first diagnosed at death but because an earlier diagnosis failed to reach the CR, and these should be included in the prevalence count; however, this is not a simple task. The problem is to estimate the number of cancer cases not observed by the registry who actually had a cancer diagnosis at the prevalence date; these cases will be registered as DCO after this date. Specific studies of trends and survival times of DCO cases will have to be carried out to be able to estimate the numbers of such cases. In any event, if they are not included, the proportion of DCO to registered cases should be reported to provide an indication of the extent to which prevalence may have been underestimated. This we have done in presenting the EUROPREVAL data.

Migration
People moving away after cancer diagnosis and registration are usually followed even if they move to a different health area, and are therefore included in prevalence data although they no longer make demands on the health area of diagnosis. Conversely, patients who move into another health area after diagnosis are not counted in the prevalence of that area even though they are treated there. In such situations the prevalence of a region or health area is underestimated when the net flow is into that area, and is overestimated in the opposite case. However, the resulting error is small in most European countries, where the net migration rate is usually between -1% and +1% per year. Prevalence estimates should not therefore be substantially affected by migration, unless there is a net migration of cancer patients from one area to another. To our knowledge, this phenomenon has not been considered in any prevalence analysis, mainly because there is no systematic information on migration for health reasons. Migration was not considered in presenting the data from EUROPREVAL.

Multiple cancers
Prevalence can refer to the number of people with cancer or the number of cancers in the population. The difference lies in the way multiple primary malignant tumours are accounted for. Person prevalence considers only the first primary malignant cancer diagnosed in each person, and is a measure of the number of people actually making demands on health care resources for cancer. On the other hand, patients with two or more tumours are counted several times in tumour prevalence, which considers all primary malignant cancers in a person irrespective of whether they are the first or subsequent cancers. If the multiple cancers are treated independently, this second indicator is more pertinent to the demand for health care. The difference between the two indicators may be substantial, particularly in the oldest age groups. Inclusion of multiple tumours in comparative studies is complicated by lack of uniformity among CRs in the application of cancer coding rules, particularly for tumours in paired organs [17]. Furthermore, the numbers of multiple tumours registered depend on the age of a CR: the older the registry, the greater the likelihood of registering previous diagnoses in multiple cancer patients. Recently established registries may know from clinical records that a given tumour is not the first primary, but may not have the resources or procedures necessary to access full information on previously diagnosed cancers. The EUROPREVAL project considered person prevalence only.
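The distinction between the two indicators can be made concrete with a short Python sketch for all cancers combined. The record layout, the function name and the example data are hypothetical and serve only to illustrate the two ways of counting.

```python
def person_and_tumour_prevalence(tumours, alive_ids):
    """Return (person prevalence, tumour prevalence) as counts.

    tumours: list of (person_id, site), one entry per primary malignant tumour
             diagnosed on or before the index date.
    alive_ids: set of person ids alive at the index date.
    """
    prevalent = [(pid, site) for pid, site in tumours if pid in alive_ids]
    tumour_prevalence = len(prevalent)                       # every primary tumour counts
    person_prevalence = len({pid for pid, _ in prevalent})   # each person counts once
    return person_prevalence, tumour_prevalence


# Hypothetical data, for illustration only.
tumours = [
    (1, "colon"), (1, "lung"),        # person 1 has two primaries
    (2, "breast"),
    (3, "prostate"), (3, "bladder"),  # person 3 has two primaries
    (4, "stomach"),                   # person 4 died before the index date
]
alive_ids = {1, 2, 3}
persons, tumour_count = person_and_tumour_prevalence(tumours, alive_ids)
```

Here three persons carry five prevalent tumours, so tumour prevalence exceeds person prevalence by two thirds; in real data the gap depends on how completely earlier primaries were registered.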

Completeness bias
Even when corrected to include DCO and lost-to-follow-up cases, prevalence measured on populations covered by CRs is still incomplete, as prevalent cases diagnosed before the registry began operating will not be recorded. Such unobserved cases are far from negligible, especially for recently established registries (less than 15 years) and for cancer sites with good prognoses [18]. It is vital, therefore, that these unobserved cases are estimated and included in the prevalence data. It is also essential, in a Europe-wide study involving numerous registries operating for variable lengths of time (from 40 or more years to only about five years), that a uniform and unbiased way of dealing with completeness is used, so as to provide comparable prevalence estimates for populations covered by cancer registration for different lengths of time. We define as the observed prevalence the figure produced by the counting method described above, to which various (small) corrections have been applied to take account of cases lost to follow-up. We then apply to the observed prevalence an appropriate correction factor, called the completeness index [18], which accounts for the non-registered cases still alive. The figure thus produced is defined as the total prevalence. The completeness index will vary according to the length of the registration period and the characteristics of the cancer being considered. Completeness indexes were estimated for the Connecticut Cancer Registry, and the total prevalence figures thus calculated were compared with observed prevalence, which, since the registry has been operating for more than 50 years, should have been almost complete; concordances were found to be satisfactory [19]. The same method was applied to Italian prevalence data [20] and to EUROPREVAL data to improve European estimates of prevalence.

Estimation of total prevalence
In a population covered by cancer registration for L years, the total prevalence ($N_{tot}$) is given by the sum of the observed prevalence ($N_{obs}$), that is, the patients diagnosed after the start of registry activity, plus the unknown unobserved prevalence ($N_{unobs}$) of patients diagnosed before that date. $N_{tot}$ is not directly measurable but can be estimated indirectly by dividing $N_{obs}$ by a completeness index R [18], which depends on L, the length of time the registry has been operating:
$$N_{tot} = \frac{N_{obs}}{R(L)}$$
R is an estimate of the extent to which the observed prevalence represents the total prevalence, and is defined as:
$$R(L) = \frac{N_{obs}^{(m)}(L)}{N_{tot}^{(m)}}$$
where $N_{obs}^{(m)}$ and $N_{tot}^{(m)}$ are model-based estimates of observed and total prevalence respectively, and are derived from parametric models of age-specific cancer incidence probability and relative survival probability [18]. The completeness index R takes the value one when all prevalent cases are observed, and approaches zero as the proportion of prevalent cases that are observed decreases. The value of $N_{obs}^{(m)}$ depends on the registration period (L), cancer site, sex, and age class: all the factors that influence incidence and survival in the models. Simple log-linear models can be used as incidence functions whose major determinants are age at diagnosis and date of birth. These models are consistent, for a general class of cancers, with the multistage theory of carcinogenesis [21]. As shown by Capocaccia and De Angelis [18], R is not influenced by absolute incidence levels, but only by the age slope of incidence. R is larger for cancers whose incidence rises steeply with age, e.g. prostate cancer, and is lower for cancers like cervical cancer whose incidence is largely independent of age.
Survival models with cure can be used as relative survival functions. This class of models assumes that only a portion of patients, the so-called fatal cases, have an excess death risk, while the remainder have the same mortality rate as the general population (not affected by the specific cancer) and can thus be regarded as cured [22]. Cure models allow estimation of long-term survival, which must be estimated accurately as it plays a crucial role in estimating prevalence. Survival has a direct influence on R: cancers with poor survival are characterised by high R values, as only recently diagnosed patients are likely to be alive at the prevalence date. By contrast, a high proportion of patients with good-prognosis cancers who were diagnosed before a young registry started operating will still be alive at the index date, so R will be low.
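A mixture cure model of this kind can be sketched as follows. The text specifies only that a fraction of patients is cured and that the fatal cases carry an excess death risk; the Weibull form chosen here for the fatal cases, and the parameter names and values, are assumptions made purely for illustration.

```python
import math


def cure_relative_survival(t, cure_fraction, rate, shape=1.0):
    """Relative survival at t years under a mixture cure model.

    cure_fraction: proportion of patients with no excess mortality.
    rate, shape: Weibull parameters for the fatal cases (assumed form).
    """
    fatal_survival = math.exp(-((rate * t) ** shape))
    return cure_fraction + (1.0 - cure_fraction) * fatal_survival


# Hypothetical parameters: 40% cured, fatal cases with rate 0.3 per year.
rs_at_diagnosis = cure_relative_survival(0.0, 0.4, 0.3)
rs_at_5_years = cure_relative_survival(5.0, 0.4, 0.3)
rs_long_term = cure_relative_survival(1000.0, 0.4, 0.3)
```

Relative survival starts at one and flattens out at the cure fraction, which is why the long-term plateau matters so much for prevalence: a high plateau means many patients diagnosed long ago are still alive.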
The completeness index R can likewise be used to estimate the partial prevalence for a period longer than the observation period, for instance the 15-year prevalence in a population observed for only 10 years. This method is useful for decomposing estimated total prevalence by duration of disease.
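As a worked illustration, the completeness correction amounts to a single division. The sketch below also shows one possible reading of the partial-prevalence extension, in which the observed prevalence is rescaled by the ratio of the completeness indexes for the target and observed durations; that reading, the function names and all numbers are assumptions for illustration and not values from the study.

```python
def total_prevalence(n_obs, r):
    """Estimate total prevalence from observed prevalence and completeness index R."""
    if not 0.0 < r <= 1.0:
        raise ValueError("completeness index must lie in (0, 1]")
    return n_obs / r


def partial_prevalence(n_obs, r_observed, r_target):
    """Rescale prevalence observed over L years to a different duration.

    Assumed reading: prevalence scales with the ratio of the model-based
    completeness indexes R(target duration) / R(observed duration).
    """
    return n_obs * r_target / r_observed


# Hypothetical registry observed for 10 years.
n_obs_10 = 4500
r_10, r_15 = 0.60, 0.75
n_tot = total_prevalence(n_obs_10, r_10)            # about 7500 cases
n_unobs = n_tot - n_obs_10                          # about 3000 cases never registered
n_15 = partial_prevalence(n_obs_10, r_10, r_15)     # about 5625 cases
```

With R = 0.60, two fifths of the total prevalence has to be estimated rather than counted, which shows how strongly the result depends on the modelled index for young registries.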

Standard errors of prevalence estimates
In cases where all prevalent cases are observed and followed from diagnosis to the index date, a simple Poisson distribution can be used to derive the standard error (SE) of the number of cases [5]:
$$SE(N_T) = \sqrt{N_T}$$
The standard error of the prevalence as a proportion is obtained by dividing $\sqrt{N_T}$ by the population count. However, total prevalence is a composite estimator made up of (i) a direct count $N_O$ of cases observed and followed, (ii) an estimated number ($N_{lost}$) of lost-to-follow-up cases surviving until the index date, and (iii) an estimated factor 1/R accounting for cases diagnosed before registration began. While the first term is Poisson distributed, the second term comes from an estimated life table, itself derived from the cases actually followed, and the last term is based on statistical models applied to the same or, in some cases, to an independent dataset. A formal theory of prevalence estimator sampling errors that takes all these sources of variability and their interrelations into account is being developed but is not yet available [23,24]. A rough approximation to the SE of the total prevalence can be calculated by assuming that the proportion of lost cases ($P_{lost} = N_{lost} / N_O$) and the completeness index R are without error:
$$SE(N_T) \approx \frac{\sqrt{N_O}\,(1 + P_{lost})}{R}$$
However, it is apparent that SEs calculated with this expression systematically underestimate the real variability of the prevalence figures, as potentially important sources of uncertainty are neglected, and cannot be used for formal statistical testing.
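The rough approximation can be written out numerically. The sketch below assumes, as the text does, that only the direct count is random (Poisson) while the proportion of lost cases and the completeness index are treated as fixed constants, so the SE of the count is simply scaled by (1 + P_lost)/R; the function name and the numbers are hypothetical.

```python
import math


def approx_se_total_prevalence(n_followed_alive, n_lost_alive, r):
    """Approximate SE of total prevalence, treating P_lost and R as error-free.

    n_followed_alive: direct count of followed cases alive at the index date.
    n_lost_alive: estimated number of lost cases alive at the index date.
    r: completeness index in (0, 1].
    """
    p_lost = n_lost_alive / n_followed_alive
    return math.sqrt(n_followed_alive) * (1.0 + p_lost) / r


# Hypothetical registry.
n_o, n_lost, r = 10_000, 200, 0.8
n_total = (n_o + n_lost) / r                       # estimated total prevalence
se_total = approx_se_total_prevalence(n_o, n_lost, r)
population = 1_000_000
se_per_100000 = se_total / population * 100_000    # SE of the prevalence proportion
```

Because the uncertainty in the lost-case estimate and in R is ignored, a value obtained this way is best read as a lower bound on the true variability.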

Prevalence of all cancers combined from European registry data
In the EUROPREVAL Project, prevalence estimates for the most important cancer sites were produced for 17 European countries. The results of the study are extensively presented in this same issue [15]. Here we report the methodological choices adopted in the study and present prevalence results in Europe for all cancers combined (Tables 1 and 2).
The EUROPREVAL project used incidence and follow-up data provided by the EUROCARE project [14], which collected data from 56 participating population-based CRs in 17 European countries. The EUROCARE-2 database contains data on cancer patients diagnosed between 1978 and 1992, the minimum information for each patient being sex, date of birth, date of diagnosis, date of end of follow-up, tumour site, morphology, and life status. To calculate consistent prevalence figures at a given date, incidence and follow-up data must be complete at that date. The most recent date for which these data are available for all registries participating in EUROCARE is 31 December 1992, and this was taken as the index date for prevalence computation.
A certain fraction of the records (1-2% in most European registries) has wrong or missing values for some patient variables. Often the month of birth or month of diagnosis is missing, but in a few cases both the month and year are missing, and in others the sex is not specified. The exclusion of such cases from the calculations leads to underestimation of the prevalence. Automatic procedures to correct data incompatibilities or to impute missing values were therefore used whenever possible.
We used the direct approach to estimate the prevalence of all cancers in European registries at the index date of 31 December 1992. Specifically developed software [25] was used for the calculation of prevalence. The basic data are shown in Table 1 along with the main steps in the calculation, so as to illustrate the problems discussed above. Data from most participating CRs were included; data from specialized registries (concerned with digestive system or haematological cancers, etc.) were not included. For each CR, the following are reported in the table: (a) the number of years of cancer registration, prior to the index date, available from the EUROCARE-2 database; (b) the number of cancer cases collected during the period and included in the analysis; (c) the number of cases alive at the index date; (d) the number of lost cases; (e) the number of lost cases estimated alive at the index date; (f) the observed prevalent cases [= (c)+(e)] up to (a) years after diagnosis; (g) the completeness index; (h) the total prevalent cases [= (f)/(g)]; (i) the population count, in hundred thousands; (j) the total prevalence per 100,000 [= (h)/(i)]; and (k) the average yearly DCO cases as a percentage of the total prevalence.
The incidence period considered ranged from 23 years (1970-1992) in Iceland, Saarland and Geneva, to five years (1988-1992) in several southern European registries. Some registries, mainly in northern Europe where cancer registration started during the 1950s, also provided a complete set of data covering their entire registration period. These data were used to check the estimates of the R index.
Cases lost to follow-up were considered in the analysis by assuming they had the same survival probability as not-lost cases of the same age at diagnosis and number of years passed since diagnosis. For each calendar period considered, the number of lost cases estimated alive was added to the number of prevalent cases. The contribution of lost cases estimated alive to the observed prevalence was highest in Somme, where they represent about 15% of the observed prevalent cases, followed by Warsaw (10%) and East Anglia (9%). Percentages ranging from 2% to 6% were observed in three registries: Geneva (a registry with good follow-up procedures, but with problems due to patient migration), Torino, and Cracow. In all the other registries, lost cases contributed less than 2% to the observed prevalence.
As noted, migration was not considered, as its influence on prevalence can be assumed to be negligible. Only first primary tumours were included; patients with multiple tumours were considered as one prevalent case. DCO cases were not included, but formed a small proportion of the total prevalence (always below 3% and in most cases less than 1%). Lack of completeness generally has a major influence on the prevalence estimates. Thus, the completeness index was around 90% only in registries with more than 20 years of follow-up, and fell to below 50% in registries with only five years of follow-up. In the latter cases half or more of the total prevalence had to be estimated.
The average estimated prevalence in European countries was 2,042 per 100,000, i.e. about two out of one hundred European citizens have a previous diagnosis of cancer. Sweden presents the highest estimated prevalence (3,046). High levels were also estimated in Germany (2,777), Italy, and Switzerland (with more than 2,500). Low prevalence was estimated in Eastern European countries, with Poland (1,169) at the lowest level.
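The arithmetic behind columns (f), (h) and (j) of Table 1 can be sketched as a small function. The input values below are hypothetical and are not taken from Table 1; only the relations between the columns follow the definitions given in the text.

```python
def table1_row(alive_c, lost_alive_e, completeness_g, population_i):
    """Derive Table 1 columns (f), (h) and (j) from columns (c), (e), (g) and (i).

    alive_c: cases alive at the index date (c).
    lost_alive_e: lost cases estimated alive at the index date (e).
    completeness_g: completeness index (g), as a proportion.
    population_i: population count in hundred thousands (i).
    """
    observed_f = alive_c + lost_alive_e        # (f) = (c) + (e)
    total_h = observed_f / completeness_g      # (h) = (f) / (g)
    per_100000_j = total_h / population_i      # (j) = (h) / (i)
    return observed_f, total_h, per_100000_j


# Hypothetical young registry covering one million people.
f, h, j = table1_row(alive_c=9800, lost_alive_e=200, completeness_g=0.5, population_i=10)
```

With a completeness index of 0.5, as in registries with about five years of data, half of the 20,000 estimated prevalent cases in this example come from the model rather than from the count.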
In Table 2, registries have been grouped by country, and prevalence proportions have been decomposed by time from diagnosis. About 20% (from 19% to 25% according to country) of total prevalent cases had been diagnosed less than 2 years before the index date, while about 40% (35% to 46%) and 60% (53% to 66%) had been diagnosed less than 5 years and less than 10 years before, respectively. The geographical variability of these proportions is lower than that of the absolute levels of prevalence, at least for all cancers combined. It is related to the country-specific distribution of incidence by cancer site and to survival levels, rather than to the length of follow-up.

Validation of total prevalence estimates
As statistical modelling is incorporated in total prevalence estimates through the use of the completeness index, validations against empirically based estimates were carried out in the EUROPREVAL study. To check the values of the completeness indexes provided by the statistical models, we used the data of the long-established CRs whose observation periods are long enough to have registered virtually all prevalent cases. We calculated the 15-year prevalence, ignoring cases registered prior to that, and then corrected these data using the corresponding completeness index R. The results, the estimated total prevalence, were then compared with the observed total prevalence. This method was also used to check prevalences estimated in SEER registries up to 1993, using data collected from 1940 to 1993 by the Connecticut cancer registry [19].
The results of the validation analysis are shown in Table 3. Total prevalence estimates for the ten cancer sites included in the study were analysed separately. For the Finnish and Danish registries the estimated total prevalence was slightly below the observed total prevalence. This is probably because the prevalence figures from these registries refer to cancers rather than persons with cancer. By contrast, for the registries of Eindhoven, Estonia and Saarland (with observation periods of 20 to 25 years) estimated total prevalence figures were more frequently above the observed total prevalence. Differences between estimated and observed values were generally less than ±10%; the only exceptions were cervical cancer and Hodgkin's disease. For Hodgkin's disease the estimated figures greatly exceeded the observed figures. For cervical cancer, the estimated total prevalence was lower than the observed prevalence in Finland and Denmark, but exceeded it by more than 10% in all other registries except Iceland.
These mismatches can be attributed to marked changes in the epidemiology of these two cancers. For Hodgkin's disease the use of new, more effective therapies became widespread; for cervical cancer, screening became widespread. Our modelling approach was based on EUROCARE-2 data for the incidence period 1978-1989, and did not take account of these developments. Thus the excess of estimated prevalent cases of Hodgkin's disease can be explained by model-based overestimation of backward projections of survival; the lower prevalence estimates for cervical cancer in Finland and Denmark are the result of the high incidence levels in the pre-screening period that were not considered in the model.
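The validation check described in this section can be sketched as follows: the 15-year prevalence from a long-established registry is inflated by the completeness index and compared with the prevalence actually observed over the full registration period. The function name, the ±10% tolerance parameter and the numbers are hypothetical and do not reproduce Table 3.

```python
def validate_completeness(n_obs_15, r_15, n_observed_total, tolerance=0.10):
    """Compare model-corrected total prevalence with observed total prevalence.

    Returns (estimated total, relative difference, within tolerance?).
    """
    estimated_total = n_obs_15 / r_15
    relative_diff = (estimated_total - n_observed_total) / n_observed_total
    return estimated_total, relative_diff, abs(relative_diff) <= tolerance


# Hypothetical cancer site in a long-established registry.
est, diff, ok = validate_completeness(n_obs_15=8000, r_15=0.8, n_observed_total=9600)

# Hypothetical site where backward projection overestimates survival.
est_bad, diff_bad, ok_bad = validate_completeness(n_obs_15=8000, r_15=0.8,
                                                  n_observed_total=8000)
```

In the first case the estimate exceeds the observed figure by about 4%, inside the ±10% band; in the second the 25% excess would flag the site for closer inspection, much as Hodgkin's disease and cervical cancer were flagged in the study.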

Conclusions
As the population ages, the fraction most at risk of developing cancer grows, while advances in cancer treatment are resulting in increasing proportions of cancer patients living longer. Thus the demand for social services, and especially health care, by this sector of the population is growing, particularly in developed countries. Cancer prevalence is a vital indicator, as it is a measure of the number of cancer patients who require health and social services resources, and it can be used to plan the future allocation of such resources adequately. Providing and updating reliable and systematic prevalence statistics, obtained using uniform and validated methodologies such as we now have for cancer incidence, survival, and mortality, is therefore important for all European countries. The breakdown of cancer prevalence figures according to time since diagnosis is an important first step towards the development of specific indicators of health care needs for specific sections of the population. A subsequent step will be to classify prevalent cases by disease stage at the index date. This will be even more informative for planning the allocation of health resources, as it will make it possible to identify four groups of patients: recently diagnosed patients who are receiving primary treatment; those who can be considered cured of their cancer; those in the terminal phase of their illness; and the remainder with intermediate status, also referred to as "continuing-phase" [26]. The groups thus identified are much more homogeneous in terms of predictable health needs than subgroups simply defined by time since diagnosis. Our approach to the estimation of the stage distribution of prevalence will be described in forthcoming papers from EUROPREVAL.

Table 1
Calculation of prevalence for all cancers combined in European cancer registry areas. National registries are in capital letters. The columns show: (a) the number of years of cancer registration before the index date of 31 December 1992; (b) the number of cancer cases collected during the period and included in the analysis; (c) the number of cases alive at the index date; (d) the number of cases lost; (e) the number of lost cases estimated alive at the index date; (f) observed prevalent cases up to (a) years after diagnosis [= c+e]; (g) the proportion of the total prevalence observed by the registry; (h) total prevalent cases [= f/g]; (i) the population, in hundred thousands; (j) the total prevalence per 100,000 [= h/i]; (k) the average DCO cases per year as a percentage of total prevalence.

Table 2
Total prevalence decomposition by duration of disease for all cancers combined in European countries at the common index date of 31 December 1992. Prevalence proportions per 100,000 of population within 2, 5 and 10 years since diagnosis.

Table 3
Comparison of the estimated total prevalence (calculated using the completeness index at 15 years) with the observed prevalence for registries established for more than 20 years. Prevalence values (number of prevalent cases) refer to the index date of 31 December 1992.