
A method for assessing risk of disclosure in census microdata and tabular data
O. Duke-Williams, S. Openshaw, P. Rees, A. Evans
Centre for Computational Geography, University of Leeds, UK
April 15, 1998
Abstract

The area of confidentiality in personal databases is one which is currently of considerable interest. This paper concentrates on current research relating to measuring the risk of disclosure of information about individuals in census data, which the authors believe is directly applicable to other data sources, such as employer-employee databases. Various methods have been used in the past by census offices in different countries to ensure that requirements of confidentiality are maintained. These include both legal safeguards (typically the user of census data has to sign a binding agreement that they will not misuse the data) and the use of various mechanisms to anonymise and modify data prior to release. The most common methods used are data aggregation, data perturbation and data suppression. There has been little quantitative research to investigate the extent to which any of these methods achieves its stated aim, or the extent to which they change the data. The measurement of confidentiality risks is crucial to the comparison of different protection methods. Quantitative measures of risk of disclosure would allow researchers to comment critically on the various protection methods associated with current practice, and based on those assessments to give advice about the degree to which data should be modified in order to ensure confidentiality. This paper describes research carried out using a new method developed to assess the risk of disclosure in both microdata and tabular data. The work is related to the design of output areas for the reporting of census results. This requires a measure of risk which can be assessed for specific contexts such as local areas, rather than for an entire dataset. This localisation of risk assessment allows datasets to be produced in which there is an efficient balance between security and utility of the data. Where risk of disclosure is found to be unacceptable, a proposed area can be rejected, and by a process of redesign and reassessment the area can be modified until risk has been reduced to an acceptable level; where the risk of disclosure is found to be low, the proposed area can be accepted. The alternative (assessing risk for an entire dataset globally) will generally lead to a situation where areas of high risk are protected at the expense of unnecessary modification or aggregation of data in other areas, leading to a reduction in the overall utility of the data.

1 Introduction
Statistical confidentiality in databases containing personal or commercially sensitive data is an area in which there is considerable interest, and the ever increasing power of computers has significant implications for data providers today. Developments in hardware and software technology have led to a situation where there is both greater flexibility in the outputs that data providers could produce, and a recognition of this flexibility by data users who desire flexible outputs tailored to their own requirements and interests. However, this increase in computing power also means that so-called data attackers will have more sophisticated and powerful methods at their disposal. There is thus an increased perception of confidentiality risks in sensitive data.

A variety of methods have been and are used to ensure confidentiality of data by statistical agencies around the world. These methods include data aggregation, data suppression, data perturbation and sampling, the latter also being employed in some cases in order to reduce costs. Typically, more than one of these methods will be used: data might be aggregated via micro-aggregation, global recoding and/or tabulation, with some additional technique such as perturbation or suppression used to deal with cases that are still considered to be problematic after the initial aggregation. Elliot et al. (1998) have suggested a terminology of `special uniques' to identify records which have a particularly unusual combination of characteristics (such as persons aged 16 who have a marital status of `widowed'), and which are likely to remain unique in all but the most coarse (and thus least meaningful) recoding systems. It is these special uniques which lead to the argument that additional protection of the data is required.

A fundamental methodological problem is that there is no explicit statistical measure of the confidentiality risks inherent in the release of tabular census data. Confidentiality risks are generally presented either as a function of the risks present in the microdata used to produce that tabular data, or by using some indirect measure of risk such as the proportion of cells below some critical value. This paper proposes a measure of risk which is based on both the microdata used to create a set of tables and the structures of the tables themselves, which serve as a protection mechanism by reducing the number of relationships between variables. The method is broadly described in the next section of this paper, and this is followed by a fuller example. The paper then presents results from a number of experiments conducted and considers possible modifications to and extensions of the method.


2 Measuring risks in Census data
The methods used to measure the risk of disclosure in tabular data are based on modified versions of algorithms used to assess risk in a set of microdata from the number of uniques in the data. This section starts by outlining a method for computing a measure of risk from microdata, and then goes on to describe how the method has been adapted to consider tabular data.

2.1 Measuring risks in microdata

The methods employed are based on methods traditionally used when assessing risk in a set of microdata. Let X_ij be a data profile for person i on variable j. There are N persons in a database (probably a subset of a larger database, constrained by geography or some other variable) and M variables available as match keys (variables that are available from some alternative data source). These M variables would not be the full set of data in the database, as that would mean that there would be no point in carrying out a record matching procedure. Disclosure is considered to be a risk if the set of variables M can be unambiguously matched between data sources, such that additional variables from the microdata can be discovered for individuals listed in the match data set. The importance of such a match depends on the context of the data. In some cases, it may be possible to discover commercially sensitive data from the microdata, whereas in other cases the data revealed may be innocuous, but the mere fact that it is possible to disclose information is considered a breach of promises that data providers have given with respect to confidentiality. Let D_ik be a match indicator which represents whether the ith record is the same as the kth record, where i is not equal to k. Hence
$$D_{ik} = \begin{cases} 0 & \text{if } \sum_{j=1}^{M} |X_{ij} - X_{kj}| = 0 \\ 1 & \text{otherwise} \end{cases} \qquad (1)$$

The number of times that the data profile of person k does not match any of the other N records is therefore

$$\sum_{i \ne k}^{N} D_{ik} \qquad (2)$$

Hence person k is unique in the database of N records, given the match key of M variables, if

$$\sum_{i \ne k}^{N} D_{ik} = N - 1 \qquad (3)$$

and therefore there is a risk for this person (or whatever entity the database contains) that further data might be disclosed if the record is released. This assumes that the worst-case risk of disclosure for record k is some function of how unique the data profile is, with the relative uniqueness being given by

$$\frac{\sum_{i \ne k}^{N} D_{ik}}{N - 1} \qquad (4)$$

with a maximum value of 1.0 indicating a completely unique profile. If Equation 3 is true, then the risk of disclosure is at a maximum because record k is unique. If Equation 2 is equal to 0 then there is the special case that all records are identical; for any reasonable value of N this case can probably be considered asymptotic. Equation 4 can be used to measure the frequency of unique record data profiles in the set of N records by simply counting the number of times that its value is 1. More formally, we can define a Kronecker delta δ_k such that

$$\delta_k = \begin{cases} 1 & \text{if } \sum_{i \ne k}^{N} D_{ik} = N - 1 \\ 0 & \text{otherwise} \end{cases}$$

Hence the frequency of uniques is

$$\sum_{k=1}^{N} \delta_k$$

and the average risk of disclosure for all N persons is simply

$$\frac{1}{N} \sum_{k=1}^{N} \delta_k \qquad (5)$$

For Equation 5 to be small requires the unique count to be small, or N to become increasingly large. As the number of uniques gets larger, so N has to increase, or coding changes have to be made to the M variables in the match key, in order to reduce the number of uniques. Note that there are almost N^2 comparisons to be made in the computation of Equation 5, and this could cause problems for large values of N. A further problem is that there is no clear threshold which Equation 5 has to meet. Nevertheless, this type of risk assessment is appropriate for a variety of microdata; see Marsh et al. (1991) and Skinner et al. (1990, 1994).
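As an illustration only, the following Python sketch shows one way Equations 1 to 5 could be computed for a small set of microdata. The function names and the example match-key records are assumptions made for this sketch; they are not part of the original implementation.

def pairwise_mismatch(x_i, x_k):
    # D_ik (Equation 1): 1 if the two profiles differ on any match-key variable
    return 1 if any(a != b for a, b in zip(x_i, x_k)) else 0

def disclosure_risk(records):
    # Average risk (Equation 5): the proportion of records that are unique on
    # the match key.  Note the roughly N^2 pairwise comparisons noted above.
    n = len(records)
    uniques = 0
    for k, x_k in enumerate(records):
        mismatches = sum(pairwise_mismatch(records[i], x_k)
                         for i in range(n) if i != k)
        if mismatches == n - 1:      # Equation 3: record k matches no other record
            uniques += 1             # delta_k = 1
    return uniques / n

# Illustrative match key of (sex, age group, ethnic group)
sample = [(1, 6, 1), (2, 9, 4), (1, 3, 1), (2, 12, 3), (1, 3, 1)]
print(disclosure_risk(sample))       # 0.6: three of the five profiles are unique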

2.2 Measuring risks in tabular data

Once personal data are aggregated by geographical region or some other variable, some modification is needed to this measure of record uniqueness. As data are aggregated, the individual counts coded in great detail are replaced by frequency counts, either of individual values of categorical data or of values of continuous data that have been coded into a number of classes, with loss of linkage between individual variables. Some of the micro information is preserved by cross-tabulation of two or more variables, but overall there is an immense loss in the multivariate dimensionality of the data. At the microdata level, a set of 20 variables for each record represents co-ordinates in 20-dimensional space, but as these are aggregated into a number of tables, this collapses to a number of two- or three-dimensional views of the original information. Despite this large degree of data generalisation, the same fears of disclosing information about identifiable individuals or organisations are present.

There are two fundamentally different situations which may exist with respect to aggregate data: either the data may be expressed as derived summary statistics, or the data may be expressed as raw counts. In the first case, there may well be no confidentiality risk, depending on the nature of the statistics reported. Statistics derived for aggregate data from several variables (such as chi-square values, or a classification based on some multivariate clustering procedure) which cannot be decomposed to reveal the constituent counts can be considered as a form of encryption which cannot be reversed by a data attacker. At best, one can only derive estimates of the original data which cannot sensibly be used in comparison with external match variables. However, many users need or desire raw counts, and thus the second situation must be considered. If tables are published that contain raw counts derived directly from input microdata, then we need to generalise Equations 1 to 5 such that they can be applied to aggregate data, with the same worst-case scenario assumptions being applied.

Despite the loss in overall dimensionality of data when it is tabulated, the publication of a set of tables intended to fulfil most users' requirements will tend to produce a dramatic expansion in the number of `variables', by providing a wide range of cross-tabulated counts. In the case of the UK Census, the 24 questions asked expand to about 20,000 cross-tabulated variables for wards (average population around 5,020 persons) and about 10,000 variables for Enumeration Districts, the smallest areas for which tables are published (average population of 366 persons). It is clear that the confidentiality risk involved in releasing 10,000 counts for a small area must be greater than if only 100 counts were to be released, and equally the size of area considered `safe' for 100 counts would presumably be smaller than the size considered safe for 10,000 counts. There is almost certainly some kind of disclosure risk trade-off between the degree of geographical aggregation and the coding detail of the variables being aggregated. The problem is how to identify this trade-off when no disclosure risk function has been formally defined.

Equations 1 to 5 can be modified to handle aggregated data as follows. The same equations are used; all that changes is the nature of the X_ij variables. The previous microdata X_ij is used to create a new variable Z_ij which still relates to individual i, but the M original variables have been expanded into a vector containing a count for each cross-tabulated variable, given the set of tables to be produced from the data. Thus in the UK case, the 24 original variables would be expanded into a vector of some 10,000 zero-or-one counts. This must be achieved via some software transformation which takes into account the structures and data recodings used for all output tables. Each long vector is composed of a number of sections relating to each table or sub-table in the output, and each section will consist of a single cell set to 1, with the other cells being 0. There will be considerable redundancy in this vector, and an efficient implementation may well represent the vector in an alternative form, but for the sake of explanation the concept of a long vector of zero-one values will be retained.

A new aggregated dataset Y_kj is defined, which is an aggregation of the full set of Z_ij vectors for a particular geographical area k. Note that here k is being used to refer to an area, whereas previously it was used to refer to individuals. However, it is useful to retain the same subscript to emphasise the parallels between the microdata and aggregate data situations. The full set of tabular data for any geographical area k is calculated as the sum of the N_k individual microdata records assigned to that area, viz.
$$Y_{kj} = \sum_{i=1}^{N_k} Z_{ij} \qquad (6)$$

where Y_kj is the jth count for area k and Z_ij is the data for the ith individual of the N_k records assigned to the area. Y_kj is thus the sum of all individual records for the area, and contains all the values that would be contained within the published set of tables for the area; the tables can be produced by mapping the vector onto the output table frameworks. The confidentiality risk of finding any individual living in the area k at the time the data was collected now depends on there being a perfect match between Y_kj and a single unique case Z_ij. Equations 4 and 5 can be re-used here, based on revised D_ik values that use Z and Y, and with the matching restricted to the cells that take the value 1 in Z. Equation 1 is now re-defined so that
$$D_{ik} = \begin{cases} 1 & \text{if } \sum_{j \in \Omega} |Z_{ij} - Y_{kj}| = 0 \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$

where Ω defines the set of j values for which Z_ij has the value 1; there are M such j values, one for each output table. Complete confidentiality is assured only if the value of D_ik is 0 for all N_k individuals in the area.

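To make the mechanics of Equations 6 and 7 concrete, a minimal Python sketch is given below. It assumes hypothetical table definitions (pairs of variables with their category lists) and illustrative records; it is not the software used for the experiments reported later.

def z_vector(record, tables):
    # Z_i: concatenated 0-1 vector, one section per output (sub)table
    vec = []
    for (var_a, cats_a), (var_b, cats_b) in tables:
        section = [0] * (len(cats_a) * len(cats_b))
        # column-then-row numbering: the cells of a row are consecutive
        pos = cats_a.index(record[var_a]) * len(cats_b) + cats_b.index(record[var_b])
        section[pos] = 1
        vec.extend(section)
    return vec

def totals_vector(records, tables):
    # Y_k (Equation 6): element-wise sum of the Z vectors for one area
    zs = [z_vector(r, tables) for r in records]
    return [sum(col) for col in zip(*zs)], zs

def at_risk(z, y):
    # D_ik (Equation 7): 1 only if every cell this record falls in has a count of 1
    return 1 if all(y[j] == 1 for j, v in enumerate(z) if v == 1) else 0

# Two illustrative linked tables: age group by sex, and sex by tenure
tables = [(("age", ["0-15", "16-64", "65+"]), ("sex", ["m", "f"])),
          (("sex", ["m", "f"]), ("tenure", ["owned", "rented"]))]
area = [{"age": "16-64", "sex": "m", "tenure": "owned"},
        {"age": "16-64", "sex": "m", "tenure": "owned"},
        {"age": "65+",   "sex": "f", "tenure": "rented"}]
y, zs = totals_vector(area, tables)
print([at_risk(z, y) for z in zs])   # [0, 0, 1]: only the third record is disclosive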

Person  Sex  Age  Ethnic group
1       1    27   1
2       2    40   4
3       1    11   1
4       2    59   1
5       1    52   1
6       1    38   2
7       2    5    2
8       1    13   1
9       2    68   1
10      2    57   3

Table 1: The sample set of individual data

3 Worked example
In this section a small worked example will be described. The example uses a small set of data which includes a number of variables which might typically be included in census microdata. The data is shown in Table 1. It is a set of 10 unique individuals, with three attributes: sex, age in years, and ethnic group, which has 4 categories. Clearly, this is a much simpler and smaller dataset than would be used in practice, but it can usefully be employed to help explain the methodology. This data can be used with either the microdata method or the tabular data method described above. Both workings are shown for the sake of completeness.

3.1 Calculation of risk for microdata

Table 2 shows the calculations that lead to Equation 5. As all the records in this data set are unique, the value of Equation 5 is 1.0, indicating that all records are at risk.

3.2 Calculation of risk for tabular data

Typically data of this type might be subject to some recodings, and Table 3 shows the data with some alternative codings for the detailed Age variable. `Age1' is a 5-year age group coding, starting with 0-4, and `Age2' is a lifestage grouping, with classes 0-14, 15-29, 30-64 and 65+ (for the sake of simplicity it is assumed that retirement age is common for men and women). The ethnic group variable is assumed to already have been broadly coded. The sex variable can obviously not be recoded, although it might be replaced with a `persons' variable in extreme cases. Clearly, as there are only a few variables in this sample data, there are only a limited number of cross-tabulations which could be constructed. In this example, it is assumed that three tables will be built from the data:

1. Age (5 year) by sex
2. Age (life stage) by ethnic group
3. Sex by ethnic group

                             Person k
Person i         1  2  3  4  5  6  7  8  9  10
    1            -  1  1  1  1  1  1  1  1  1
    2            1  -  1  1  1  1  1  1  1  1
    3            1  1  -  1  1  1  1  1  1  1
    4            1  1  1  -  1  1  1  1  1  1
    5            1  1  1  1  -  1  1  1  1  1
    6            1  1  1  1  1  -  1  1  1  1
    7            1  1  1  1  1  1  -  1  1  1
    8            1  1  1  1  1  1  1  -  1  1
    9            1  1  1  1  1  1  1  1  -  1
    10           1  1  1  1  1  1  1  1  1  -

Sum of D_ik (i not equal to k)    9  9  9  9  9  9  9  9  9  9
Sum of D_ik / (N - 1)             1  1  1  1  1  1  1  1  1  1
delta_k                           1  1  1  1  1  1  1  1  1  1

Table 2: Calculations for Equation 5 in microdata

Person  Sex  Age  Age1  Age2  Ethnic group
1       1    27   6     2     1
2       2    40   9     3     4
3       1    11   3     1     1
4       2    59   12    3     1
5       1    52   11    3     1
6       1    38   8     3     2
7       2    5    2     1     2
8       1    13   3     1     1
9       2    68   14    4     1
10      2    57   12    3     3

Table 3: The sample data with recodings


Age     Male  Female
0-4     0     0
5-9     0     1
10-14   2     0
15-19   0     0
20-24   0     0
25-29   1     0
30-34   0     0
35-39   1     0
40-44   0     1
45-49   0     0
50-54   1     0
55-59   0     2
60-64   0     0
65-69   0     1
70-74   0     0
75-79   0     0
80-84   0     0
85-89   0     0
90+     0     0

Table 4: Age by sex table for the sample data

Age     White  Black  Asian  Chinese
0-14    2      1      0      0
15-29   1      0      0      0
30-64   2      1      1      0
65+     1      0      0      1

Table 5: Age by ethnic group table for the sample data

Sex     White  Black  Asian  Chinese
Male    4      1      0      0
Female  2      1      1      1

Table 6: Sex by ethnic group table for the sample data

It is worth noting that each table contains a variable that is also used in another table, and so this is a simple example of linked tables. The cross-tabulations for the sample dataset for these tables are shown in Tables 4 to 6. The sample data can be converted into a series of 0-1 vectors in which each position represents a cell in one of the tables produced, with the value being set to 1 if the individual has that particular combination of values, or 0 otherwise. In this example, the vectors will have a total length of 62 cells, comprising 38 from the first table (2 sexes by 19 age groups), 16 from the second table and 8 from the third. The table cells are assumed to be numbered in a column-then-row order, so that in the first table, cell 1 is the combination Age = 0-4 and Sex = Male, cell 2 is Age = 0-4 and Sex = Female, cell 3 is Age = 5-9 and Sex = Male, and so on. The first individual in the data is a white male aged 27, and thus serves to increment the following cells:

Cell 11 in the first table (Age = 25-29 and Sex = Male), forming this portion of the vector:
00000000001000000000000000000000000000

Cell 5 in the second table (Age = 15-29 and Ethnic Group = White), forming this portion of the vector:
0000100000000000

Cell 1 in the third table (Sex = Male and Ethnic Group = White), forming this portion of the vector:
10000000

When these portions are put together, we get the full person vector:

i= 1: 00000000001000000000000000000000000000 0000100000000000 10000000

with the other individuals in the data providing the vectors:

i= 2: 00000000000000000100000000000000000000 0000000000010000 00000001
i= 3: 00001000000000000000000000000000000000 0000000010000000 10000000
i= 4: 00000000000000000000000100000000000000 0000000010000000 00001000
i= 5: 00000000000000000000100000000000000000 0000000010000000 10000000
i= 6: 00000000000000100000000000000000000000 0000000001000000 00100000
i= 7: 00010000000000000000000000000000000000 0100000000000000 00000100
i= 8: 00001000000000000000000000000000000000 1000000000000000 10000000
i= 9: 00000000000000000000000000010000000000 0000000000001000 00001000
i=10: 00000000000000000000000100000000000000 0000000000100000 00000010

In order to make the vectors easier to read, spaces are used to separate out the boundaries of the input portions. The ten individual portions can be added together to create the totals vector Ykj

Total 00012000001000100100100200010000000000 1100100031111000 41002111


The task now is to compare each of the individual data profiles with the aggregate record. Records are considered to be at risk if every value of 1 in an individual Z_ij vector matches a value of 1 in the total Y_kj. If the value in any cell of the aggregate record is greater than 1, then it indicates that the corresponding cell in a published table would contain a count greater than 1, and thus it would not be possible to infer anything conclusively about a particular individual. Each individual record will contain a fixed number of 1s, depending on the number of tables or sub-tables defined. If each 1 in the individual record matches a 1 in the aggregate record, then that individual has a unique presence within the output tables for the match key of M variables, and is at risk of being disclosed. Using the sample data, we can compare the first record with the aggregate record:

          *                                    *                *
00000000001000000000000000000000000000 0000100000000000 10000000
00012000001000100100100200010000000000 1100100031111000 41002111

The asterisk symbols indicate the locations of the cell increments in the individual record. It can be seen that in the final section of the vector there is not a match between the individual record and the aggregate record; the record is therefore not at risk of disclosing further information, and D_ik (see Equation 7) for this record is 0. If this is repeated for the remaining individuals we find:

 i   Vectors                                                               D_ik
 2   00000000000000000100000000000000000000 0000000000010000 00000001     1
     00012000001000100100100200010000000000 1100100031111000 41002111
 3   00001000000000000000000000000000000000 0000000010000000 10000000     0
     00012000001000100100100200010000000000 1100100031111000 41002111
 4   00000000000000000000000100000000000000 0000000010000000 00001000     0
     00012000001000100100100200010000000000 1100100031111000 41002111
 5   00000000000000000000100000000000000000 0000000010000000 10000000     0
     00012000001000100100100200010000000000 1100100031111000 41002111
 6   00000000000000100000000000000000000000 0000000001000000 00100000     1
     00012000001000100100100200010000000000 1100100031111000 41002111
 7   00010000000000000000000000000000000000 0100000000000000 00000100     1
     00012000001000100100100200010000000000 1100100031111000 41002111
 8   00001000000000000000000000000000000000 1000000000000000 10000000     0
     00012000001000100100100200010000000000 1100100031111000 41002111
 9   00000000000000000000000000010000000000 0000000000001000 00001000     0
     00012000001000100100100200010000000000 1100100031111000 41002111
10   00000000000000000000000100000000000000 0000000000100000 00000010     0
     00012000001000100100100200010000000000 1100100031111000 41002111
The risk of disclosure given by Equation 5 is

$$\frac{\sum_{k=1}^{N} \delta_k}{N} = \frac{3}{10} = 0.3$$
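Readers wishing to check the arithmetic can reproduce the matching step directly from the vectors listed above. The short Python sketch below copies those vectors verbatim and applies the Equation 7 test; the comparison code itself is illustrative.

individuals = [
    "00000000001000000000000000000000000000 0000100000000000 10000000",  # i=1
    "00000000000000000100000000000000000000 0000000000010000 00000001",  # i=2
    "00001000000000000000000000000000000000 0000000010000000 10000000",  # i=3
    "00000000000000000000000100000000000000 0000000010000000 00001000",  # i=4
    "00000000000000000000100000000000000000 0000000010000000 10000000",  # i=5
    "00000000000000100000000000000000000000 0000000001000000 00100000",  # i=6
    "00010000000000000000000000000000000000 0100000000000000 00000100",  # i=7
    "00001000000000000000000000000000000000 1000000000000000 10000000",  # i=8
    "00000000000000000000000000010000000000 0000000000001000 00001000",  # i=9
    "00000000000000000000000100000000000000 0000000000100000 00000010",  # i=10
]
zs = [[int(c) for c in v.replace(" ", "")] for v in individuals]
totals = [sum(col) for col in zip(*zs)]                 # the Y_kj vector
d = [int(all(totals[j] == 1 for j, v in enumerate(z) if v)) for z in zs]
print(d)                 # [0, 1, 0, 0, 0, 1, 1, 0, 0, 0]: persons 2, 6 and 7 at risk
print(sum(d) / len(d))   # 0.3, the disclosure risk given by Equation 5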

Thus the processes of aggregation and broad coding have served to reduce the risk of disclosure considerably. The effects of recoding could be removed by constructing tables which only used the original input variables. Note that the method efficiently takes into account the effect of linked tables, by considering the input of any person into all the output tables, given all the coding schemes. When considering the risk of disclosure using a method such as the one described here, the possibility of false matches also needs to be considered; a false match is defined here as a record which generates a match against a given set of key variables, but does not generate an unambiguous match with respect to other variables. An example is shown by individual 7 in the sample data. If it is supposed that we have a match key of age and sex (i.e. corresponding to the first table), then individual 7 provides an apparent match, as is shown by the single entry in the cell `Age 5-9; Female' in Table 4.

4 Experiments conducted
A number of experiments have been conducted with an implementation of the algorithm described above. These obviously require access to some original set of microdata. Two different sets have been used. The first is a synthetic data set created through a replication of the UK Sample of Anonymised Records (SAR) for Yorkshire and Humberside (Duke-Williams and Rees, 1998). This data set has a considerable degree of realism in terms of its spatial distribution, but lacks the heterogeneity that characterises the real world, and thus may give a misleading picture of confidentiality risks. The second data set used was an anonymised set of Italian microdata records for the Arezzo area in Tuscany. This has greater realism in terms of demographic heterogeneity, although it creates some additional problems in that the set of variables is different from that in the SAR data set. Both data sets were modified to have a common format and set of base variables and codings, with missing variables being assigned via pro rata assignment, `educated guesswork' or, if necessary, random assignment.

4.1 Experiments with synthetic data

The synthetic data set consisted of records for some 5 million individuals, with a variety of geographies superimposed to generate aggregations for areas of a variety of sizes. The microdata was mapped to various table codings using a number of the tables that were published as part of the Small Area Statistics of the 1991 UK Census. A description of all the recoding schemes used is given by Williamson (1993). A subset of tables was chosen which excluded some hard-to-code categories (namely variables involving persons in communal establishments) and tables which had special cases for Scottish data. From this subset a shortlist of eight tables was selected which could be replicated with reasonable completeness from the available data; these were the SAS tables S02, S07, S08, S11, S12, S14, S34 and S38. These tables were then used in various combinations as match keys.

Two sets of test runs were completed. In the first set, all the tables were used as a match key for a variety of geographies. The purpose of this set of experiments was to investigate the number of records which were found to be unique across all tables. None were found. This clearly contrasts with the situation that would occur given the release of microdata, where the number of uniques increases rapidly as the number of variables in the match key increases.

The second set of runs used two tables as a key in an attempt to locate identifiable records, i.e. persons who were unique in the tabular output, represented by a cell with a value of 1 in each table. The possibility of finding people in the other six tables was then examined, with no constraint about linkage of variables in the tables. That is to say, if a person was found to be uniquely identifiable in the match tables containing variables a, b, c and d, and the individual was also identifiable in an additional table which used the variables e and f, then it would be considered a match, despite the fact that one could not in practice prove any connection between the cell entries. Real matches occur when people fall into cells with only one entry in both key tables and are uniquely identifiable in the other six tables; these are regarded as a disclosure risk. If a match occurs in the two key tables but the individual is not unique in the other six tables, then that is regarded as a false match (a code sketch of this classification is given at the end of this subsection).

Figure 1 shows the results for the synthetic data for Enumeration Districts in Yorkshire and Humberside. The higher rows of points are the false disclosure probabilities, and the lower are the true disclosure probabilities. There are no realistic risks of disclosure except where the areas are very small. Figures 2 and 3 show the results for the data aggregated into 1 km grid squares; Figure 3 shows in detail the results for areas with populations of 500 or fewer. This is more interesting: there is a non-zero risk of disclosure for small areas, but it is swamped by false matches. Figure 4 shows the results for the data aggregated into a grid with variable-sized cells, such that there are small cells in densely populated areas and large cells in rural areas, with all cells having a minimum of 32 households. Again, there is a high false match rate (the upper line of dots) and here a zero real match rate, suggesting that a minimum size threshold does in fact work in this instance for this geography.
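One plausible reading of the real/false match classification used in these runs is sketched below in Python. It assumes each Z vector has been split into a `key' portion (the two match tables) and a `remainder' portion (the other six tables), and it treats a record as a real match only if it is unique in every one of the remaining tables; this is an illustrative interpretation of the procedure described above, not the implementation actually used.

def classify(zs_key, zs_other):
    # zs_key[i]  : 0-1 vector of person i over the two match-key tables
    # zs_other[i]: 0-1 vector of person i over the remaining tables
    y_key = [sum(col) for col in zip(*zs_key)]
    y_other = [sum(col) for col in zip(*zs_other)]
    real = false = 0
    for zk, zo in zip(zs_key, zs_other):
        if not all(y_key[j] == 1 for j, v in enumerate(zk) if v):
            continue                      # not identifiable from the key tables
        if all(y_other[j] == 1 for j, v in enumerate(zo) if v):
            real += 1                     # unique in the key tables and elsewhere
        else:
            false += 1                    # apparent match only, masked elsewhere
    n = len(zs_key)
    return real / n, false / n            # true and false disclosure probabilities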

4.2 Experiments with Italian data

The set of experiments described above for the synthetic data was carried out on the Italian data. Figure 5 shows the results of the second set of runs. As before, there is a small risk of disclosure for small areas, but this is swamped by false matches; Figure 6 shows an enlargement of the left-hand side of the graph. This figure shows that once the ED has a population greater than about 5, the false matches outweigh the true matches. In real-world cases, the populations of output areas would be considerably greater than 5, and the false matches would serve to mask the true matches.


5 Future modifications
There are a number of modifications which could be considered to extend and improve the method. An obvious requirement would be the ability to use a greater number of tables than the eight used here. There is no logical limit to the number of tables which could be used; the constraining factor becomes time and computing power. The current implementation was designed as a proof of concept and training model, and its efficiency could no doubt be improved.

More significant concerns relate to the assumptions made in assessing risk. There are a number of ways in which the algorithm could be subject to simple modification to allow for more stringent risk considerations. One argument that has been raised is that it is important to consider the effects of zero cell values. In the most extreme case, it is possible that a cross-tabulation may show that there are a large number of people, all of whom have exactly the same combination of variables. The corresponding cell in the output table would have a high value and would thus be ignored by the method as currently described, and yet it is clear that the characteristics of all individuals in the table would be unambiguously known. In more realistic examples, it is possible that a given cell might be the only non-zero cell for a given row or column in the output table, and again this might be considered to be a disclosing configuration. It would be quite straightforward to adapt our algorithm to modify entries in the totals vector Y_kj where that entry was the only non-zero count in the table or, in the more exacting case, the only non-zero entry in a row or column. If such cells were altered to have a value of 1, then they would register as matches when compared with the individual Z_ij vectors, and would be assessed as being disclosing counts. In a similar way, it may be considered that counts below some given threshold (say 2 or 3) are unsafe, because although they cannot unambiguously be used to identify individuals, they may be used to seed conditional probability models with a good estimate. As before, it would be possible to alter the Y_kj values that were considered unsafe to be equal to 1, so that they would match the Z_ij vectors. Both of these modifications would obviously serve to increase the number of `true' matches found, and it is intended to present additional results which demonstrate the degree to which such modifications increase the disclosure risk statistic (a short sketch of this pre-processing step is given at the end of this section).

Finally, another modification which can be considered is to alter the calculation of D_ik when assessing individual records. At present this is set to either 1 or 0, depending on whether or not all the cells of value 1 are matched. It would be plausible to allow values in the range 0-1, which would indicate that the record is partially at risk. Such a statistic might then be used to produce a fuzzy model of disclosure risk rather than a polarised `safe'-`unsafe' model.
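The zero-cell and threshold modifications described above amount to pre-processing the totals vector Y_kj before the matching step. A minimal Python sketch of that idea is given below, assuming the 0-1 representation of Section 2.2; the threshold value and the per-table bookkeeping are illustrative assumptions rather than part of the current implementation.

def harden_totals(y, table_slices, threshold=3):
    # Reset 'unsafe' cells to 1 so that they register as matches in Equation 7.
    y = list(y)
    for start, stop in table_slices:          # (start, stop) bounds of each table
        section = y[start:stop]
        nonzero = [j for j, v in enumerate(section) if v > 0]
        if len(nonzero) == 1:                 # the only non-zero cell in the table:
            y[start + nonzero[0]] = 1         # disclosive whatever its actual count
        for j, v in enumerate(section):
            if 0 < v <= threshold:            # small counts treated as unsafe
                y[start + j] = 1
    return y

# For the worked example of Section 3, table_slices would be [(0, 38), (38, 54), (54, 62)].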


6 Conclusions
The development of an explicit measure of the confidentiality risks implicit in the release of census data (or other sensitive data) is viewed as extremely useful for the following reasons:

1. It demonstrates, and thus enhances, the confidentiality of sensitive data, by ensuring safe release, by numerically measuring the actual risks and by applying consistent standards; and

2. It creates the prospect of more useful and more flexible outputs of data that may better match user needs; this is obviously beneficial to the user, and where statistical agencies are able to charge for the data they provide it may also make those services more commercially rewarding.

The method described is a useful way of assessing the risk posed by releasing a known set of microdata in tabular form, which takes into account the effects of common variables being present in more than one table. The method could easily be adapted to meet more stringent requirements governing the definitions of cells at risk. The results from the experiments conducted so far suggest that aggregating data into a tabular form is a very effective mechanism for protecting the data; it is therefore a method of output which can be used to provide a rich set of data to the user, with a known level of risk attached, which is likely to be small and may well be zero.

References
Duke-Williams, O. and Rees, P. (1998). Can census offices publish statistics for more than one small area geography? An analysis of the differencing problem in statistical disclosure, International Journal of Geographical Information Science. Forthcoming.

Elliot, M., Skinner, C. and Dale, A. (1998). Special uniques, random uniques and sticky populations: Some counterintuitive effects of geographical detail on disclosure risk, Proceedings of the Conference on Statistical Data Protection, Lisbon, 1998.

Marsh, C., Skinner, C., Arber, S., Penhale, B., Openshaw, S., Hobcraft, J., Lievesley, D. and Walford, N. (1991). The case for samples of anonymised records from the 1991 Census, Journal of the Royal Statistical Society A 154(2): 305-340.

Skinner, C., Marsh, C., Openshaw, S. and Wymer, C. (1990). Disclosure avoidance for census microdata in Great Britain, Proceedings of the US Bureau of the Census Annual Research Conference, US Bureau of the Census, pp. 131-143.

Skinner, C., Marsh, C., Openshaw, S. and Wymer, C. (1994). Disclosure control for census microdata, Journal of Official Statistics 10: 31-51.

Williamson, P. (1993). Metac91: A database about published 1991 Census tables' contents (Windows 3.1 version), Working Paper 93/18, School of Geography, University of Leeds, UK.


[Figure 1: Disclosure risks for wards in synthetic data. Plot of disclosure risk and false disclosure risk (vertical axis, 0 to 1) against population of area (horizontal axis, 0 to 35,000).]

[Figure 2: Disclosure risks for 1km grids in synthetic data. Disclosure risk and false disclosure risk against population of area (0 to 12,000).]

[Figure 3: Disclosure risks for 1km grids in synthetic data - detail. Disclosure risk and false disclosure risk against population of area (0 to 500).]

[Figure 4: Disclosure risks for variable sized grid in synthetic data. Disclosure risk and false disclosure risk against population of area (0 to 350,000).]

[Figure 5: Disclosure risks for `ED's in Italian data. Disclosure risk and false disclosure risk against population of area (0 to 14,000).]

[Figure 6: Disclosure risks for `ED's in Italian data - detail. Disclosure risk and false disclosure risk against population of area (0 to 100).]

