Complex Example: UNAIDS Data Clean-Up

UNAIDS provides data in a number of different formats--the one being considered here is the least accessible in terms of data formatting. It is the most complex example of inappropriate data in a spreadsheet field that will be considered.

As shown in the sample below, the organization has included three data elements in a single field, along with a number of characters that are not meaningful in a spreadsheet or database application. The downloads include characters like "<", "...", "[", "]", which spreadsheets do not accept as part of a number outside the context of a function. In addition, the tables use a space as a 000's separator. These are mixed in with the numerical data. The first step of the data cleanup involves getting rid of these characters in the data tables being modified.

The table below demonstrates these issues.

As a preliminary set of steps, use the FIND and REPLACE options.

1) Find ... and replace all with an empty replace option.

2) Find < and replace all with an empty replace option.

3) Find a space (a single press of the space bar) and replace all with an empty replace option. [This removes the spaces being used as '000s dividers.]
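For readers who prefer to script the clean-up, the three FIND and REPLACE passes above can be sketched in Python. This is only an illustration: the function name strip_noise and the sample cell are assumptions, not part of the UNAIDS download.

```python
def strip_noise(cell: str) -> str:
    """Apply the three find-and-replace passes to one cell's text."""
    # Same order as the steps above: "...", then "<", then single spaces.
    for unwanted in ("...", "<", " "):
        cell = cell.replace(unwanted, "")
    return cell


# Hypothetical sample cell in the "medium [low - high]" layout.
print(strip_noise("1 200 [<1 000 - 1 500]"))  # 1200[1000-1500]
```

The "[", "-" and "]" characters are deliberately left in place, because the later steps use them to locate the three estimates.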

The data will then look like this.


Although the calculations to extract the three data elements are all in the same spreadsheet, the document is too wide for the web page, so the three sets of calculations will be shown with the non-essential columns narrowed.

The first table below shows the calculations for extracting the medium estimate, which is the first in the list in each cell. As noted in earlier sections, there are many different ways of combining the functions--the ones suggested may not be the most efficient, but they work. As also noted previously, once you have gained experience it is possible to combine several functions together in one equation.

MEDIUM ESTIMATE

There are four steps in the process of extracting the medium estimate.

1) Use the FIND function to locate the "[" character. (Column C)

2) Use the LEFT function with the result of the FIND function minus 1 to extract the data. (Column D)

3) The data is extracted as text--use the VALUE function to convert the text to numeric (Column E)

4) Use COPY and PASTE (special, value) to copy column E to Column F which will now be in numeric form.

For the moment, ignore the "#VALUE!" values in the cells with no data--these will be dealt with as a last step.
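The same logic can be sketched in Python as a cross-check. This is a minimal sketch, assuming a cleaned cell looks like 1200[1000-1500]; the function name, the sample value and the row-2 cell references in the comments are illustrative only.

```python
from typing import Optional


def medium_estimate(cell: str) -> Optional[float]:
    """Return the number that appears before "[" in a cleaned cell."""
    # Spreadsheet equivalent (assumed row 2): =FIND("[",B2)
    # Python positions start at 0, and find() returns -1 when "[" is absent.
    bracket = cell.find("[")
    if bracket == -1:
        return None  # the spreadsheet would show #VALUE! here
    # Spreadsheet equivalent: =LEFT(B2,C2-1)
    text = cell[:bracket]
    # Spreadsheet equivalent: =VALUE(D2)
    return float(text) if text else None


print(medium_estimate("1200[1000-1500]"))  # 1200.0
print(medium_estimate(""))                 # None
```

Returning None for an empty cell plays the role of the "#VALUE!" result that is cleared at the end of the process.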

LOW ESTIMATE

The low estimate is the second set of numbers in the field. There are three steps for extracting the low estimate.

1) Use the FIND function to find the position of "-". (Column G)

2) Use VALUE and MID with the results in Columns C and G to extract the low estimate and convert it to numeric. The low estimate sits between the "[" located in Column C and the "-" located in Column G. (Column H)

3) Use COPY and PASTE (special, value) to copy column H to Column I which will now be in numeric form.
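As a rough Python equivalent of these steps, the low estimate is the text between "[" and "-". This sketch assumes a cleaned cell such as 1200[1000-1500]; the function name and the row-2 references in the comments are hypothetical.

```python
from typing import Optional


def low_estimate(cell: str) -> Optional[float]:
    """Return the number between "[" and "-" in a cleaned cell."""
    bracket = cell.find("[")  # spreadsheet: =FIND("[",B2)  (Column C)
    dash = cell.find("-")     # spreadsheet: =FIND("-",B2)  (Column G)
    if bracket == -1 or dash == -1:
        return None  # the spreadsheet would show #VALUE! here
    # Spreadsheet equivalent: =VALUE(MID(B2,C2+1,G2-C2-1))
    text = cell[bracket + 1:dash]
    return float(text) if text else None


print(low_estimate("1200[1000-1500]"))  # 1000.0
```

The MID length of G2-C2-1 counts only the characters strictly between the two markers, which is what the Python slice does as well.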

HIGH ESTIMATE

The high estimate is the third set of numbers in the field. There are three steps for extracting the high estimate.

1) Use the LEN function to determine the number of characters in the cells in Column B. (Column J)

2) Use VALUE and MID with the results in Columns G and J to extract the high estimate and convert it to numeric. (Column K)

3) Use COPY and PASTE (special, value) to copy column K to Column L which will now be in numeric form.
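In Python terms, the high estimate is the text after "-" with the closing "]" dropped. The sketch below assumes "]" is the last character of a cleaned cell such as 1200[1000-1500]; the function name and the row-2 references in the comments are illustrative only.

```python
from typing import Optional


def high_estimate(cell: str) -> Optional[float]:
    """Return the number between "-" and the final "]" in a cleaned cell."""
    dash = cell.find("-")  # spreadsheet: =FIND("-",B2)  (Column G)
    if dash == -1:
        return None  # the spreadsheet would show #VALUE! here
    length = len(cell)     # spreadsheet: =LEN(B2)  (Column J)
    # Spreadsheet equivalent: =VALUE(MID(B2,G2+1,J2-G2-1))
    # Stopping one character short of the end leaves out the closing "]".
    text = cell[dash + 1:length - 1]
    return float(text) if text else None


print(high_estimate("1200[1000-1500]"))  # 1500.0
```

If a download ever has trailing characters after "]", the length-based cut would need adjusting; that case is not covered here.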

REMOVE #VALUE!

Use FIND and REPLACE to find the expression #VALUE! and replace all with an empty replace option.

FINAL TABLE

Your final table will look like the table below. You can now delete columns C, D, E, G, H, J and K.

Copyright 1998/2019 GRI Updated 25 March 2019