Harmonizing Product Codes with R

Abstract

Innovation is a major engine of economic growth. To compare products over time, harmonization of product codes is mandatory. This package provides an easy-to-use approach to harmonize product codes. Moreover, it offers an application that allows finding all new and dropped products for given firm-level data based on harmonized product codes.

Idea behind harmonization

The basic idea that stands behind the harmonization is to keep track of each single code, during a certain time period. In the simplest case, a code doesn’t change during the examined period, i.e. no harmonization needed. In any other case, all changes, which are associated with a specific code need to be documented. There are different kinds of changes: 1 to 1, 1 to many, may to 1, many to many, 1 to none, none to 1. The last two described changes, simply mean that a code was dropped, respectively that a new code was created. Those two procedures are only possible for PC8- and not for CN8-classification. 1 to many or many to 1 changes can occur, if two or more codes are split, respectively merged. It is also possible that a ‘mixture’ of changes is present, e.g. a code can merge and remain the same in terms of notation at the same time.

In a more technical way this means, a (n + m) x k matrix, consecutively referred to as ‘history matrix’, is designed to store all the information, where n is the number of observations, m is the number of added rows due changes and k the number of years. In this matrix every row represents a history of a particular code. For every code that has multiple replacements in one year a new row has to be added to the matrix. This is necessary, since one has to keep track of the changed codes as well. History matrices are designed for CN8- and for PC8-classification. These history matrices are used as a input for the final harmonization. In the further process, the history matrices are extended by new variables, like additional classification systems (HS6, BEC), binary indication variables and the harmonized code itself.

The goal of the harmonization is to make a comparison possible for the give time period. Therefore a baseline is created, defined in the last year of the period. This information is stored in the variable CN8plus (or PC8plus respectively). In order to achieve this information, the harmonization is done in several steps. Firstly, one has to check if a code didn’t change, i.e. all codes between the first and the last year of interest are the same and no change appeared in addition. If this is the case, this code does not need further harmonization and is used as CN8plus already. Secondly, all connections among codes need to be documented. This means if a code split or merged into several codes for example, this codes are grouped together from now on (consecutively referred to as family). If a mixture of changes, e.g. a code split and remained the same in terms of notation in the same year, this information is stored in the variable ‘flag’. However, these codes with flag-variables being 1, are also part of a certain family. Summarizing this means, if one takes all this information into account, one reaches the professed goal of CN8plus, which therefore includes all codes that did not change and all families. Furthermore, the varaibale flag can be 2 or 3. The variable flag being equal to 2, means that this code is either new or was dropped during the periode of interest. If it is 3, the code had at least one simple change, but is not associated with a family.

The additional classification systems, are based on CN8plus (or PC8plus respectively). Since CN8 and HS6 are closely connected (HS6 are the first six digits of CN8), this transformation is mostly straight forward and is stored in the variable HS6. However, HS6 changes its classification due a separate change-list as well. Considering these lists, yields the variable HS6plus. For PC8 codes it is not that easy. Here, first all PC8 codes need to be translated into CN8 codes (if possible) and afterwards the same procedure like with CN8 codes is used. Note that not all PC8 codes do have a corresponding CN8 code. Therefore some codes might be lost due to this issue.

The BEC classification system is also based on CN8plus. Concordance lists between CN8 and BEC are used to derive this classification. This information is stored in several ways: higher (one digit) and lower (up to 3 digits) aggregation as well as the basic class of the good. If CN8plus contains a family, it is not possible to assign a HS6- nor a BEC-classification, because a family can include several different codes, with again several different related HS6- and BEC-codes. The resulting matrix, we will call it harmonization matrix from now on, is therefore a (n + m) x (k + l) matrix, where the additional parameter l describes the the added columns.

Concordance files between HS6 and BEC Rev. 4 exist for 1996, 2002, and 2007. For 2012 and 2017, there exists a concordance between HS6 and BEC Rev. 5. Therefore, we provide BEC codes from Rev. 4 until 2011 and BEC codes from Rev. 5 thereafter. Moreover, BEC codes can be classified into three classes defined by the System of National Accounts (SNA), which focus on the end-use of the product.

Main Functions

harmonize_cn8()

provides for a given time period a data frame that contains all CN8 product codes and their history, harmonized CN8plus codes, harmonized HS6plus codes, and BEC classification. The “plus-codes” are the main outcome of the function. They provide harmonized information of the product codes, i.e. comparable codes. Every harmonization refers to the last/first year of interest. The following table offers an overview of all provided variables.

Variable	Explanation
CN8_xxxx	a specific CN8 code in a given year
CN8plus	the harmonization code for CN8, which refers to the last/first year of the time period
HS6plus	the harmonization code of HS6, which refers to the last/first year of the time period
BEC	provides the BEC classification at a high aggregation level (1 digit)
BEC_agr	provides the BEC classification at a lower aggregation level (up to 3 digits)
SNA	provides information if the code is classified as consumption, capital or intermediate good in SNA
flag	integer from 0 to 3; 1 indicates that this code remained the same in notation over the whole time period but was split or merged in addition; 2 indicates that this code is either new or was dropped during the period of interest; 3 indicates the code had at least one simple change, but is not associated with a family
flagyear	indicates the first year in which the flag was set

For more application details, see ?harmonization_cn8.

harmonize_pc8()

provides for a given time period a data frame that contains all PC8 product codes and their history, harmonized PC8plus codes, harmonized HS6plus codes, and BEC classification. The “plus-codes” are the main outcome of the function. They provide harmonized information of the product codes, i.e. comparable codes. Every harmonization refers to the last/first year of interest. The following table offers an overview of all provided variables.

Variable	Explanation
PC8_xxxx	a specific PC8 code in a given year
PC8plus	the harmonization code for PC8, which refers to the last/first year of the time period
HS6plus	the harmonization code of HS6, which refers to the last/first year of the time period
BEC	provides the BEC classification at a high aggregation level (1 digit)
BEC_agr	provides the BEC classification at a lower aggregation level (up to 3 digits)
SNA	provides information if the code is classified as consumption, capital or intermediate good in SNA
flag	integer from 0 to 3; 1 indicates that this code remained the same in notation over the whole time period but was split or merged in addition; 2 indicates that this code is either new or was dropped during the period of interest; 3 indicates the code had at least one simple change, but is not associated with a family
flagyear	indicates the first year in which the flag was set

For more application details, see ?harmonization_pc8.

Support Functions

All support functions are used within the main functions. They provide intermediate steps to harmonize the data. However, they can be used as stand-alone functions as well.

history_cn8()

provides a data frame that contains all CN8 product codes and their history over time for the demanded time period. This dataset is the basis for the main function harmonize_cn8() and can be obtained therewith as well. The following table offers an overview of all provided variables.

Variable	Explanation
CN8_xxxx	a specific CN8 code in a given year
flag	integer from 0 to 2; 1 indicates that this code remained the same in notation over the whole time period but was split or merged in addition; 2 indicates that this code is either new or was dropped during the period of interest
flagyear	indicates the first year in which the flag was set

For more application details, see ?history_cn8.

history_pc8()

provides a data frame that contains all PC8 product codes and their history over time for the demanded time period. This dataset is the basis for the main function harmonize_PC8() and can be obtained therewith as well. The following table offers an overview of all provided variables.

Variable	Explanation
PC8_xxxx	a specific PC8 code in a given year
flag	integer from 0 to 3; 1 indicates that this code remained the same in notation over the whole time period but was split or merged in addition; 2 indicates that this code is either new or was dropped during the period of interest
flagyear	indicates the first year in which the flag was set

For more application details, see ?history_pc8.

cn8_to_bec()

provides a data frame that contains all CN8 product codes and related BEC and HS6 codes in a given time period. Therefore, this data serves as a connection between CN8 and BEC classification and between CN8 and HS6 classification. It forms the basis of some output of the main function, namely: BEC, BEC_agr, SNA and HS6plus. The following table offers an overview of all provided variables.

Variable	Explanation
CN8	a specific CN8 code
HS6	provides the HS6 classification of the CN8plus code
BEC	provides the BEC classification on a high aggregation level (1 digit)
BEC_agr	provides the BEC classification on a lower aggregation level (up to 3 digits)

For more application details, see ?cn8_to_bec.

pc8_to_bec()

provides a data frame that contains all PC8 product codes and related BEC and HS6 codes in a given time period. Therefore, this data serves as a connection between PC8 and BEC classification and between PC8 and HS6 classification. It forms the basis of some output of the main function, namely: BEC, BEC_agr, SNA and HS6plus. The following table offers an overview of all provided variables.

Variable	Explanation
PC8_xxxx	a specific PC8 code
HS6	provides the HS6 classification of the PC8plus code
BEC	provides the BEC classification on a high aggregation level (1 digit)
BEC_agr	provides the BEC classification on a lower aggregation level (up to 3 digits)

For more application details see ?pc8_to_bec.

get_data_directory()

provides the directory where custom data must be stored and the used data (e.g., concordance lists, list of codes) can be edited. However, before editing the employed data or using additional concordance lists for example, it is highly recommended to read first the instructions in this vignette carefully (also see section Data Sets and Custom Data). The directory is provided in the R console. Further features (like open an explorer, print available data in console) are only executable if the directory path does not contain any blanks.

For more application details see ?get_data_directory.

Additional Functions

These functions go beyond the primary purpose of this package. The additional functions provide an application of the data frames obtained by the main functions. To use these additional functions, data on firm-level is required, which is data that is not provided by the package. The firm-level data must provide columns with the following names: ID, year and CN8 or PC8. Other columns may exist; however, they will not be used by the function. The following table summarizes the variables that need to be included in the firm-level data.

Variable	Explanation
ID	specific code that describes a firm over the years (this code does not change over time)
year	year in which the firm produced a product
CN8	CN8 code of firm product
PC8	PC8 code of firm product

  utilize_cn8()

may provide two data frames:

A data frame that contains all changed CN8 product codes per firm per year. In more detail, this means how many products remained the same, were added, were dropped, how many products were produced by a certain firm in a given year, and how many products were produced in the year after. As a base of this computation CN8plus codes or HS6plus codes can be used.
A data frame that is based on the entered firm data. The entered firm data data is extended by harmonized data (that is CN8plus, flag, flagyear, HS6plus, BEC, BEC_agr, SNA).

The tables at the end of this section offer an overview of all provided variables.

  utilize_pc8()

may provide two data frames:

A data frame that contains all changed PC8 product codes per firm per year. In more detail, this means how many products remained the same, were added or dropped - the value of the same/added/dropped products - how many products were produced by a certain firm in a given year, and how many products were produced in the year after. As a base of this computation PC8plus codes or HS6plus codes can be used.
A data frame that is based on the entered firm data. The entered firm data data is extended by harmonized data (that is PC8plus, flag, flagyear, HS6plus, BEC, BEC_agr, SNA).

The tables at the end of this section offer an overview of all provided variables.

Since the provided data frames do not differ between utilize_cn8() and utilize_pc8(), in terms of notation, the tables are only provided once here.

Table that summarizes the output, described by the notation a. above:

Variable	Explanation
firmID	specific code that describes a firm over the years (this code does not change over time)
period_UL	lower limit of the time period
period	time period in which the product was produced
gap	indicating if the time period is greater than one (i.e. upper limit - lower limit > 1)
same_products	number of products that were produced in both years (i.e. remained in the product portfolio of this firm)
value_same_products	value of products that were produced in both years (i.e. remained in the product portfolio of this firm); the value is calculated in the upper limit of the time period
new_products	number of added products in the upper limit of the time period (i.e. added to the product portfolio of this firm)
value_new_products	value of added products in the upper limit of the time period (i.e. added to the product portfolio of this firm)
dropped_products	number of dropped products in the upper limit of the time period (i.e. removed of the product portfolio of this firm)
value_dropped_products	value of dropped products in the upper limit of the time period (i.e. removed of the product portfolio of this firm); the value is calculated in the lower limit of the time period
nbr_of_products_period_LL	number of all products produced in the lower limit of the time period (i.e. entire product portfolio of this firm)
nbr_of_products_period_UL	number of all products produced in the upper limit of the time period (i.e. entire product portfolio of this firm)

Table that summarizes the output, described by the notation b. above:

Variable	Explanation
firmID	specific code that describes a firm over the years (this code does not change over time, provided by user)
year	year in which the firm produced a product (provided by user)
CN8	CN8 code of firm product (provided by user)
PC8	PC8 code of firm product (provided by user)
(value)	value of the corresponding product code (may be provided by user)
…	additional columns from original firm data (provided by user)
CN8plus	final harmonization, which refers to the last year of the time period
PC8plus	final harmonization, which refers to the last year of the time period
flag	integer from 0 to 3; 1 indicates that this code remained the same in notation over the whole time period but was split or merged in addition; 2 indicates that this code is either new or was dropped during the period of interest; 3 indicates the code had at least one simple change, but is not associated with a family
HS6	provides the HS6 classification of the PC8plus / CN8plus code
HS6plus	also adjusts for the change lists of HS6
BEC	provides the BEC classification on a high aggregated level (1 digit)
BEC_agr	provides the BEC classification on a less aggregated level (up to 3 digits)
SNA	provides information if the code is classified as consumption, capital or intermediate good in SNA

Data Sets

By default, the package provides several data sets for CN8-, PC8-, HS6- and BEC-classification. This data allows for harmonization of CN8 product codes between 1995 and 2022 and PC8 product codes between 2007 and 2017. All available data sets are stored within the package. The function get_data_directory() provides support to access the data more easily. All data included in the package was downloaded from EU server Ramon originally and altered if needed.

Provided data in more detail:

CN8 data is provided in the corresponding CN8 folder. This folder contains two different types of files. Firstly, a list of all existing CN8 codes for every year, e.g. for the year 2000, CN8_2000.rds. More technically speaking, these files provided a data frame with one column and n rows, where n is the number of existing CN8 codes in a given year. An example (first six lines) of the year 2000 is the following:
```
     group
1 01011100
2 01011910
3 01011990
4 01012010
5 01012090
6 01021010
```
Secondly, the CN8 folder contains a concordance list of all CN8 codes over time, a .csv file, where the separator is a semicolon, i.e. “;”. A header is necessary. The header names must be the following: from, to, obsolete, new. The period between “from” and “to” is always one year and describes when the code changed. The “obsolete” and “new” codes represent the outdated code and the replacement, respectively. The first six lines of the default csv-file look like the following:
```
from;to;obsolete;new
1988;1989;02012011;02012021
1988;1989;02012011;02012029
1988;1989;02012019;02012029
1988;1989;03036010;03036011
1988;1989;03036010;03036019
```
PC8 data is provided in the corresponding PC8 folder. This folder contains two different types of files. Firstly, a list of all existing PC8 codes for every year, e.g. for the year 2010, PC8_2010.rds. More technically speaking, these files provided a data frame with one column and n rows, where n is the number of existing CN8 codes in a given year. An example (first six lines) of the year 2010 is the following:
```
      2010
1 07101000
2 07291100
3 07291200
4 07291300
5 07291400
6 07291500
```
Secondly, a concordance between every year is necessary. These files contain two years in their filenames, with a period of one year in between, e.g. between 2010 and 2011 this results in PC8_2010_2011.rds. More technically speaking, these files are data frames with two columns, which must be named “new” and “old” and n rows, where n is the number of changes in a given year. An example (first six lines) of the changes between 2010 and 2011 is the following:
```
       new      old
1 07101000 07101000
2 07291100 07291100
3 07291200 07291200
4 07291300 07291300
5 07291400 07291400
6 07291500 07291500
```
Thirdly, the PC8 folder contains concordance lists between PC8- and CN8- classifications for every year. This data is needed in terms of translating PC8 into BEC. An example for the year 2010 would be PC8_CN8_2010.rds. Technically this means, a data frame with two columns, named “PRCCODE” for PC8 codes and “CNCODE” for CN8 codes and n rows, where n is the number of concordances between specific codes is provided by every year. However, no concordance between PC8 and CN8 may be possible. In this case, the missing value is filled by NA. Some examples out of the associated file for the year 2010 can be found below:
```
   PRCCODE CNCODE
1 10131430   <NA>
2 10139100   <NA>
3 10399100   <NA>
4 13301110   <NA>
5 13301121   <NA>
6 13301122   <NA>
     PRCCODE   CNCODE
2400 8111136 25151200
2401 8111150 25152000
2402 8111233 25161100
2403 8111236 25161200
2404 8111250 25162000
2405 8111290 25169000
```
HS6 data is provided in the corresponding HS6 folder. This folder only contains one type of file, which are correspondence lists between the changes of HS6 codes over time. Those changes happened in several years: 1992, 1996, 2002, 2007, 2012 and 2017. For every period, a separate concordance list is necessary. csv-files provided this data, where the separator is a semicolon, i.e. “;” and the filenames contain both years. For example, between 1996 and 2002, the file is called HS_1996_to_HS_1992.csv. Also, headers are included in this file. For this specific case, they are “HS 1996” and “HS 1992”. For other periods the headers change accordingly. An example (first six lines) of the changes between 1996 and 1992 is the following:
```
HS 1996;HS 1992
10111;10111
10119;10119
10120;10120
10210;10210
10290;10290
```
BEC data data is provided in the HS6toBEC folder. This folder contains only one type of file, which are correspondence lists between HS6- and BEC-classification in the years HS6 codes changed (i.e. 2002, 2007, 2012, 2017). For each year, a separate concordance list is necessary. csv-files are used for this data, where the separator is a semicolon, i.e. “;” and the filenames contain the year. For example, in 2002, the file is called HS2002toBEC.csv. Also, headers are included in this file, namely “HS” for the HS6 codes and “BEC” for the BEC codes. An example (first six lines) of the concordance in 2002 is the following:
```
HS;BEC
10110;41
10190;111
10210;41
10290;111
10310;41
```

Custom Data

The use of additional concordance lists for example or altering provided data is possible. However, it is highly recommended to read first the instructions in this vignette carefully. If new data is added, there are some mandatory aspects and some valuable aspects to acknowledge.

Mandatory aspects:

New data must be stored inside the package. This can be easily done by adding new files in the appropriate subfolder of the package database. get_data_directory() may provide help to find the correct folder to store new data.
Chosen filenames must be analogue to already existing files.
The structure of the new data is crucial. The section Data Sets may provide more details. In short: file-type, header names, column numbers and datatype (numeric, character, …) are very important.
All new added .csv-files must be encoded using UTF-8.

Valuable aspects:

It is highly recommended to download new data from EU server Ramon and only alter content-related data if necessary.
Product codes need to have the correct length, e.g. CN8 codes must be eight digits long. Some programs tend to interpret codes as numeric values and cut of leading zeros, which leads to completely wrong results.

Faculty of Economics and Statistics, University of Innsbruck, Christoph.Baumgartner@uibk.ac.at ↩︎
Faculty of Economics, Business and Tourism, University of Split, Stjepan.Srhoj@efst.hr ↩︎
Faculty of Economics and Statistics, University of Innsbruck, Janette.Walde@uibk.ac.at ↩︎

Harmonizing Product Codes with R

Christoph Baumgartner¹

Stjepan Srhoj²

Janette Walde³

Overview

Idea behind harmonization

Main Functions

Support Functions

Additional Functions

Data Sets

Custom Data

Harmonizing Product Codes with R

Christoph Baumgartner1

Stjepan Srhoj2

Janette Walde3

Overview

Idea behind harmonization

Main Functions

Support Functions

Additional Functions

Data Sets

Custom Data

Christoph Baumgartner¹

Stjepan Srhoj²

Janette Walde³