1997 Commodity Flow Survey

Appendix C. Sample Design, Data Collection, and Estimation

INTRODUCTION

The primary goal for the 1997 Commodity Flow Survey (CFS) is to estimate shipping volumes (value, tons, and ton-miles) by commodity and mode of transportation at varying levels of geographic detail. A detailed description of the sample design for the 1997 CFS is provided below.

SAMPLE DESIGN

The sample for the 1997 CFS is selected using a stratified three-stage design in which the first-stage sampling units are establishments, the second-stage sampling units are groups of four 1-week periods (reporting weeks) within the survey year, and the third-stage sampling units are shipments.

First Stage

To create the first-stage sampling frame, we extracted a subset of establishment records from the 1995 Standard Statistical Establishment List (SSEL). The SSEL is a database, maintained by the Bureau of the Census, that contains a record for each establishment with employees. (An establishment is a single physical location where business transactions take place.) Establishments having nonzero payroll in 1994 and classified in the mining, manufacturing, wholesale, or selected retail industries, as defined by the 1987 Standard Industrial Classification (SIC) Manual, are included on the sampling frame. Auxiliary establishments (e.g., warehouses and central administrative offices) with shipping activity are also included. Auxiliary establishments are establishments that are primarily involved in rendering support services for other establishments within the same company, instead of for the public, government, or other business firms. All other establishments contained on the sampling frame are referred to as nonauxiliary establishments. For each establishment, we extracted sales, payroll, number of employees, and name and address information, as well as a primary identifier. We also computed a measure of size for each establishment. The measure of size for a particular establishment is designed to approximate the establishment's total value of shipments for 1994.

To reduce the amount of sampling variability and because estimates are desired for each commodity, we used a stratified design with a certainty component for each three-digit SIC. To accomplish this, each establishment on the sampling frame is classified into a three-digit SIC grouping. For each group of establishments, a boundary (or cutoff) that divides the certainty establishments from the noncertainty establishments is determined using the Lavallee-Hidiroglou algorithm. If an establishment's measure of size is greater than the cutoff, the establishment is selected "with certainty." Establishments selected with certainty are assured of being selected and represent only themselves (i.e., they have a selection probability of one and a sampling weight of one). No certainty cutoffs are set for auxiliary establishments because they make up only a small portion of the estimated total value of shipments for all establishments on the sampling frame.

Establishments not selected with certainty make up the noncertainty universe. We stratify the noncertainty universe by SIC recode, National Transportation Analysis Region (NTAR), and a flag used to differentiate auxiliary establishments from nonauxiliary establishments. Each SIC recode is constructed from a group of related three-digit SIC codes. The NTARs, developed by the Department of Transportation as combinations of Bureau of Economic Analysis (BEA) Areas, collectively provide a mutually exclusive and exhaustive coverage of the United States. Finally, the auxiliary flag is included because establishments with different types of operation may have different shipping practices. We refer to a particular SIC recode-NTAR-auxiliary flag combination as a primary stratum.

We further stratify the noncertainty establishments within each primary stratum using the measure of size previously described. We refer to these measure-of-size strata as substrata of the primary strata. The measure-of-size stratification increases the efficiency of the sample design. The Dalenius-Hodges cumulative square root of frequency rule is used to set the substratum boundaries. We then use Neyman allocation to determine the sample size required within each substratum to meet a coefficient of variation constraint on the primary stratum total measure of size. Within each substratum, a simple random sample of establishments is selected without replacement.
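The substratification steps can be illustrated with a minimal Python sketch. It is not the production procedure: the function names, the number of histogram bins, and the use of a fixed total sample size (rather than a sample size solved from the coefficient of variation constraint) are all assumptions made for illustration.

```python
import math
import random
from statistics import pstdev


def cum_sqrt_f_boundaries(sizes, n_strata, n_bins=50):
    """Return n_strata - 1 upper boundaries from the cumulative sqrt(f) rule."""
    lo, hi = min(sizes), max(sizes)
    if hi <= lo:
        raise ValueError("measure of size must vary within the primary stratum")
    width = (hi - lo) / n_bins
    # Frequency of establishments in each equal-width measure-of-size bin.
    freq = [0] * n_bins
    for x in sizes:
        freq[min(int((x - lo) / width), n_bins - 1)] += 1
    # Cumulate sqrt(f) across the bins.
    cum, running = [], 0.0
    for f in freq:
        running += math.sqrt(f)
        cum.append(running)
    # Cut where the cumulative sqrt(f) reaches equal shares of its total.
    boundaries = []
    for k in range(1, n_strata):
        target = k * running / n_strata
        i = next(j for j, c in enumerate(cum) if c >= target)
        boundaries.append(lo + (i + 1) * width)
    return boundaries


def neyman_allocation(substrata, n_total):
    """Allocate n_total across substrata in proportion to N_h * S_h."""
    products = [len(s) * pstdev(s) for s in substrata]
    total = sum(products)
    if total == 0:
        raise ValueError("at least one substratum must have nonzero spread")
    # Rounding, the floor of 1, and the cap at N_h mean the allocation
    # may not sum exactly to n_total.
    return [min(len(s), max(1, round(n_total * p / total)))
            for s, p in zip(substrata, products)]


def select_sample(substrata, allocation, seed=None):
    """Simple random sample without replacement within each substratum."""
    rng = random.Random(seed)
    return [rng.sample(s, n) for s, n in zip(substrata, allocation)]
```

With two substrata of 10 establishments each and standard deviations of 1 and 3, `neyman_allocation([[0, 2] * 5, [0, 6] * 5], 8)` assigns three times as many units to the more variable substratum.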

To arrive at the final sample size, we allocated additional establishments to some of the strata so that the probability of selecting any establishment is no less than 1 in 100. In total, the first-stage sample comprises 102,739 establishments.
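The effect of the 1-in-100 floor on a single substratum can be sketched as follows. Under simple random sampling without replacement the selection probability is n_h / N_h, so the floor amounts to a minimum sample size. The function name and arguments are hypothetical.

```python
def apply_probability_floor(n_h, N_h, max_inverse_prob=100):
    """Raise n_h, if needed, so that n_h / N_h is at least 1 / max_inverse_prob."""
    minimum_n = -(-N_h // max_inverse_prob)  # ceiling of N_h / max_inverse_prob
    return min(N_h, max(n_h, minimum_n))
```

For example, a substratum of 1,000 establishments allocated only 2 sample units would be raised to 10, which caps the establishment weight N_h / n_h at 100.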

Second Stage

The frame for the second stage of sampling consists of 52 one-week reporting periods (reporting weeks) during the interval from December 29, 1996, to December 26, 1997. Each establishment selected for the 1997 CFS was systematically assigned to report for a group of four reporting weeks spread across the survey year. The four reporting weeks in a given group are separated by 12 weeks, so consecutive reporting weeks fall 13 weeks apart, one in each quarter of the survey year. For example, an establishment might be requested to report data for the 5th, 18th, 31st, and 44th weeks of the survey year.
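The structure of the reporting-week groups can be sketched in a few lines of Python. The 13 possible groups follow from the description above; the rule that cycles establishments through the groups in list order is only an assumed stand-in for the systematic assignment, whose details are not given here.

```python
WEEKS_IN_YEAR = 52
WEEKS_PER_GROUP = 4
SPACING = WEEKS_IN_YEAR // WEEKS_PER_GROUP  # 13 weeks between reporting weeks


def reporting_weeks(group):
    """Return the four week numbers for reporting group 1..13."""
    if not 1 <= group <= SPACING:
        raise ValueError("group must be between 1 and 13")
    return [group + SPACING * q for q in range(WEEKS_PER_GROUP)]


def assign_groups(establishment_ids):
    """Hypothetical systematic assignment: cycle through the groups in order."""
    return {est: reporting_weeks(i % SPACING + 1)
            for i, est in enumerate(establishment_ids)}
```

Here `reporting_weeks(5)` reproduces the example in the text: weeks 5, 18, 31, and 44.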

Third Stage

For each of the four reporting weeks in which an establishment is asked to report, we request the respondent to construct a sampling frame that consists of all shipments made by their establishment in each particular reporting week. For any particular reporting week, if an establishment makes 40 or fewer shipments during that week, we ask the respondent to provide information about all of their establishment's shipments from that week, i.e., no sampling is required. For establishments making more than 40 shipments in a given reporting week, we ask the respondent to select a systematic sample of these shipments and to provide us with information only about the selected shipments. The size of a particular respondent's sample for a given reporting week should be between 20 and 40 shipments, depending on the total number of shipments the establishment made during that reporting week.
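A respondent's third-stage selection might look like the sketch below. The take-all threshold of 40 comes from the text; the rule used to choose the sampling interval (the ceiling of the shipment count divided by 40) is an assumption chosen so that the sample size lands between 20 and 40, and is not necessarily the interval respondents were instructed to use.

```python
import random

TAKE_ALL_LIMIT = 40  # report every shipment at or below this weekly count


def sample_shipments(shipments, rng=None):
    """Return (selected shipments, sampling interval) for one reporting week."""
    rng = rng or random.Random()
    n = len(shipments)
    if n <= TAKE_ALL_LIMIT:
        return list(shipments), 1  # no sampling required
    # Assumed interval rule: ceil(n / 40) keeps the sample size within 20..40.
    interval = -(-n // TAKE_ALL_LIMIT)
    start = rng.randrange(interval)  # random start within the first interval
    return shipments[start::interval], interval
```

The returned interval is kept alongside the sample because the ratio of total to sampled shipments is needed later for weighting.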

DATA COLLECTION

Each establishment selected into the CFS sample is mailed a questionnaire for each of its four reporting weeks. For a given establishment, we request the respondent to provide the following information about their establishment's shipments: domestic destination or port of exit, commodity, value, weight, mode(s) of transportation, the date on which the shipment was made, and an indication of whether the shipment was an export, hazardous material, or containerized. For shipments that include more than one commodity, respondents are instructed to report the commodity that makes up the greatest percentage of the shipment's weight. For exports, we also ask the respondent to provide the mode of export and the foreign destination city and country.

We used two versions of the questionnaire to collect data from the sampled establishments: the CFS-1000 and the CFS-2000. Each establishment received the CFS-1000 in each of its first three reporting weeks. However, for the fourth reporting week, a subsample of approximately 25,000 establishments received the CFS-2000, while the remaining establishments received the CFS-1000. The CFS-2000 asks the respondent to provide additional information about their establishment's access to on-site and off-site shipping facilities, as well as transportation equipment. See Appendix E for a copy of each questionnaire.

ESTIMATION

Each shipment has associated with it a single tabulation weight that is used in computing all estimates to which the shipment contributes. The tabulation weight is a product of seven different weights. A description of each weight follows.

CFS respondents provide data for a sample of shipments made by their respective establishments in the survey year. For each establishment, we produce an estimate of that establishment's total value of shipments for the entire survey year. To do this, we use four different weights: the shipment weight, the shipment nonresponse weight, the quarter weight, and the quarter nonresponse weight.

As with establishments, we identify shipments as either certainty or noncertainty. (See the Nonsampling Error section in Appendix B for a description of how certainty shipments are identified.) For noncertainty shipments, the shipment weight is defined as the ratio of the total number of noncertainty shipments (as reported by the respondent) made by an establishment in a reporting week to the number of sampled noncertainty shipments for the same week. This weight uses the data from the sampled shipments to represent all the establishment's shipments made in the reporting week. However, some respondents fail to provide sufficient information about a sampled shipment. For example, a respondent may not be able to provide value, weight, or a destination ZIP Code for some of the sampled shipments. If these data items cannot be imputed, then these shipments do not contribute to tabulations and are deemed "unusable." (A usable shipment is one that has valid entries for value, weight, and origin and destination ZIP Codes.) To account for these unusable shipments, we apply the shipment nonresponse weight. For noncertainty shipments from a particular establishment's reporting week, this weight is equal to the ratio of the number of sampled shipments for the reporting week to the number of usable shipments for the same week. The shipment weight and shipment nonresponse weight for certainty shipments from a particular establishment's reporting week are both equal to one.
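The two shipment-level ratios for noncertainty shipments can be written out directly. This is a small sketch with hypothetical function and argument names; the counts in the usage note are invented for illustration.

```python
def shipment_weights(total_noncertainty, sampled_noncertainty, usable_noncertainty):
    """Return (shipment weight, shipment nonresponse weight) for one reporting week."""
    # Total noncertainty shipments reported for the week / sampled noncertainty shipments.
    shipment_weight = total_noncertainty / sampled_noncertainty
    # Sampled shipments / sampled shipments with usable data.
    nonresponse_weight = sampled_noncertainty / usable_noncertainty
    return shipment_weight, nonresponse_weight
```

For an establishment that made 120 noncertainty shipments in a reporting week, sampled 30 of them, and supplied usable data for 25, the weights are 4.0 and 1.2. Their product, 4.8, scales the 25 usable shipments back up to the 120 shipments made that week.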

The quarter weight inflates an establishment's estimate for a particular reporting week to an estimate for the corresponding quarter. For noncertainty shipments, the quarter weight is equal to 13. The quarter weight for most certainty shipments is also equal to 13. However, if a respondent is able to provide information about all large (or certainty) shipments made in the quarter containing the reporting week, then the quarter weight for each of these shipments would be one. For each establishment, the quarterly estimates are added to produce an estimate of the establishment's value of shipments for the entire survey year. Whenever an establishment does not provide the Census Bureau with a response for each of its four reporting weeks, we compute a quarter nonresponse weight. The quarter nonresponse weight for a particular establishment is defined as the ratio of the number of quarters for which the establishment was in business in the survey year to the total number of quarters (reporting weeks) for which we received usable shipment data from the establishment.
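The quarter-level weights can be sketched in the same way. The function and argument names are hypothetical, and the flag marking that a respondent reported all certainty shipments for the quarter is an assumed representation of the exception described above.

```python
WEEKS_PER_QUARTER = 13


def quarter_weight(is_certainty, all_certainty_shipments_reported=False):
    """Weight that inflates a reporting week to its quarter."""
    if is_certainty and all_certainty_shipments_reported:
        return 1  # the shipment already represents the whole quarter
    return WEEKS_PER_QUARTER


def quarter_nonresponse_weight(quarters_in_business, quarters_with_usable_data):
    """Quarters in business during the survey year / quarters with usable data."""
    return quarters_in_business / quarters_with_usable_data
```

An establishment in business all year that returned usable data for three of its four reporting weeks would receive a quarter nonresponse weight of 4/3.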

Using these four component weights, we compute an estimate of each establishment's value of shipments for the entire survey year. We then multiply this estimate by a weight that adjusts the estimate using value of shipments and sales data obtained from other Census Bureau surveys and preliminary results of the 1997 Economic Census. This weight, called the establishment-level adjustment weight, attempts to correct for any sampling or nonsampling errors that occur during the sampling of shipments by the respondent.

The adjusted value of shipments estimate for an establishment is then weighted by the establishment weight. This weight is equal to the inverse of the establishment's probability of being selected into the sample.

A final adjustment weight, called the SIC-level adjustment weight, uses preliminary results of the 1997 Economic Census to account for establishments from which we did not receive a response (including establishments from which we did not receive any usable shipment data) and for changes in the population of establishments between the time the first-stage sampling frame was constructed (1995) and the year in which the data were collected (1997). Separate SIC-level adjustment weights are determined for nonauxiliary and auxiliary establishments.
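Taken together, the seven component weights can be summarized in one expression. The symbols below are introduced here for illustration and are not the survey's own notation. The weighted-sum form of the estimate is the usual reading of a tabulation weight, stated here as an assumption.

```latex
% Tabulation weight for shipment i from establishment e in SIC group g
W_i = w^{\mathrm{ship}}_i \cdot w^{\mathrm{ship\,nr}}_i \cdot w^{\mathrm{qtr}}_i
      \cdot w^{\mathrm{qtr\,nr}}_e \cdot w^{\mathrm{est\,adj}}_e
      \cdot w^{\mathrm{est}}_e \cdot w^{\mathrm{SIC\,adj}}_g ,
\qquad w^{\mathrm{est}}_e = \frac{1}{\pi_e}

% Estimated total of a shipment characteristic x (value, tons, or ton-miles)
% over the usable shipments in a tabulation cell C
\hat{X}_C = \sum_{i \in C} W_i \, x_i
```

Here \pi_e is the establishment's first-stage selection probability, so a certainty establishment has an establishment weight of one.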