Using Software to Analyze Your Data: Stata for Dummies

Symbols and short-cuts:

+	Addition

-	Subtraction

*	Multiplication

/	Division

^	Raise to power

>	Greater than (Note: Stata treats a missing numeric value as larger than any number, so a test such as y>50 is also true for every missing value. To keep missing values out, do one of two things: (1) add an upper limit to the test (e.g., if y>50 & y<100, where the upper limit is just above the greatest value in your dataset); or (2) add a does-not-equal condition that excludes all missing values (preferred method) (e.g., if y>50 & y~=.). Every empty numeric cell in Stata holds a "." entry, so the does-not-equal condition excludes all such cells. The period is typed without quotation marks; quotation marks are needed only when a condition refers to the contents of text-filled (string) cells.)

<	Less than (Note: No bottom limit needs to be assigned.)

>=	Greater than or equal to

<=	Less than or equal to

==	Equals

~=	Does not equal

&	And

|	Or

Page Up Moves you to previous command

Page Down Moves you to next command

Control K	Stops the current output from proceeding
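The note on missing values under the greater-than symbol can be checked with a few lines typed into Stata. This is only an illustrative sketch: the variable y and its three values are invented for the demonstration and do not come from any real dataset.

```stata
* Build a tiny illustrative dataset: two real values and one missing value
clear
input y
10
60
.
end

* Missing values rank above every number, so this counts 60 AND the missing value
count if y>50

* Excluding missing values leaves only the observation where y is 60
count if y>50 & y~=.
```

The first count reports 2 and the second reports 1, which is why the does-not-equal condition is worth adding whenever > is used.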

Notes:

	In the list below, "vn" = variable name.

	No variable name in Stata can begin with a numeric value or contain a dash (-).

	All extensions, such as "if vn1==x", "if vn1==x | vn1==y", or "if vn1==x & vn1==z", can be included in any of the commands below. If there is no comma in the command line, they go at the end of the command. If there is a comma in the command line, they go immediately before the comma.

	Before opening large datasets, type set memory 100m on the command line and press ENTER to allocate enough memory for Stata to open and work properly with the dataset. (Stata 12 and later manage memory automatically, so this step is not needed there.)

Numeric Types:

Continuous	e.g., age

Categorical	e.g., race (white, black, Hispanic)

Dichotomous	only two possible outcomes (0/1, yes/no, true/false), e.g., death

Data Formats:

Numeric
	byte	-127 to 126
	int (integer)	-32,768 to 32,766
	long	-2,147,483,648 to 2,147,483,646
	float	10^-37 to 10^37
	double	10^-99 to 10^99

(These limits are the ones given for older Stata releases and may differ in newer ones.)

String	e.g., text, numbers treated as text

Creating a log: In order to record the work session, you must first create a log file in Stata. Without a log file, you will not be able to print or preserve your work. The log file records both what you type and the Stata output in response to your command. This file also can be opened in word processing programs and incorporated into other documents. As an alternative to typing the commands below, you can also click on the stop light button (red, yellow, green dots) and specify where to save the log file.

log using "filename"	Creates a log file that records both what you type and the Stata output in response to your command, e.g., log using "c:\work\example.log".

log off	Temporarily stops the file from logging.

log on	Resumes the suspended log file.

log close	Closes the log file.
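A short logged session might look like the sketch below. The file name example_session.log is hypothetical, and the example assumes the auto dataset that ships with Stata (loaded with sysuse) simply to have something to summarize.

```stata
* Close any log that is already open so that log using does not fail
capture log close

* Start a plain-text log in the current working directory
log using "example_session.log", replace

sysuse auto, clear
summarize price

* Suspend logging: the describe output below is not recorded
log off
describe

* Resume logging, then close the file when the session is finished
log on
summarize mpg
log close
```

Opening example_session.log afterwards in a word processor should show the two summarize commands and their output, but not the describe output.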

Importing Data into Stata from Excel: There are several ways to import data into Stata. One way is to select (block) the data in Excel that you wish to put in your Stata dataset and copy it using the Edit menu. Open Stata, open the data editor, and paste the data using the Edit menu. Close the data editor, and use the “Save as” function on the File menu to save the Stata dataset. The data are now in Stata format and can be opened directly in Stata.

Another way is to use Stata commands to import data:

insheet using "filename"	Reads text (ASCII) files created by spreadsheet or database programs into Stata format. The data must be tab separated or comma separated.

infile vn1 vn2 using "filename"	Reads data files into Stata format. The data can be tab separated, comma separated, or space separated. The names of the variables to be created are listed before the word using.
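The round trip below is a minimal sketch of reading a comma-separated file with insheet. The file name auto_subset.csv is hypothetical, and the file is first written out from the auto dataset that ships with Stata so that the example is self-contained. Newer Stata releases also offer import delimited and export delimited for the same purpose.

```stata
* Write a small comma-separated text file to the working directory
sysuse auto, clear
keep make price mpg
outsheet using "auto_subset.csv", comma replace

* Read the text file back into Stata format
clear
insheet using "auto_subset.csv", comma clear

* Check what arrived: variable names come from the first row of the file
describe
```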

Data transfer programs are also available that can convert data from many different program formats into Stata format. These include Stat/Transfer and DBMS/Copy.

Commands:

describe Describes your dataset with the file directory pathway, a list of the variables, the number of observations, and the type and labels for each variable.

list vn Lists all the values for the variable named. Can be combined with if/and/or statements to impose limits and restrictions.

sum vn Provides the number of observations, mean, standard deviation, minimum and maximum values for the variable specified.

sum vn, detail In addition to providing the number of observations, mean, standard deviation, minimum and maximum values for the variable specified, detail provides the median (50th percentile), variance, and the 1, 5, 10, 25, 75, 90, 95, and 99 percentiles for the variable.
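As a quick illustration of describe, list, and sum, the sketch below uses the auto dataset that ships with Stata. The dataset and the price threshold are assumptions made for the example; substitute your own variables.

```stata
sysuse auto, clear

* Overview of the dataset: variables, types, labels, number of observations
describe

* List two variables, restricted by an if condition that excludes missing values
list make price if price>10000 & price~=.

* Basic summary, then the detailed version with percentiles
sum price
sum price, detail

* After sum, detail the median is available as r(p50)
display r(p50)
```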

tab vn Tabulates the variable by listing every observed outcome (with its individual percent and cumulative percent of the population) for the variable. For example, tabulation of a categorical variable with three possible outcomes would list each outcome, the number of observations for each outcome, along with the percent representation of each outcome in the entire population, as well as the cumulative percent of each outcome in an ascending manner. The tabulation of a continuous variable would likewise list each outcome with its individual percent representation and cumulative percent in relation to the entire population. Tabulation of a continuous variable allows one to generate outputs used for percent and cumulative distribution graphs.

tab vn1, sum (vn2) Tabulates vn1 (a categorical variable), and summarizes (n, mean, s.d.) vn2 (generally a continuous variable) as it is associated with each of the tabulated vn1 outcomes. For example, tab drug, sum (age) would list the number of observations, mean, and standard deviation for the age of patients receiving each of the different study drugs (e.g., experimental drug, currently accepted drug, or placebo) in the dataset.

tab vn1 vn2 Tabulates vn1 and vn2 in a cellular table providing the number of observations in each cell. This can be used only when vn1 and vn2 are both categorical variables. For example, tab artery thrombus would generate a 3 x 2 table showing the number of LAD, LCx, and RCA arteries with and without thrombus, respectively.

tab vn1 vn2, row column Performs the same tabulation as above, but in addition to the number of observations in each cell, this command generates the row and column percentages for each cell. This is usually the preferred method of tabulating two categorical variables.

tab vn1 vn2, chi2 Tabulates vn1 and vn2 in a cellular table as above, but in addition performs chi-square (Pearson) statistical analysis between groups. This can be combined with the "row column" extension to generate the percent distribution for each cell as well. The command would be written as follows: tab vn1 vn2, row column chi2. This is usually the preferred method of tabulating and calculating the chi-square p-value for two categorical variables.

tab vn1 vn2, all exact Tabulates vn1 and vn2 in a cellular table as above, but in addition outputs all of the statistical association measures between groups. Such measures include the Pearson chi-square, the likelihood-ratio chi-square, Cramér's V, Goodman and Kruskal's gamma, Kendall's tau-b, and Fisher's exact. The Fisher's exact value can be obtained with the Pearson chi-square value by the command tab vn1 vn2, chi2 exact where again the "row column" extension can be added as above to generate percent distributions. Fisher's exact test should be used in analyses with very few events, as is often the case with death.

tabi x1 y1 \ x2 y2 Uses the data you enter to perform a Fisher's exact analysis. The syntax of the command should be x1 = # of observations affirmative (group 1), y1 = # of observations negative (group 1), x2 = # of observations affirmative (group 2), y2 = # of observations negative (group 2).

tab vn, plot Tabulates vn (where vn is a categorical variable), and shows the relative distribution of each group visually as a horizontal bar graph of sorts.
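The tabulation commands above can be tried on the auto dataset that ships with Stata. This is a sketch under that assumption: foreign and rep78 stand in for the categorical variables, mpg for the continuous one, and the four counts given to tabi are invented for illustration.

```stata
sysuse auto, clear

* One-way tabulation with percent and cumulative percent
tab rep78

* Summarize a continuous variable within each category
tab foreign, sum(mpg)

* Two-way table with row and column percentages, Pearson chi-square, and Fisher's exact
tab foreign rep78, row column chi2 exact

* Immediate form: type the four cell counts directly (invented numbers)
tabi 30 18 \ 38 14, exact
```

The exact option is written out on the tabi line so that the request for Fisher's exact test is explicit.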

ttest vn1, by (vn2) Unpaired t-test which can be used when vn1 is a continuous variable and vn2 is a categorical variable with only two possible outcomes. This t-test will also provide the means, standard errors, and number of observations for the continuous variable in association with each of the two categorical outcomes. For example, ttest age, by (gender) would compare the mean age of female patients with the mean age of male patients using t-test analysis.

ttesti x1 y1 z1 x2 y2 z2 Unpaired t-test where the operator would type in the numeric values for the number of observations (x1), mean (y1), and standard deviation (z1) for one group, and the number of observations (x2), mean (y2), and standard deviation (z2) for the other group being compared.

ttest vn1=vn2 Paired t-test. The only observations included in this analysis are observations for which there are entries for both vn1 and vn2 for the same subject. For example, using "pre_sten" as the variable name for the percent diameter stenosis in arteries prior to PTCA and "poststen" as the variable name for the percent diameter stenosis in arteries after PTCA, the command ttest pre_sten=poststen would compare the stenoses pre- and post-PTCA using t-test analysis. Important Note: The only data included in this analysis are data for which there are values for both the pre-PTCA percent stenosis and the post-PTCA percent stenosis for the subject. If the subject was missing either of these values, his/her data would not be included in the t-test analysis. This is the nature of paired analyses.

ttest vn1=0 Performs a t-test analysis to determine whether vn1 is statistically equal to (or different than) zero.
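The t-test variants can be sketched with the auto dataset that ships with Stata (an assumption made for the example). Here mpg plays the continuous variable and foreign the two-outcome categorical variable; the six numbers passed to ttesti are illustrative summary figures typed in by hand. The one-sample test is written with ==, the form used in current Stata documentation.

```stata
sysuse auto, clear

* Immediate unpaired t-test from typed-in n, mean, and s.d. for each group
ttesti 52 19.8 4.7 22 24.8 6.6

* One-sample test of whether mean mpg differs from 20
ttest mpg==20

* Unpaired t-test comparing mean mpg between the two groups of foreign
ttest mpg, by(foreign)
```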

ranksum vn1, by(vn2) Tests the hypothesis that two unmatched samples (vn2, a categorical variable) are from populations with the same distribution using the Wilcoxon rank-sum test. Vn1 is the continuous variable. The results indicate whether or not the medians are statistically different, e.g., p>0.05 indicates no difference in medians while p<0.05 indicates the medians differ. Used with populations that are not normally distributed or when imputed values have been used in the dataset.

correlate Typing correlate alone produces a correlation matrix (of r-values) for all variables in the dataset. The datasets used at PERFUSE are generally too large for this (i.e., matsize in Stata is too small). It is useful with smaller datasets.

correlate vn1 vn2 Generates r-value correlations between each variable entered. This command can be extended to as many variables as the operator chooses, and the level of correlation for all possible comparisons are listed as a matrix.

anova vn1 vn2 Performs analysis of variance testing with vn1 as continuous variable and vn2 as categorical variable. When there are only two possible outcomes for the categorical variable, the anova functions similarly to a t-test. Notes: (1) In Stata, the reference group dropped in anova testing is always the last group unless specified. To specify which group to drop, see the Stata reference manuals. (2) After running an anova test, the operator can simply type "regress" to obtain the regression output for each of the groups.

oneway vn1 vn2 Produces more informative output for standard one-way analysis of variance testing with vn1 as continuous variable and vn2 as categorical variable. When combined with the bonferroni option, the output includes a matrix showing the difference between each pair of groups as well as the p-value after the Bonferroni correction (Bonferroni-adjusted significance) has been applied. To apply the Bonferroni option the command should be expressed as follows: oneway vn1 vn2, bonferroni.

kwallis vn1, by(vn2) Tests the hypothesis that more than two unmatched samples (vn2, a categorical variable) are from populations with the same distribution using the Kruskal-Wallis test. Vn1 is the continuous variable. The results indicate whether or not the medians are statistically different, e.g., p>0.05 indicates no difference in medians while p<0.05 indicates the medians differ. Used with populations that are not normally distributed or when imputed values have been used in the dataset.
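A short run through the correlation, analysis of variance, and nonparametric commands, again assuming the auto dataset that ships with Stata (mpg as the continuous variable, foreign as a two-group variable, rep78 as a multi-group variable):

```stata
sysuse auto, clear

* Wilcoxon rank-sum test for two unmatched groups
ranksum mpg, by(foreign)

* Correlation matrix for a chosen set of variables
correlate price mpg weight

* One-way analysis of variance with Bonferroni-adjusted pairwise comparisons
oneway mpg rep78, bonferroni

* Kruskal-Wallis test for more than two unmatched groups
kwallis mpg, by(rep78)
```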

regress vn1 vn2 Performs linear regression with vn1 as the dependent variable. When combined with only one additional variable (e.g., regress vn1 vn2), performs univariate linear regression. When combined with multiple variables (e.g., regress vn1 vn2 vn3 vn4), performs multivariate linear regression. Multivariable models are generated from multivariate linear regression testing. When categorical variables with more than two possible outcomes are entered into a regression model, they are entered as ordinal (numeric/continuous) variables and not as nominal (categorical) variables. Therefore, the operator should convert all multi-outcome categorical variables to dichotomous (0/1) variables if he/she wishes to enter them into the regression model.

logistic vn1 vn2 Performs logistic regression with vn1 as the dependent categorical variable. As with the "regress" linear regression command, when combined with only one additional variable (e.g., logistic vn1 vn2), performs univariate logistic regression. When combined with multiple variables (e.g., logistic vn1 vn2 vn3 vn4), performs multivariate logistic regression. As with linear regression testing, the operator should convert all multi-outcome categorical variables to dichotomous (0/1) variables if he/she wishes to enter them into the regression model, or use the xi command (see below), which creates dummy variables for you. The "logistic" command is similar to the "logit" command (see below), except that the logistic command presents the estimates for each of the independent variables as odds ratios instead of beta coefficients. When an independent variable is inversely related to the dependent variable, an odds ratio less than 1, or a coefficient with a negative value, is generated. When an independent variable is directly related to the dependent variable, an odds ratio greater than 1, or a coefficient with a positive value, is generated. Logistic model outputs also report a pseudo-R2 value for the model. This value is loosely analogous to the R2 of linear regression, i.e., a rough measure of how much of the variation in the dependent variable is explained by the independent variable(s) entered in the model. Important Note: To plot the ROC curve (sensitivity vs. 1-specificity) after running a logistic regression model, simply type lroc alone.

logit vn1 vn2 Performs logistic regression as above, but displays the relation of the independent variables as beta coefficients. For a more detailed explanation of the output, see the logistic command above.

mlogit vn1 vn2 Performs logistic regression with vn1 as the dependent categorical variable with more than two outcomes. The outcomes of the dependent variable should not have a natural ordering, but should rather be outcomes arbitrarily coded such as for the variable "artery" where the number assigned to each infarct related artery is not of importance. The mlogit test will generate identical outputs as the logistic and logit tests if the dependent variable has only two possible outcomes. As with the logit test, the output for mlogit testing expresses coefficients (and not odds ratios as for the logistic test) for the independent variables in the model. When mlogit tests are performed, Stata will show the operator which outcome is dropped and compared against at the bottom of the output in colored lettering.

mlogit vn1 vn2, rrr Performs logistic regression identically to the mlogit test described above, but instead of expressing coefficients for the independent variables in the model, it expresses relative risk ratios for them. Do not let the terms relative risk ratio and odds ratio (from the logistic test) be confusing. They are synonymous here and will always be identical when there are only two possible outcomes for the dependent variable.

xi	Used with regression models (logistic, regress, stcox, etc.) to create dummy variables for categorical data. The xi: is placed before the model and i. is placed before the categorical variable. For example, a logistic model for death that includes TFG and age would be set up as follows: xi: logistic death i.tfg age. By default, the dummy-variable set is identified by dropping the dummy corresponding to the smallest value of the variable. In this case, that would be TFG 0.

IMPORTANT NOTE: For all odds ratios and coefficients generated in outputs from regression (logistic or linear) testing, the values listed are for a one unit change in the independent variable. For dichotomous categorical variables, the values listed then would be for the affirmative of the variable since the only possible outcomes are an affirmative or a negative for the variable.
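The regression commands above can be compared side by side on the auto dataset that ships with Stata. This is a sketch under that assumption: price is used as a continuous dependent variable, foreign as a dichotomous (0/1) dependent variable, and rep78 as a multi-outcome categorical variable expanded by xi.

```stata
sysuse auto, clear

* Univariate, then multivariate, linear regression
regress price mpg
regress price mpg weight

* xi: expands rep78 into dummy variables, dropping the smallest category
xi: regress price i.rep78 mpg

* Logistic regression reported as odds ratios
logistic foreign mpg weight

* The same model reported as beta coefficients
logit foreign mpg weight
```

Comparing the last two outputs shows the correspondence described above: an odds ratio below 1 in the logistic output goes with a negative coefficient in the logit output.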

Survival analysis:

stset vn	Declares the data to be survival data. This is the first step in doing any survival analysis in Stata, where vn here is the time to the event, i.e., time to failure. Stata assumes that all patients had an event at the time listed in vn.

stset vn1, failure(vn2)	Vn2 should be a dichotomous variable for the event, where 1 indicates an event occurred, such as death, and 0 indicates no event occurred during the follow-up time. Stata will then not assume that an event occurred at the time listed in vn1, but will know which individuals had an event and which individuals were “censored”, i.e., alive at the end of the follow-up period.

sts graph	Produces a single Kaplan-Meier survival curve based on the data that has been stset.

sts graph, by(vn)	Produces Kaplan-Meier survival curves for each of the categories in vn. For example,  sts graph, by(gender) would produce a graph with 2 curves, one for survival among females and one for survival among males.

sts test vn	Uses probability testing to see if the 2 survival curves are different using the log-rank test. For example, sts test gender would indicate the probability that the Kaplan-Meier curve for men differs from the curve for women.

stcox vn	Estimates Cox proportional hazard model. As with logistic regression, can be used with continuous or dichotomous variables. If using categorical variables, you must use the xi command to create dummy variables (see xi command). Multiple variables can be entered into the model.
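A complete survival sequence might look like the sketch below. It assumes the cancer example dataset that ships with Stata, in which studytime is the follow-up time, died is the 0/1 event indicator, and drug is a three-category treatment variable.

```stata
sysuse cancer, clear

* Declare the data to be survival data: time variable plus event indicator
stset studytime, failure(died)

* Kaplan-Meier curves, one per treatment group
sts graph, by(drug)

* Log-rank test for equality of the survival curves across groups
sts test drug

* Cox proportional hazards model with drug expanded into dummy variables
xi: stcox i.drug age
```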

Graphing:

graph vn1 Graphs the variable specified as a histogram: in five bins if continuous, and in categories if categorical.

graph vn1, bin(x) Graphs the variable specified by the number of divisions "x" as a histogram. For example, the command graph age, bin(20) would graph a histogram consisting of 20 bars with even increments between the minimum and maximum value for patient age.

graph vn1 vn2 When two variables are specified in a graph command, Stata plots each point according to its coordinates. The first variable specified, "vn1", will be the y-axis variable; and the second variable specified, "vn2", will be the x-axis variable.

graph vn1 vn2, pie Graphs the variables specified as a pie graph according to their distribution. If both variables are categorical, Stata will graphically represent the percent of the total pie (i.e. inclusive of only those observations covered by the variables specified) which is affirmative (the higher numeric value) for each of the variables. For example, to generate a pie graph for the epicardial artery distribution in the dataset, the operator would need to have (or create) dichotomous variables for LAD, LCx, and RCA each separately (where a 1 is assigned as the affirmative and a 0 is assigned as the negative). The command line would read: graph lad lcx rca, pie.
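The graph commands in this section follow the syntax of older Stata releases. Assuming a release from Stata 8 onward, where the graph commands were redesigned, the closest equivalents would be roughly as sketched below. The example uses the auto dataset that ships with Stata; the variable choices are illustrative only.

```stata
sysuse auto, clear

* Histogram with 20 bars (counterpart of graph vn1, bin(20))
histogram mpg, bin(20)

* Scatterplot: first variable on the y-axis, second on the x-axis
scatter price mpg

* Pie graph of the distribution of one categorical variable
graph pie, over(rep78)
```

With graph pie, over() a single categorical variable is enough, so separate 0/1 variables for each category do not have to be created first.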

Generating variables:

(1) gen newname=1 if vn1==y
(2) replace newname=0 if vn1==z

Use this formula to create new variables within Stata. The if vn1==y/z extension is just an example of how the desired conditions would be attached to the command. Any condition at all can be specified. To create a variable "lad" (where LAD arteries equal 1 and non-LAD arteries equal 0) from a variable "artery" (where 1 equals LAD, 2 equals LCx, and 3 equals RCA), the operator would express the command as follows:

	gen lad=1 if artery==1
	replace lad=0 if artery==2 | artery==3

It is important to always define an upper limit or use exclusion criteria for "." values when creating variables using the > extension, so that Stata does not assign values to cells with "." entries. For example, to create a variable age40 (where 1 equals patients older than 40 and 0 equals patients 40 and younger) from the variable age (which is a continuous variable for the patient age), the operator would express the command as follows:

	gen age40=1 if age>40 & age~=.
	replace age40=0 if age<=40

or as

	gen age40=1 if age>40 & age<1000
	replace age40=0 if age<=40

The first of the methods above uses exclusion criteria (essentially commanding Stata to include only cells where the "age" variable does not equal "."), while the second method uses a defined upper limit. Although both methods are equally effective if expressed correctly, the author of these guidelines generally prefers the first method since it is more direct and does not require that the operator know the upper limit of the real data in the dataset.

label var vn "Description of vn"	This command is used to label variables in the dataset. Stata allows the variable label to be up to 80 characters long. For example, to label our variable called age40, you would type label var age40 "Age >40? (1=yes, 0=no)"
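The effect of the missing-value exclusion when generating a variable can be seen on the auto dataset that ships with Stata, where the variable rep78 has some empty cells. The new variable name highrep and the cut-off of 3 are invented for this sketch.

```stata
sysuse auto, clear

* 1 if the repair record is above 3, excluding cells with missing values
gen highrep=1 if rep78>3 & rep78~=.

* 0 for the remaining non-missing cells
replace highrep=0 if rep78<=3

label var highrep "Repair record >3? (1=yes, 0=no)"

* The cars with no repair record stay missing instead of being coded as 1
tab highrep, missing
count if highrep==.
```

Leaving out the & rep78~=. condition would have coded the cars with no repair record as 1, which is exactly the error the exclusion criteria are meant to prevent.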

Merging Data:

Often you will have multiple datasets on the same group of patients containing different variables. For example, one dataset may contain patient demographic data (e.g., age, gender, race) and a second dataset contains data on patient clinical outcomes (e.g., recurrent MI, death) while a third dataset contains data on what drugs each patient is taking (e.g., aspirin, beta-blockers, thrombolytics). In order to do an analysis on the association of aspirin with recurrent MI, correcting for patient age, you would have to merge these three datasets into one. To merge two datasets in Stata:

1.	Sort the data in dataset 1 by the patient identifier and save the file.
	use "c:\demographics.dta"
	sort id
	save "c:\demographics.dta", replace

2.	Sort the data in dataset 2 by the patient identifier and save the file.
	use "c:\outcomes.dta"
	sort id
	save "c:\outcomes.dta", replace

3.	Merge the two files.
	use "c:\demographics.dta"
	merge id using "c:\outcomes.dta"

The file which is open during the merge is called the master file (in this case, c:\demographics.dta), and the file merged into the master file is called the using file (c:\outcomes.dta).

Merging creates a new categorical variable at the end of each row named _merge. The _merge variable indicates if the patients were in both of the datasets or if the patients were in just one dataset but not the other.

There are three possible outcomes for the _merge variable:
	_merge=3	The patient was in both datasets
	_merge=2	The patient was in the using dataset but not the master dataset
	_merge=1	The patient was in the master dataset but not the using dataset

Thus, if you only want to look at patients who were in both the demographic dataset and the outcomes dataset, you can delete all patients whose _merge does not equal 3 (e.g., drop if _merge==1 | _merge==2).

To merge another file into the combined demographics and outcomes dataset:

1.	Drop the merge variable.
	drop _merge

2.	Sort and save the unified dataset you have just created.
	sort id
	save "c:\unified.dta"

3.	Sort the data in dataset 3 by the patient identifier and save the file.
	use "c:\meds.dta"
	sort id
	save "c:\meds.dta", replace

4.	Merge the two files.
	use "c:\unified.dta"
	merge id using "c:\meds.dta"

Again, the _merge variable will be created, indicating whether each patient appeared in both the unified dataset and the meds dataset or in only one of them.
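The merge steps can be rehearsed end to end on two tiny made-up datasets. Everything in this sketch is hypothetical: the file names demographics_demo.dta and outcomes_demo.dta, the patient ids, and the values. It also assumes Stata 11 or later, where the merge command is written with an explicit match type (merge 1:1) in place of the older merge id using form shown above.

```stata
* Dataset 1 (master): demographics for patients 1, 2, and 3
clear
input id age
1 54
2 61
3 47
end
sort id
save "demographics_demo.dta", replace

* Dataset 2 (using): outcomes for patients 2, 3, and 4
clear
input id death
2 0
3 1
4 0
end
sort id
save "outcomes_demo.dta", replace

* Merge on the patient identifier and inspect the _merge variable
use "demographics_demo.dta", clear
merge 1:1 id using "outcomes_demo.dta"
tab _merge

* Keep only patients found in both datasets, then drop _merge before any further merge
drop if _merge==1 | _merge==2
drop _merge
list
```

Patient 1 appears only in the master file and patient 4 only in the using file, so after the drop only patients 2 and 3 remain.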