The following two problems will require a lot of calculations in STATA and will generate many pages of output. Here is how you should organize it. The first pages should contain your answers to all the questions, along with any key algebraic equations or explanations you need along the way. After that, include a printout of the output from the regressions you executed in support of your answers. Highlight any numbers in this output that you used in the first section. (You are encouraged to save paper here: you may print this section with a small font, double-sided, and/or in 2-up format.) Last, include a copy of the DO file that contains the commands you asked STATA to execute. Be sure you organize these in a way that will be clear to the reader.
In the dataset Smoker, there is information on 1196 males from the United States. Data from this sample includes the variables:
smoke= 1 for smokers, and 0 for nonsmokers
age=age in years
educ= number of years of schooling
income= family income
pcigs= an index of the average price of cigarettes in the individual’s state
a) Create a dummy variable “hi_ed” that equals 1 if a person has ≥17 years of education and 0 otherwise.
b) Estimate a linear regression (which in this context is called a linear probability model (LPM)) for the binary variable smoke on the independent variable hi_ed. Report the beta coefficient on the dummy variable and its p-value. In words, express what the beta coefficient means in this case.
c) Create a frequency table for the smoke and hi_ed variables. The command in STATA is: tabulate smoke hi_ed.
d) Calculate the probability that a person smokes given high education. Calculate the probability of smoking given low education.
e) What is the relationship between the results for parts b and d?
f) Calculate the odds that a low education person smokes. Calculate the odds a high education person smokes. Calculate the odds ratio.
g) Estimate the logistic regression (logistic command) for smoke on hi_ed. Confirm that the reported coefficient equals the odds ratio from part f. In words, express what the odds ratio coefficient means in this case.
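The steps in parts a through g can be sketched in a DO file roughly as follows. This is only a sketch under stated assumptions: it assumes the data are saved as Smoker.dta in the working directory, that the cutoff for hi_ed is 17 or more years, and the scalar and matrix names (freq, p_low, p_high, odds_low, odds_high) are illustrative choices rather than required names.

```stata
* Assumption: Smoker.dta is in the current working directory
use Smoker, clear

* (a) Dummy for high education; assumes the cutoff is 17 or more years.
*     Left missing wherever educ is missing.
generate hi_ed = (educ >= 17) if !missing(educ)

* (b) Linear probability model
regress smoke hi_ed

* (c) Frequency table; matcell() stores the cell counts in matrix freq
*     (rows: smoke = 0, 1; columns: hi_ed = 0, 1)
tabulate smoke hi_ed, matcell(freq)

* (d) Conditional probabilities of smoking, computed from the table
scalar p_low  = freq[2,1] / (freq[1,1] + freq[2,1])
scalar p_high = freq[2,2] / (freq[1,2] + freq[2,2])
display "P(smoke | low ed)  = " p_low
display "P(smoke | high ed) = " p_high

* (f) Odds for each group and the odds ratio (high relative to low)
scalar odds_low  = freq[2,1] / freq[1,1]
scalar odds_high = freq[2,2] / freq[1,2]
display "odds (low ed)  = " odds_low
display "odds (high ed) = " odds_high
display "odds ratio     = " odds_high / odds_low

* (g) logistic reports exponentiated coefficients, i.e. odds ratios
logistic smoke hi_ed
```

Computing the probabilities and odds from the stored cell counts, rather than retyping numbers from the table, keeps the DO file reproducible if the cutoff for hi_ed is changed.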
a) Estimate the logistic regression for smoke on pcigs when hi_ed=0. Then, calculate the predicted values for Y. The command to do this is “predict yhat_0.” Repeat for when hi_ed=1, and also create a predicted value variable yhat_1.
b) Compare the coefficients for the two models. In words, explain what the model is saying about the impact of pcigs for the two different education groups.
c) Create a graph of the predicted values for the two versions. Use the following syntax: twoway (line yhat_0 pcigs, sort) (line yhat_1 pcigs, sort), legend(label(1 "low ed") label(2 "hi ed"))
d) For a low education person, a 1 unit change of the price index of cigarettes will change the odds of smoking by how much?
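A minimal DO-file sketch of parts a through c follows. It assumes hi_ed has already been created as in the previous question. Because predict is not restricted to the estimation sample here, each fitted curve is evaluated at every observation's pcigs, which is what allows both lines to be drawn over the full price range.

```stata
* Low-education subsample: logistic reports odds ratios
logistic smoke pcigs if hi_ed == 0
predict yhat_0          // predicted Pr(smoke = 1) from the low-education fit

* High-education subsample
logistic smoke pcigs if hi_ed == 1
predict yhat_1          // predicted Pr(smoke = 1) from the high-education fit

* Plot both sets of predicted probabilities against the price index.
* Note the straight double quotes: curly quotes will not parse in STATA.
twoway (line yhat_0 pcigs, sort) (line yhat_1 pcigs, sort), ///
    legend(label(1 "low ed") label(2 "hi ed"))
```

The `//` comments and the `///` line continuation work in a DO file but not when typed interactively in the Command window.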
a) Run a logit regression (STATA command logit) for smoke on all the independent variables: age, educ, pcigs, income. Report the coefficients and p-values. This is Model 1. Remember, for logit (rather than logistic), the coefficients have not been exponentiated.
b) Create an interaction variable of educ and income. Run another logit regression, adding this interaction to Model 1. Report the coefficients and p-values. This is Model 2.
c) What differences strike you about Model 1 and Model 2? In particular, note how the signs and significance levels of educ and income have changed now that the interaction is included. In a clearly articulated paragraph (or two), give a thoughtful answer as to what you think must be going on. (This is not easy. Take your time and think hard about it. Your answer should contain two parts. First, talk about what the coefficients of the Model 2 regression are implying. Second, try to come up with an intuitive/economic hypothesis for why we are observing these results.)
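One possible way to set up Models 1 and 2 and view them side by side is sketched below. The interaction variable name educ_income and the stored-estimate names model1 and model2 are hypothetical, chosen only for illustration.

```stata
* Model 1: logit reports coefficients on the log-odds scale
logit smoke age educ pcigs income
estimates store model1

* Model 2: add the educ x income interaction
* (educ_income is a hypothetical variable name)
generate educ_income = educ * income
logit smoke age educ pcigs income educ_income
estimates store model2

* Coefficients and p-values for both models in one table
estimates table model1 model2, b(%9.4f) p(%9.3f)

* In Model 2 the change in log-odds from one more year of schooling is
*   _b[educ] + _b[educ_income] * income
* so the coefficient on educ alone describes a person with income = 0.
```

The final comment is a reminder when comparing the two models: once the interaction is included, the coefficients on educ and income are no longer unconditional effects, so a sign change between models does not by itself indicate a contradiction.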
The file Divorce.dta contains a four-period panel data set including various socioeconomic factors for all 50 US states plus the District of Columbia. The panels were taken in 1965, 1975, 1985, 1995. The variables we’re particularly interested in are:
divorce: number of divorces and annulments per state per one thousand population.
birth: number of live births per state per one thousand population.
marriage: number of marriages per state per one thousand population.
unemploy: total unemployment rate as a percentage of the total work force.
crime: total # of criminal offenses known to police per one hundred thousand population.
AFDC: average monthly AFDC payments per family. (AFDC is commonly known as Welfare, an income support payment by the government to certain low income groups, particularly unmarried women with children.)
a) Tell STATA this is a panel data set, with panels defined by entities “state” across periods “time.” The command for this is: xtset state time.
b) Divorce will be our dependent variable throughout. Hypothesize the sign of the coefficients (or zero) that each of the independent variables in the dataset will have on divorce rates. Briefly provide an explanation of your thinking.
c) Run a regression of divorce on birth, marriage, unemploy, crime, and AFDC using the entire pooled dataset. I’ll call this the “naïve” regression. Report the coefficients and their p-values. Which, if any, of the variables have a sign different from what you hypothesized?
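The setup and the pooled regression can be sketched as below. The sketch assumes Divorce.dta is in the working directory, that state is stored as a numeric identifier (xtset does not accept a string panel variable), and that the variables are named in lowercase as unemploy and afdc; STATA is case-sensitive, so adjust the names to match the file.

```stata
* Assumption: Divorce.dta is in the current working directory
use Divorce, clear

* Declare the panel structure: entities are states, periods are time
xtset state time

* Pooled ("naive") OLS regression, ignoring the panel structure
* (variable names unemploy and afdc are assumed; check with describe)
regress divorce birth marriage unemploy crime afdc
```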
a) Run a fixed effects regression using dummy variables for each state. Use the same variables as part 1c. The command for this is: regress variables i.state. Report your coefficients and p-values for all variables EXCEPT the state dummies. Are there any substantial changes in the fixed effects regression vs. the “naïve” pooled regression?
b) All else equal, what is the difference in the divorce rate for Alabama vs. Oregon?
c) If AFDC payments increase by 200, what is the predicted increase in the divorce rate?
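A sketch of the dummy-variable regression, with lincom used for parts b and c, might look like the following. The numeric state codes for Alabama and Oregon are hypothetical placeholders (1 and 38 are used only so the code runs); look up the actual codes in the dataset, for example with tabulate state, before relying on the output. The lowercase variable names unemploy and afdc are likewise assumed.

```stata
* State fixed effects via one dummy per state
regress divorce birth marriage unemploy crime afdc i.state

* HYPOTHETICAL state codes: replace with the actual values in Divorce.dta
local al_code = 1
local or_code = 38

* (b) Alabama minus Oregon, holding the other regressors fixed
lincom `al_code'.state - `or_code'.state

* (c) Predicted change in the divorce rate for a 200-unit rise in AFDC
lincom 200*afdc
```

Using lincom rather than subtracting coefficients by hand also reports a standard error and confidence interval for each quantity.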
a) Use the built-in fixed effects regression command: xtreg variables, fe. Note that the coefficients resulting from this are identical to part 2a. (Don’t forget the fe part or you won’t get a fixed effects regression.)
b) Add to the regression of 3a a series of dummy variables to account for the time period. That is, estimate xtreg variables i.time, fe. This essentially adds fixed-effects terms for time, in addition to the state-specific fixed effects. Report your coefficients and p-values.
c) From the results of 3b, how much higher is the divorce rate in 1995 compared to 1965?
d) Once we have controlled for time, how has the effect of AFDC payments on divorce rates changed compared to in 2a? Write a brief paragraph where you hypothesize why this change has happened.
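The two xtreg specifications can be sketched as follows. The sketch assumes xtset state time has already been run, that time takes integer values, and that the first and last values of time correspond to 1965 and 1995; the stored-estimate names fe_state and fe_state_time are illustrative.

```stata
* (a) State fixed effects only
xtreg divorce birth marriage unemploy crime afdc, fe
estimates store fe_state

* (b) State fixed effects plus time-period dummies
xtreg divorce birth marriage unemploy crime afdc i.time, fe
estimates store fe_state_time

* (c) Last period relative to first period
*     (assumes these are 1995 and 1965, whatever their numeric coding)
summarize time, meanonly
local t_first = r(min)
local t_last  = r(max)
lincom `t_last'.time - `t_first'.time

* (d) AFDC coefficient and p-value with and without the time dummies
estimates table fe_state fe_state_time, keep(afdc) b(%9.4f) p(%9.3f)
```

Reading the first and last periods from the data avoids hard-coding whether time is stored as calendar years or as period numbers.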