The dataset contains three tables. You are required to model them and answer the following questions:

1. Open the attached document past_events.csv. In this document, Opponent and Day of Week are two attributes of an event, and value is an index of how well this event performed on a collection of different measurements of ticket demand. We are interested in knowing what effect opponent and day of week have on value. What could we do to learn about these effects and estimate the magnitude of each one? What are the advantages and disadvantages of different approaches? What challenges make this problem difficult? How might you get around those challenges?

2. The values are constrained to always be between 0 and 1. Knowing this, does it change the way you would estimate the effects? Why or why not?

3. Open the attached document future_events.csv. Given the two datasets, how could we predict or estimate the event values for the future events?

4. Would your method of prediction restrict values to be between 0 and 1? If not, what would be the best way to ensure this?

5. Notice that one of the opponents in the future events dataset is not in the past events dataset. Would your solution come up with a value for the New Orleans game? If not, what are some ways we could handle this?

6. Now look at the team_quality.csv dataset. Would these groupings of teams into high / medium / low affect the approach you take for predicting the value of the New Orleans game? What about the rest of the games?

7. How good is your process of estimating the effects? In what ways could you evaluate how good your predictions are?

8. How would the problem and your answer change if we had 2000 more observations in past_events?
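
To make questions 2, 4, and 5 concrete, here is a minimal sketch of one possible starting point, not the expected answer. It assumes the effects are estimated on the logit scale as simple per-level means minus the grand mean, which only approximates true additive effects when the design is roughly balanced. The function names and the toy rows (team names, day labels, values) are made up for illustration and do not come from past_events.csv.

```python
import math
from collections import defaultdict


def logit(p, eps=1e-6):
    # Clip away from 0 and 1 so the log stays finite.
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def inv_logit(z):
    # Maps any real number back into the open interval (0, 1).
    return 1 / (1 + math.exp(-z))


def fit_marginal_effects(rows):
    """rows: (opponent, day_of_week, value) tuples with value in [0, 1]."""
    rows = list(rows)
    z = [logit(value) for _, _, value in rows]
    grand = sum(z) / len(z)

    def level_effects(idx):
        # Mean logit value per level, expressed as a deviation from the grand mean.
        groups = defaultdict(list)
        for row, zi in zip(rows, z):
            groups[row[idx]].append(zi)
        return {k: sum(v) / len(v) - grand for k, v in groups.items()}

    return grand, level_effects(0), level_effects(1)


def predict(model, opponent, day):
    grand, opp_eff, day_eff = model
    # Assumption: a level never seen in training gets an effect of 0,
    # i.e. it falls back to the grand mean on the logit scale.
    z = grand + opp_eff.get(opponent, 0.0) + day_eff.get(day, 0.0)
    return inv_logit(z)


# Hypothetical toy data, for illustration only.
toy_rows = [
    ("Team A", "Sat", 0.80),
    ("Team A", "Tue", 0.55),
    ("Team B", "Sat", 0.60),
    ("Team B", "Tue", 0.30),
]
model = fit_marginal_effects(toy_rows)
print(predict(model, "Team A", "Sat"))
print(predict(model, "Unseen Team", "Sat"))
```

Because predictions are formed on the logit scale and then passed through inv_logit, they always land strictly between 0 and 1, and an opponent missing from the past data still receives a value. Whether a zero effect is a sensible fallback for an unseen opponent is a modeling choice worth questioning.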

The data you’ll get:

Lean Data Scientist

Statistics Dataset
