Homework 6, ex1 and Notebook 6


by MICHELE RISPOLI -

I am currently attempting exercise 1 of homework 6 and have several questions about it and about the first part of notebook 6.
Please excuse me if some of these turn out to be trivial.

1. We're assuming that the values we're trying to fit are the numbers of deaths for each population subgroup, which we model as Poisson-distributed, so the dataset contains 35 observations in total. Is this assumption correct? I'm asking because, if so, the dataset consists of a single observation per subgroup, which I fear may complicate the train-test split.

2. I'm not sure how to treat the categorical covariates when expressing their relation to the Poisson mean: I assume that simply using their integer representation in the linear combination inside the exponent would be a bad idea, since it would introduce an ordering that is not necessarily implied by the corresponding label (even if such an ordering is perhaps still reasonable for the age buckets). Is this a legitimate concern?

3. Assuming the answer to the previous question is "yes", I'd use the following representation (see the sketch after this list):
    - The "age" buckets use a one-hot representation, with each component getting its own weight in the sum (9 buckets, therefore 9 separate weights).
    - The "smoking habits" categories use two binary variables (i.e. "smokes cigarettes" and "smokes cigars and pipes"), thus mapping to two terms in the linear combination with separate weights.
    Would such a representation be correct, or am I over-complicating the matter?
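For concreteness, here is a minimal sketch of the encoding I have in mind; the variable names and example values are mine, not taken from the homework:

```python
import torch

# Hypothetical example: 4 subgroups, age-bucket indices in 0..8,
# plus two binary smoking indicators.
age = torch.tensor([0, 3, 5, 8])             # age-bucket index per subgroup
smokes_cig = torch.tensor([1., 0., 1., 0.])  # "smokes cigarettes"
smokes_pipe = torch.tensor([0., 1., 1., 0.]) # "smokes cigars and pipes"

# One-hot encode the age buckets -> (N, 9), then stack everything into
# an (N, 11) design matrix with one weight per column.
age_onehot = torch.nn.functional.one_hot(age, num_classes=9).float()
X = torch.cat([age_onehot,
               smokes_cig.unsqueeze(1),
               smokes_pipe.unsqueeze(1)], dim=1)

# The Poisson mean would then be exp(X @ w + b).
```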

4. Should we introduce additional Gaussian noise on top of the Poisson? If not, why not?

5. I'd assume a standard normal prior for all the weights and the bias, in both the model and the guide, since I see no particular reason to prefer other distributions. Am I missing something there?
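In code, this is roughly what I have in mind; note this is my own sketch rather than the notebook's implementation, and the parameter names (`w_loc`, `w_scale`, `b_loc`, `b_scale`) simply mirror the hyperparameters mentioned in question 7:

```python
import pyro
import pyro.distributions as dist
import torch

def model(X, y=None):
    # Standard normal priors on every weight and on the bias.
    w = pyro.sample("w", dist.Normal(torch.zeros(X.shape[1]), 1.0).to_event(1))
    b = pyro.sample("b", dist.Normal(0.0, 1.0))
    rate = torch.exp(X @ w + b)  # Poisson mean via the log link
    with pyro.plate("data", X.shape[0]):
        pyro.sample("obs", dist.Poisson(rate), obs=y)

def guide(X, y=None):
    # Mean-field normal guide with learnable location/scale parameters.
    w_loc = pyro.param("w_loc", torch.rand(X.shape[1]))
    w_scale = pyro.param("w_scale", torch.rand(X.shape[1]),
                         constraint=dist.constraints.positive)
    b_loc = pyro.param("b_loc", torch.rand(()))
    b_scale = pyro.param("b_scale", torch.rand(()),
                         constraint=dist.constraints.positive)
    pyro.sample("w", dist.Normal(w_loc, w_scale).to_event(1))
    pyro.sample("b", dist.Normal(b_loc, b_scale))
```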

6. Related to the previous question: is there any particular reason why, in notebook 6, a LogNormal was chosen for the bias term (other than making the posterior analytically intractable)? And why would one choose a different distribution for the weights in the guide? And why the (multivariate, I suppose?) Gamma in that specific case?

7. Still about notebook 6: why were the hyper-hyper-parameters `w_loc`, `w_scale`, `b_loc` and `b_scale` initialised using `torch.rand()` instead of by sampling from a uniform distribution on (0, 1)?
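For reference, these are the two initialisations I'm contrasting (with `n` as a placeholder dimension); as far as I can tell `torch.rand()` itself already draws from Uniform[0, 1), so I may be missing the intended distinction:

```python
import torch
import torch.distributions as dist

n = 11  # placeholder number of parameters

w_loc = torch.rand(n)                            # as in the notebook
w_loc_alt = dist.Uniform(0.0, 1.0).sample((n,))  # explicit Uniform(0, 1) draw
```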

Excuse me for the torrent of questions; I would much appreciate your guidance on these doubts.