SM35581SM2019: Homework 2 - Exercise 2 (ecdf implementation)

For convenience, here is the text of exercise 2 of homework 2:

---

Implement the empirical cumulative distribution function F_X(x) = cdf(dist, x), taking as inputs a pyro.distributions object dist, corresponding to the distribution of X, and an integer value x.

Suppose that X ∼ N(0,1) and plot F_X(x).

(sorry for the messy format)

---

Now here are my doubts:

1. Shouldn't the ecdf be derived from a sample drawn from a distribution? Why would we feed the distribution object to the function instead of a vector of samples from said `dist` object?

2. If the `x` input parameter was intended to be the argument of the F_X(x) function, why would it be an integer? Shouldn't it be possible to choose it anywhere in the distribution's support, which is more likely a set of real numbers? This is especially true for the case asked for in the second part of the exercise, X∼N(0,1)!

Thanks in advance,

Michele

Re: Homework 2 - Exercise 2 (ecdf implementation)

by GINEVRA CARBONE - Friday, 3 April 2020, 10:10

Hi Michele,

The request for an integer input was my mistake; it was intended to be any real number. I updated the notebook soon after noticing the error (https://github.com/ginevracoal/statistical-machine-learning/blob/master/homeworks/homework_02.ipynb).

The ecdf should compute the estimates from a vector of samples from the input distribution and, additionally, it should be able to return the estimated value corresponding to a specific input x. I hope the request is clearer now.
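The request described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the official solution: the `sampler` argument is a hypothetical stand-in for drawing one value from `dist` (with a Pyro distribution the draws would presumably come from its `sample` method), and `n_samples` is an assumed parameter that the exercise text does not mention.

```python
import random
from typing import Callable


def ecdf(sampler: Callable[[], float], x: float, n_samples: int = 1000) -> float:
    """Estimate F_X(x) as the fraction of n_samples draws that are <= x."""
    # The sample is drawn inside the function; `sampler` is a hypothetical
    # stand-in for drawing one value from the input distribution.
    samples = [sampler() for _ in range(n_samples)]
    return sum(s <= x for s in samples) / n_samples


# X ~ N(0, 1): the estimate at x = 0 should be close to 0.5.
estimate = ecdf(lambda: random.gauss(0.0, 1.0), x=0.0, n_samples=10_000)
print(f"Estimated F_X(0) = {estimate:.3f}")
```

Since the samples are random, the printed value changes slightly from run to run unless a seed is fixed.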

Ginevra

Re: Homework 2 - Exercise 2 (ecdf implementation)

by MICHELE RISPOLI - Saturday, 4 April 2020, 13:20

Hello Ginevra,

Thanks for the answer, it cleared up my doubts.

I reckon I should have asked the question earlier, since by the time I wrote the post I had already submitted my solutions.

Since I had these doubts, in my submitted solution I explained and implemented my best guess at what the exercise asked for: ecdf takes a vector of samples drawn from a pyro.distributions object and an arbitrary value x. I believed the choice of the sampling parameters (i.e. sample size and random seed) was better left to the function caller, and that a non-deterministic function was undesirable.
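The design described in this paragraph, where the caller draws the sample and the function itself is deterministic, might look like the sketch below. It is an illustration rather than the submitted solution: the standard-library `random.Random` generator stands in for sampling from a Pyro distribution, and the sample size, seed, and grid of x values are arbitrary choices.

```python
import random
from typing import Sequence


def ecdf(samples: Sequence[float], x: float) -> float:
    """Fraction of the given samples that are <= x (deterministic in its inputs)."""
    return sum(s <= x for s in samples) / len(samples)


# The caller decides the sample size and the seed, not the function.
rng = random.Random(42)
samples = [rng.gauss(0.0, 1.0) for _ in range(5_000)]

# Evaluating on a grid of real-valued x gives the points a plot of F_X would use.
grid = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
values = [ecdf(samples, x) for x in grid]
for x, v in zip(grid, values):
    print(f"F_X({x:+.1f}) = {v:.3f}")
```

Because `ecdf` never samples, calling it twice with the same `samples` and `x` always returns the same value.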

If I understood your answer correctly, though, we were supposed to draw the sample inside the function.

I understand this was a toy problem, so I probably did not need to worry about the sampling parameters mentioned above. Since I brought it up, though: what is the common or best approach to random seeding inside functions in a realistic scenario?

Re: Homework 2 - Exercise 2 (ecdf implementation)

by GINEVRA CARBONE - Monday, 6 April 2020, 08:29

Yes, this was just a simple example, and in general there is no need to worry about the style of your implementation; I'm just interested in the conceptual part of it.

I don't think there is a "best approach" for seeding. Any choice is correct as long as your results on that function call are reproducible.
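One way to meet the reproducibility requirement mentioned above, among several equally valid ones, is to let the function accept an optional seed and use its own generator. The sketch below uses only the standard library; the function name `sampled_ecdf` and its `seed` parameter are hypothetical, and with Pyro the analogous step would presumably be a call to `pyro.set_rng_seed` before sampling.

```python
import random
from typing import Optional


def sampled_ecdf(x: float, n_samples: int = 1000, seed: Optional[int] = None) -> float:
    """Estimate F_X(x) for X ~ N(0, 1) from samples drawn inside the function."""
    # A private generator avoids touching the global random state;
    # seed=None falls back to a non-reproducible, OS-provided seed.
    rng = random.Random(seed)
    samples = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    return sum(s <= x for s in samples) / n_samples


# The same seed yields the same samples, hence the same estimate.
first = sampled_ecdf(0.5, seed=123)
second = sampled_ecdf(0.5, seed=123)
print(first == second)  # True
```

Seeding in the caller and passing in the samples works just as well; what matters is that the seed is recorded somewhere, so the call can be repeated.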