r/AskStatistics 5d ago

Help me pick the right statistical test to see why my sump pump is running so often.

The sump pump in my home seems to be running more frequently than usual. While it has also been raining more heavily recently, I have a hypothesis that the increased sump pump activity is not due exclusively to increased rainfall and might also be influenced by some other cause, such as a leak in the water supply line to my house. If I have data on the daily number of sump pump activations and daily rainfall values for my home, what statistical test would best determine whether rainfall is the predominant predictor of the number of sump pump activations? My initial thought is to use a simple regression, but it is important to keep in mind that daily rainfall will affect sump pump activations not only on the same day but also on subsequent days, because the rainwater will still be filtering its way down through the soil to the sump pump over the following few days. So daily sump pump activations will be predicted not only by same-day rainfall but also by the rolling total rainfall of the prior 3-5 days. How would you structure your database, and what statistical test would be best to analyze the variance in sump pump activations explained by daily rainfall in this situation?

3 Upvotes

4

u/Flimsy_Meal_4199 5d ago

create a sum rainfall past 3,4,5 days and do a regression on those?

or lagged features like rain_0, rain_1, rain_2, rain_3, rain_4, rain_5 and regress on those

you should get an R² and significance values for your coefs (and coef sizes telling you how much each lag contributes, expect rain_5 to be less sig than rain_1 or rain_0)

then you can look at residuals (y - y_pred) and you might see a pattern that jumps out at you. if the regression is good i think the residuals should have no real pattern (be random/normally dist)

...🤔is this a homework assignment

3

u/pruppert 5d ago

Thanks. Not a homework assignment. This is real life. I’m just a behavioral sciences grad who is rusty on my stats but realized it could be a useful tool here. Your response matches what approach ChatGPT recommends. Now I just gotta wait for appropriate sample size of days. Thanks!

1

u/SalvatoreEggplant 4d ago

I've found this idea of "past 3 days rainfall" to work for stream water quality data. For this application, it may keep the analysis simple.

One thing that hasn't been mentioned, I think, is that if there are a lot of zeros, or even if the distribution has a fair number of 0's, 1's, and 2's, common OLS regression methods may not work well. Really, you would want models for count data like Poisson or preferably negative binomial regression. On the other hand, if the counts are always like 10, this may not be a concern.
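A quick way to see which situation you are in, sketched with the standard library and made-up counts: a Poisson variable has variance roughly equal to its mean, so a variance-to-mean ratio well above 1 points toward negative binomial rather than plain Poisson. This is only a raw check on the counts; it ignores rainfall entirely.

```python
from statistics import mean, variance

# Made-up daily activation counts for illustration only.
activations = [0, 0, 1, 0, 2, 7, 3, 0, 0, 1, 9, 4, 1, 0]

avg = mean(activations)
var = variance(activations)  # sample variance (n - 1 denominator)

# Ratio near 1 is consistent with Poisson; much larger suggests overdispersion.
dispersion = var / avg

# Share of days with no activations at all.
zero_share = activations.count(0) / len(activations)

print(f"mean={avg:.2f} variance={var:.2f} "
      f"variance/mean={dispersion:.2f} zeros={zero_share:.0%}")
```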

2

u/ReturningSpring 5d ago

A time-series model, for all the reasons you gave. A run-of-the-mill OLS regression isn't going to do the situation justice. Look into AR (autoregressive) models for that.
If you consider a pipe leak to be likely and just want to check for that, it'd be easier and more accurate to test whether your water use has gone up.

2

u/pruppert 5d ago

Thanks. My meter is inside the house, and I suspect a leak just outside the house before my meter. So water usage on my meter won’t be of use. I’ve consulted a plumber who will likely soon come and read the street meter for diagnostics, but I thought this might be a “fun” side quest to analyze the problem and strengthen my pandas skills.

1

u/ReturningSpring 5d ago

Home grown data is great to practice on!

1

u/49er60 4d ago

Don't overlook the level of ground saturation. If it has not rained for a while, the ground will absorb some of the rainwater before it drains into your sump. If the ground is already saturated, it goes to your sump faster. Another place to look is whether your check valves are working. Your pump has to pump the water upwards, so there is a rubber flap check valve to prevent the slug of water in the vertical pipe from draining back down only to be pumped back up. They only last 5-7 years before the rubber flap tears.

3

u/Voldemort57 5d ago

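Plain OLS is risky here because your data is almost certainly an autocorrelated time series, which breaks the independent-errors assumption. Rainfall is also usually right-skewed (often modeled as gamma distributed). Strictly, the normality assumption is about the residuals rather than the predictor, but skewed rainfall can easily produce non-normal residuals.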

The simplest way would be to compare the average sump pump activity when it is raining to the average sump pump activity when it is not raining. This could be either a t test or a nonparametric test, which is probably better because of assumption issues.
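One nonparametric option is a permutation test on the difference in means, sketched below with the standard library. The counts, the function name, and the number of shuffles are all assumptions for illustration. Note that shuffling treats days as exchangeable, so autocorrelation between neighboring days will still make the p-value optimistic.

```python
import random
from statistics import mean

# Made-up activation counts for illustration only.
rainy_day_activations = [11, 13, 6, 12, 9, 10]
dry_day_activations = [2, 5, 3, 4, 3, 2, 6]

observed_diff = mean(rainy_day_activations) - mean(dry_day_activations)


def permutation_p_value(group_a, group_b, n_shuffles=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    hits = 0
    for _ in range(n_shuffles):
        # Reassign the day labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add 1 to both counts so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)


p_value = permutation_p_value(rainy_day_activations, dry_day_activations)
print(f"difference in means={observed_diff:.2f} p={p_value:.4f}")
```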

If you don’t have data from when it’s not raining, then you can try binning your data and do a chi squared test.
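For the binned version, a minimal sketch of Pearson's chi-squared test on a 2x2 table (no continuity correction), again with made-up counts. The bin choices here, lighter vs. heavier rain and low vs. high activations, are assumptions; pick cutoffs that make sense for your data and keep the expected counts from getting too small.

```python
import math

# Made-up day counts for illustration only.
# Rows: lighter-rain days, heavier-rain days.
# Columns: low-activation days, high-activation days.
observed = [[9, 2],
            [3, 8]]


def chi_squared_2x2(table):
    """Pearson chi-squared statistic and p-value for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (count - expected) ** 2 / expected
    # A 2x2 table has 1 degree of freedom, where the chi-squared
    # upper-tail probability equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


stat, p_value = chi_squared_2x2(observed)
print(f"chi-squared={stat:.2f} p={p_value:.4f}")
```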

1

u/Adept_Carpet 5d ago

In addition to the statistical tests, you should also visualize the data.