Context

I always drive the same car, and I almost always take the same route. At the gas station, however, I like to switch between two gas types: SP98 (sometimes sold as "Super plus") and SP95 E10 ("super" gas with 10% alcohol). E10 sells for 1,38 € per liter and SP98 for 1,46 € per liter. My gut feeling is that my car consumes a lot more with E10. What can we derive from the data? I challenge you to partial out the factor "E10 gas" and tell me how much more my car really consumes with it. I ran my own basic linear regression and found that the car consumes 0.4 liters more per 100 km with E10 gas. Linear regression has the drawback that its coefficients are only really trustworthy when the features are independent of each other. **I challenge you to predict the consumption depending on the gas type!**

Content

For a few months now, I have been writing down the data from my car's display after each ride, while regularly changing the gas type. In the file you will find the displayed distance (km), the consumption (L/100km), the average speed (km/h), the temperature inside (°C), the temperature outside (°C), anything special that happened, whether it was raining, whether the air conditioning was on, whether it was sunny enough that the car felt warm when I started it, and, yes, the gas type I was using. There are also two columns recording how much gas I bought and of which type. Be careful with those: the numbers don't add up exactly, because I only note rides that occur under certain conditions. If the car had not cooled down enough to give a measurement independent of the previous one, I don't note the ride. I started writing down the data in November, changed to SP98 in winter, and went back to E10 in spring. Apart from that, the data is rather clean, as I had already used it for my own project.

Acknowledgements

Thanks to Victor Chernozhukov, who planted this idea in my head, even if it took some years until I finally acted on it.
🙂

Inspiration

I used a linear regression to partial out the influence of the gas type. The gas type is truly independent of the other variables, so this should be possible without problems. However, depending on how I engineer the other features, the estimated influence ranges from 0.4 to 0.8 liters per 100 km. Such a large swing in a single coefficient, driven only by the choice of features, is usually a hint of strong covariance between the features, which in turn means that linear regression might not be the best tool here.
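To make "partialling out" concrete, here is a minimal sketch in Python. It does not read the real file: all data below is synthetic, and the variable names (`speed`, `temp_outside`, `is_e10`, `consume`), the coefficients, and the assumed true effect of 0.4 L/100km are made up for illustration. The sketch regresses both the consumption and the gas-type indicator on the control variables and then regresses residual on residual (the Frisch-Waugh-Lovell construction), which gives the same gas-type coefficient as one full linear regression. It also computes the naive difference in group means, which is distorted when the gas type is used in different weather.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y ~ X (X must already contain an intercept column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def partial_out_effect(treatment, controls, y):
    """Residual-on-residual estimate of the treatment coefficient (Frisch-Waugh-Lovell)."""
    Z = np.column_stack([np.ones(len(y)), controls])
    y_res = y - Z @ ols(Z, y)                    # part of y not explained by the controls
    t_res = treatment - Z @ ols(Z, treatment)    # part of the treatment not explained by the controls
    return float(t_res @ y_res / (t_res @ t_res))

# --- Synthetic data: every number here is an assumption, not taken from the real file ---
rng = np.random.default_rng(0)
n = 400
speed = rng.uniform(20, 90, n)          # average speed, km/h
temp_outside = rng.uniform(0, 30, n)    # outside temperature, degrees C
# E10 is used more often on warm days, mimicking a seasonal switch of gas type
is_e10 = (temp_outside + rng.normal(0, 8, n) > 15).astype(float)
TRUE_EFFECT = 0.4                       # assumed extra consumption with E10, L/100km
consume = (6.5 - 0.02 * speed - 0.05 * temp_outside
           + TRUE_EFFECT * is_e10 + rng.normal(0, 0.3, n))

controls = np.column_stack([speed, temp_outside])

# Estimate 1: partial out the controls, then regress residual on residual
effect_fwl = partial_out_effect(is_e10, controls, consume)

# Estimate 2: one full regression; the gas-type coefficient is the last one
X_full = np.column_stack([np.ones(n), controls, is_e10])
effect_full = float(ols(X_full, consume)[-1])

# Naive comparison of group means, ignoring the controls
naive = float(consume[is_e10 == 1].mean() - consume[is_e10 == 0].mean())

print(f"partialled-out effect: {effect_fwl:.3f} L/100km")
print(f"full-regression effect: {effect_full:.3f} L/100km")
print(f"naive difference in means: {naive:.3f} L/100km")
```

In this synthetic setup the two regression estimates agree with each other, while the naive difference is pulled away from them because E10 rides happen on warmer days. On the real data, the same comparison across different feature sets is one way to see how much the gas-type coefficient depends on the chosen controls.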