(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)
`vtreat` is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner.
`vtreat` accepts an arbitrary “from the wild” data frame (with different column types, `NA`s and so forth) and returns a transformation that reliably and repeatably converts similar data frames to numeric (matrix-like) frames (all independent variables numeric and free of `NA`s, infinities, and so on) ready for predictive modeling. This is a systematic way to work with high-cardinality character and factor variables (which are incompatible with some machine learning implementations such as random forest, and also bring in a danger of statistical over-fitting) and leaves the analyst more time to incorporate domain-specific data preparation (as `vtreat` tries to handle as much of the common stuff as practical). For more of an overall description please see here.
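The two-step workflow described above (learn a transformation, then apply it to similar data) can be sketched roughly as follows. This is a minimal illustration, assuming the `vtreat` package is installed; the data frame, column names, and the `make_data()` helper are made up for the example.

```r
# Sketch only: assumes the vtreat package is installed; all data is synthetic.
library(vtreat)

set.seed(2016)

# Hypothetical helper that builds a small messy data frame.
make_data <- function(n) {
  d <- data.frame(
    x_num = rnorm(n),
    x_cat = sample(c("a", "b", "c"), n, replace = TRUE),
    stringsAsFactors = FALSE
  )
  d$y <- (d$x_cat == "a") | (runif(n) < 0.2)  # binary outcome
  d$x_num[1] <- NA                            # inject a missing numeric value
  d$x_cat[2] <- NA                            # inject a missing categorical value
  d
}

train <- make_data(200)
new_data <- make_data(50)

# "design" step: learn the treatment plan from training data.
plan <- designTreatmentsC(train,
                          varlist = c("x_num", "x_cat"),
                          outcomename = "y",
                          outcometarget = TRUE,
                          verbose = FALSE)

# "prepare" step: apply the same plan to a similar data frame.
treated <- prepare(plan, new_data, pruneSig = NULL)
```

In this sketch the derived variable columns of `treated` should all be numeric with no `NA`s, and `plan$scoreFrame` describes the variables the plan produced.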
We suggest all users update (and you will want to re-run any “design” steps instead of mixing “design” and “prepare” from two different versions of `vtreat`).
For what is new in version 0.5.27 please read on.
0.5.27 is a maintenance release. User-visible improvements include:
- Switching `catB` encodings to a logit scale (instead of the previous log scale).
- Increasing the degree of parallelism by separately parallelizing the level pruning steps (using the methods outlined here).
- Changing the default for `catScaling` to `FALSE`. We still think working in logistic link-space is a great idea for classification problems; we are just not fully satisfied that un-regularized logistic regressions are the best way to get there (largely due to issues of separation and quasi-separation). In the meantime we think working in an expectation space is the safer (and now default) alternative.
- Falling back to `stats::chisq.test()` instead of insisting on `stats::fisher.test()` for large counts. This calculation is used for level pruning and is only relevant if `rareSig < 1` (the default is `1`). We caution that setting `rareSig < 1` remains a fairly expensive setting. We are trying to make significance estimation much more transparent; for example, we now return how many extra degrees of freedom are hidden by categorical variable re-encodings in a new score frame column called `extraModelDegrees`.
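The fallback between the two significance tests can be illustrated with base R alone. The following is a rough sketch, not `vtreat`'s internal code: the helper name `level_sig()`, the `max_exact_count` cutoff, and the example counts are all assumptions made for illustration.

```r
# Hypothetical helper (not vtreat's internal code): estimate whether a
# level's outcome rate differs from the rest, choosing the test by table size.
level_sig <- function(tab, max_exact_count = 1000) {
  # tab: 2x2 matrix of counts, rows = (level, rest), cols = (TRUE, FALSE)
  if (sum(tab) <= max_exact_count) {
    stats::fisher.test(tab)$p.value   # exact test, used for small counts
  } else {
    stats::chisq.test(tab)$p.value    # approximate test, used for large counts
  }
}

# Made-up counts: the level has 8 TRUE / 2 FALSE, the rest 30 TRUE / 60 FALSE.
small <- matrix(c(8, 2, 30, 60), nrow = 2, byrow = TRUE)
large <- small * 1000                 # same proportions, much larger counts

p_small <- level_sig(small)           # goes through fisher.test()
p_large <- level_sig(large)           # goes through chisq.test()

# A threshold playing the role of rareSig < 1 in this sketch.
rare_sig <- 0.05
level_is_significant <- p_small < rare_sig
```

The design point is only that the exact test is reserved for small tables, while the chi-squared approximation handles large counts.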
The idea is that having data preparation as a re-usable library lets us research, document, optimize, and fine-tune many more details than would make sense on any one analysis project. The main design difference from other data preparation packages is that we emphasize “y-aware” (or outcome-aware) processing (using the training outcome to generate useful re-encodings of the data).
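To make “y-aware” re-encoding concrete, here is a deliberately naive base R sketch that replaces a categorical column with a single numeric code derived from the training outcome. It is not `vtreat`'s implementation (there is no smoothing or cross-validation, and a level with an outcome rate of exactly 0 or 1 would give an infinite code), and the reading of “log scale” as a log rate ratio and “logit scale” as a log-odds difference is an assumption made for illustration.

```r
# Made-up training data: one categorical variable and a binary outcome.
d <- data.frame(
  x = c("a", "a", "a", "a", "b", "b", "b", "b", "b", "b"),
  y = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

logit <- function(p) log(p / (1 - p))

p_all <- mean(d$y)                  # overall outcome rate: 4/10
p_lev <- tapply(d$y, d$x, mean)     # per-level rates: a = 3/4, b = 1/6

# Log-scale code (assumed form): log of the rate ratio.
code_log <- log(p_lev / p_all)

# Logit-scale code (assumed form): difference in log-odds.
code_logit <- logit(p_lev) - logit(p_all)

# Re-encode the categorical column as one numeric column.
d$x_code <- as.numeric(code_logit[as.character(d$x)])
```

Levels with an above-average outcome rate get a positive code and levels with a below-average rate get a negative one, so a downstream model sees a single numeric column instead of one indicator per level.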
We have pre-rendered a lot of the package documentation, examples, and tutorials here.