Clean Data by RFE

Clean data using Recursive Feature Elimination (RFE)

Overview

This node cleans a dataset using Recursive Feature Elimination (RFE), a machine learning technique that repeatedly fits a model and discards the weakest features until a chosen number remain. Given input data, a target column, and the number of features to keep, it outputs the cleaned dataset with the reduced feature set.

Typical use cases include:

  • Preparing datasets by removing irrelevant or less important features before training models.
  • Improving model performance by focusing on key predictors.
  • Reducing dimensionality for easier data analysis.

For example, if you have a dataset with many columns but want to keep only the top 5 features that best predict a target variable, this node will help automate that selection.
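
To make the elimination loop concrete, here is a minimal, dependency-free sketch of RFE. It assumes a linear least-squares model fitted on standardized columns as the ranking estimator, and it assumes the data is a list of row objects with numeric values. The node's actual rfe_script.py is not shown in this document and may well use a different estimator or a library; the names rfe, _solve, and _standardize are illustrative only.

```python
def _solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def _standardize(values):
    """Rescale a column to zero mean and unit variance so weights are comparable."""
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5 or 1.0
    return [(v - mean) / sd for v in values]


def rfe(rows, target, n_keep):
    """Return the names of the n_keep features that survive elimination.

    Assumption: a linear least-squares fit ranks features by |weight|.
    """
    features = [name for name in rows[0] if name != target]
    cols = {f: _standardize([float(r[f]) for r in rows]) for f in features}
    y = _standardize([float(r[target]) for r in rows])

    while len(features) > n_keep:
        # Normal equations (X^T X) w = X^T y, with a tiny ridge term for stability.
        gram = [
            [sum(p * q for p, q in zip(cols[f], cols[g])) + (1e-8 if f == g else 0.0)
             for g in features]
            for f in features
        ]
        rhs = [sum(p * q for p, q in zip(cols[f], y)) for f in features]
        weights = _solve(gram, rhs)

        # Drop the feature with the smallest absolute weight, then refit.
        weakest = min(range(len(features)), key=lambda i: abs(weights[i]))
        features.pop(weakest)
    return features
```

Calling rfe(rows, "price", 5) on such a list of rows would return the five surviving column names; refitting after every removal is what distinguishes RFE from a one-shot ranking.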

Properties

Name | Meaning
Target Column | The name of the target column for RFE; this is the dependent variable to predict.
Number of Features to Keep | Number of features to retain after applying RFE; determines how many top features remain.
Output Format | Choose the output format of the cleaned data: either JSON (default) or Table format.

Output

The node outputs a single JSON object under the json field. On success it holds the cleaned dataset after feature elimination; on failure it holds error details:

  • When Output Format is set to json, the output is:
    {
      "cleanedData": { /* cleaned dataset as JSON */ }
    }
    
  • If an error occurs during execution, the output contains an error message and stack trace:
    {
      "error": "Error message",
      "stack": "Stack trace or 'No stack available'"
    }
    

The node does not output binary data.
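
A downstream consumer may want to tell the two output shapes apart. The sketch below assumes each output item is a dictionary with a json key, mirroring the description above; the function name unwrap_result is hypothetical.

```python
def unwrap_result(item):
    """Return the cleaned dataset from one output item, or raise on an error payload."""
    payload = item["json"]
    if "error" in payload:
        # The stack field may hold the literal fallback text instead of a trace.
        stack = payload.get("stack", "No stack available")
        raise RuntimeError(f"{payload['error']}\n{stack}")
    return payload["cleanedData"]
```

Raising on the error shape keeps later steps from silently processing an item that has no cleanedData field.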

Dependencies

  • Requires Python to be installed and accessible via the command line.
  • Depends on an external Python script (rfe_script.py) located relative to the node's directory (../../model/rfe_script.py).
  • The Python script is expected to accept JSON data, target column name, and number of features as arguments, and return cleaned data in JSON format.
  • No direct API keys or external web services are required.
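
The script contract described above could be sketched roughly as follows. This is not the real rfe_script.py: the argument order, the row-oriented JSON layout, and the decision to keep the target column in the output are all assumptions, and run and select_features are hypothetical names.

```python
import json


def run(argv, select_features):
    """Parse the three arguments, apply a selector, and return cleaned rows as JSON text.

    Assumed argument order: <data_json> <target_column> <n_features>.
    """
    if len(argv) != 3:
        raise ValueError("expected: <data_json> <target_column> <n_features>")
    rows = json.loads(argv[0])   # assumption: data arrives as a JSON list of row objects
    target = argv[1]
    n_features = int(argv[2])    # command-line arguments always arrive as strings

    # select_features stands in for the RFE step; it returns the kept column names.
    kept = select_features(rows, target, n_features)

    # Assumption: the target column is kept alongside the selected features.
    columns = list(kept) + [target]
    cleaned = [{name: row[name] for name in columns} for row in rows]
    return json.dumps(cleaned)
```

A script entry point would pass sys.argv[1:] to run and print the returned string to standard output for the node to read.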

Troubleshooting

  • Common issues:

    • Python not installed or not in system PATH, causing the script execution to fail.
    • The Python script file missing or path incorrect.
    • Input data not properly formatted as JSON or missing the target column.
    • Errors thrown by the Python script due to invalid parameters or data.
  • Error messages:

    • "Error executing RFE script": Indicates failure running the Python script. Check Python installation and script path.
    • Stack traces are provided when available to aid debugging.
  • Resolutions:

    • Ensure Python is installed and accessible.
    • Verify the presence and correctness of the rfe_script.py file.
    • Confirm input data includes the specified target column.
    • Validate the number of features parameter is a positive integer.
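
The resolutions above can be checked up front before the script is launched. The following is a minimal sketch, assuming the input is a list of row dictionaries; preflight and its parameters are hypothetical names, not part of the node.

```python
import os
import shutil


def preflight(python_cmd, script_path, rows, target, n_features):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    # Python must be resolvable from the command line.
    if shutil.which(python_cmd) is None:
        problems.append(f"'{python_cmd}' was not found on PATH")
    # The RFE script must exist at the expected location.
    if not os.path.isfile(script_path):
        problems.append(f"script not found: {script_path}")
    # Every row must carry the target column.
    if not rows or any(target not in row for row in rows):
        problems.append(f"target column '{target}' is missing from the input data")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(n_features, bool) or not isinstance(n_features, int) or n_features < 1:
        problems.append("number of features must be a positive integer")
    return problems
```

Collecting all problems in one pass, rather than stopping at the first, lets a user fix several configuration issues at once.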
