# 一行代码不用写，就可以训练、测试、使用模型，这个 star 量 1.5k 的项目帮你做到

###### 内容来源：计算机视觉研究院 作者：Edison_机器之心

igel 是 GitHub 上的一个热门工具，基于 scikit-learn 构建，支持 sklearn 的所有机器学习功能，如回归、分类和聚类。用户无需编写一行代码即可使用机器学习模型，只要有 yaml 或 json 文件，来描述你想做什么即可。

igel 基于 scikit-learn 构建，支持 sklearn 的所有机器学习功能，如回归、分类和聚类。用户无需编写一行代码即可使用机器学习模型，只要有 yaml 或 json 文件，来描述你想做什么即可。

igel 目前支持的所有配置如下所示：

# dataset operations
dataset:
type: csv  # [str] -> type of your dataset
sep:  # [str] -> Delimiter to use.
delimiter:  # [str] -> Alias for sep.
header:     # [int, list of int] -> Row number(s) to use as the column names, and the start of the data.
names:  # [list] -> List of column names to use
index_col: # [int, str, list of int, list of str, False] -> Column(s) to use as the row labels of the DataFrame,
usecols:    # [list, callable] -> Return a subset of the columns
squeeze:    # [bool] -> If the parsed data only contains one column then return a Series.
prefix:     # [str] -> Prefix to add to column numbers when no header, e.g. ‘X’ for X0, X1, …
mangle_dupe_cols:   # [bool] -> Duplicate columns will be specified as ‘X’, ‘X.1’, …’X.N’, rather than ‘X’…’X’. Passing in False will cause data to be overwritten if there are duplicate names in the columns.
dtype:  # [Type name, dict maping column name to type] -> Data type for data or columns
engine:     # [str] -> Parser engine to use. The C engine is faster while the python engine is currently more feature-complete.
converters: # [dict] -> Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
true_values: # [list] -> Values to consider as True.
false_values: # [list] -> Values to consider as False.
skipinitialspace: # [bool] -> Skip spaces after delimiter.
skiprows: # [list-like] -> Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.
skipfooter: # [int] -> Number of lines at bottom of file to skip
nrows: # [int] -> Number of rows of file to read. Useful for reading pieces of large files.
na_values: # [scalar, str, list, dict] ->  Additional strings to recognize as NA/NaN.
keep_default_na: # [bool] ->  Whether or not to include the default NaN values when parsing the data.
na_filter: # [bool] -> Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file.
verbose: # [bool] -> Indicate number of NA values placed in non-numeric columns.
skip_blank_lines: # [bool] -> If True, skip over blank lines rather than interpreting as NaN values.
parse_dates: # [bool, list of int, list of str, list of lists, dict] ->  try parsing the dates
infer_datetime_format: # [bool] -> If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them.
keep_date_col: # [bool] -> If True and parse_dates specifies combining multiple columns then keep the original columns.
dayfirst: # [bool] -> DD/MM format dates, international and European format.
cache_dates: # [bool] -> If True, use a cache of unique, converted dates to apply the datetime conversion.
thousands: # [str] -> the thousands operator
decimal: # [str] -> Character to recognize as decimal point (e.g. use ‘,’ for European data).
lineterminator: # [str] -> Character to break file into lines.
escapechar: # [str] ->  One-character string used to escape other characters.
comment: # [str] -> Indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character.
encoding: # [str] -> Encoding to use for UTF when reading/writing (ex. ‘utf-8’).
dialect: # [str, csv.Dialect] -> If provided, this parameter will override values (default or not) for the following parameters: delimiter, doublequote, escapechar, skipinitialspace, quotechar, and quoting
delim_whitespace: # [bool] -> Specifies whether or not whitespace (e.g. ' ' or '    ') will be used as the sep
low_memory: # [bool] -> Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference.
memory_map: # [bool] -> If a filepath is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.

split:  # split options
test_size: 0.2  #[float] -> 0.2 means 20% for the test data, so 80% are automatically for training
shuffle: true   # [bool] -> whether to shuffle the data before/while splitting
stratify: None  # [list, None] -> If not None, data is split in a stratified fashion, using this as the class labels.

preprocess: # preprocessing options
missing_values: mean    # [str] -> other possible values: [drop, median, most_frequent, constant] check the docs for more
encoding:
type: oneHotEncoding  # [str] -> other possible values: [labelEncoding]
scale:  # scaling options
method: standard    # [str] -> standardization will scale values to have a 0 mean and 1 standard deviation  | you can also try minmax
target: inputs  # [str] -> scale inputs. | other possible values: [outputs, all] # if you choose all then all values in the dataset will be scaled

# model definition
model:
type: classification    # [str] -> type of the problem you want to solve. | possible values: [regression, classification, clustering]
algorithm: NeuralNetwork    # [str (notice the pascal case)] -> which algorithm you want to use. | type igel algorithms in the Terminal to know more
arguments:          # model arguments: you can check the available arguments for each model by running igel help in your terminal
use_cv_estimator: false     # [bool] -> if this is true, the CV class of the specific model will be used if it is supported
cross_validate:
cv: # [int] -> number of kfold (default 5)
n_jobs:   # [signed int] -> The number of CPUs to use to do the computation (default None)
verbose: # [int] -> The verbosity level. (default 0)
hyperparameter_search:
method: grid_search   # method you want to use: grid_search and random_search are supported
parameter_grid:     # put your parameters grid here that you want to use, an example is provided below
param1: [val1, val2]
param2: [val1, val2]
arguments:  # additional arguments you want to provide for the hyperparameter search
cv: 5   # number of folds
refit: true   # whether to refit the model after the search
return_train_score: false   # whether to return the train score
verbose: 0      # verbosity level

# target you want to predict
target:  # list of strings: basically put here the column(s), you want to predict that exist in your csv dataset
- put the target you want to predict here
- you can assign many target if you are making a multioutput prediction

• 支持所有机器学习 SOTA 模型（甚至包括预览版模型）；

• 支持不同的数据预处理方法；

• 既能写入配置文件，又能提供灵活性和数据控制；

• 支持交叉验证；

• 支持 yaml 和 json 格式；

• 支持不同的 sklearn 度量，进行回归、分类和聚类；

• 支持多输出 / 多目标回归和分类；

• 在并行模型构建时支持多处理。

igel 支持的回归、分类和聚类模型。

$igel --help # or just$ igel -h
"""
Take some time and read the output of help command. You ll save time later if you understand how to use igel.
"""

"""
igel init <args>
possible optional args are: (notice that these args are optional, so you can also just run igel init if you want)
-type: regression, classification or clustering
-model: model you want to use
-target: target you want to predict

Example:
If I want to use neural networks to classify whether someone is sick or not using the indian-diabetes dataset,
then I would use this command to initialize a yaml file:
$igel init -type "classification" -model "NeuralNetwork" -target "sick" """$ igel init

# model definition
model:
# in the type field, you can write the type of problem you want to solve. Whether regression, classification or clustering
# Then, provide the algorithm you want to use on the data. Here I'm using the random forest algorithm
type: classification
algorithm: RandomForest     # make sure you write the name of the algorithm in pascal case
arguments:
n_estimators: 100   # here, I set the number of estimators (or trees) to 100
max_depth: 30       # set the max_depth of the tree

# target you want to predict
# Here, as an example, I'm using the famous indians-diabetes dataset, where I want to predict whether someone have diabetes or not.
# Depending on your data, you need to provide the target(s) you want to predict here
target:
- sick

igel 的使用方式应该是从终端（igel CLI）：

$igel fit --data_path 'path_to_your_csv_dataset.csv' --yaml_file 'path_to_your_yaml_file.yaml' # or shorter$ igel fit -dp 'path_to_your_csv_dataset.csv' -yml 'path_to_your_yaml_file.yaml'

"""
That's it. Your "trained" model can be now found in the model_results folder
(automatically created for you in your current working directory).
Furthermore, a description can be found in the description.json file inside the model_results folder.
"""

$igel evaluate -dp 'path_to_your_evaluation_dataset.csv' """ This will automatically generate an evaluation.json file in the current directory, where all evaluation results are stored """ 如果你对评估结果比较满意，就可以使用这个训练 / 预训练好的模型执行预测。 $ igel predict -dp 'path_to_your_test_dataset.csv'
"""
This will generate a predictions.csv file in your current directory, where all predictions are stored in a csv file
"""

$igel experiment -DP "path_to_train_data path_to_eval_data path_to_test_data" -yml "path_to_yaml_file" """ This will run fit using train_data, evaluate using eval_data and further generate predictions using the test_data """ 当然，如果你想写代码也是可以的： 交互模式 交互模式是 v0.2.6 及以上版本中新添加的，该模式可以让你按照自己喜欢的方式写参数。 也就是说，你可以使用 fit、evaluate、predict、experiment 等命令而无需指定任何额外的参数，比如： igel fit 如果你只是编写这些内容并点击「enter」，系统将提示你提供额外的强制参数。0.2.5 及以下版本会报错，所以你需要使用 0.2.6 及以上版本。 如 demo 所示，你不需要记住这些参数，igel 会提示你输入这些内容。具体而言，Igel 会提供一条信息，解释你需要输入哪个参数。括号之间的值表示默认值。 端到端训练示例 项目作者给出了使用 igel 进行端到端训练的完整示例，即使用决策树算法预测某人是否患有糖尿病。你需要创建一个 yaml 配置文件，数据集可以在 examples 文件夹中找到。 拟合 / 训练模型： model: type: classification algorithm: DecisionTree target: - sick $ igel fit -dp path_to_the_dataset -yml path_to_the_yaml_file

$igel evaluate -dp path_to_the_evaluation_dataset Igel 进行模型评估，并将 statistics/results 保存在 model_results 文件夹中的 evaluation.json 文件中。 预测： 这一步使用预训练模型预测新数据。这由 igel 自动完成，你只需提供预测数据的路径即可。 $ igel predict -dp path_to_the_new_dataset

Igel 使用预训练模型执行预测，并将其保存在 model_results 文件夹中的 predictions.csv 文件中。

# dataset operations
dataset:
split:
test_size: 0.2
shuffle: True
stratify: default

preprocess: # preprocessing options
missing_values: mean    # other possible values: [drop, median, most_frequent, constant] check the docs for more
encoding:
type: oneHotEncoding  # other possible values: [labelEncoding]
scale:  # scaling options
method: standard    # standardization will scale values to have a 0 mean and 1 standard deviation  | you can also try minmax
target: inputs  # scale inputs. | other possible values: [outputs, all] # if you choose all then all values in the dataset will be scaled

# model definition
model:
type: classification
algorithm: RandomForest
arguments:
# notice that this is the available args for the random forest model. check different available args for all supported models by running igel help
n_estimators: 100
max_depth: 20

# target you want to predict
target:
- sick

$igel fit -dp path_to_the_dataset -yml path_to_the_yaml_file 评估： $ igel evaluate -dp path_to_the_evaluation_dataset

\$ igel predict -dp path_to_the_new_dataset

／End.