There are many types of sparse matrices, each providing different guarantees for its operations.
scipy.sparse.bsr_matrix
scipy.sparse.coo_matrix
scipy.sparse.csc_matrix
scipy.sparse.csr_matrix
scipy.sparse.dia_matrix
scipy.sparse.dok_matrix
scipy.sparse.lil_matrix
More about sparse matrices
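As a minimal sketch (not from the original material) of how the formats differ in practice: COO is convenient for construction, CSR for row operations.
import numpy as np
from scipy.sparse import coo_matrix

rows, cols = np.array([0, 1, 2]), np.array([1, 2, 0])
data = np.array([3.0, 4.0, 5.0])
m_coo = coo_matrix((data, (rows, cols)), shape=(3, 3))  # COO: cheap to construct
m_csr = m_coo.tocsr()                                   # CSR: fast row slicing and products
print(m_csr[1])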
Sparse matrices have their own hstack and vstack, which live in scipy.sparse.
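For example (a sketch assuming two small CSR matrices, not from the original notebook), feature blocks can be concatenated without densifying them:
from scipy.sparse import csr_matrix, hstack

a = csr_matrix([[1, 0], [0, 2]])
b = csr_matrix([[0, 3], [4, 0]])
combined = hstack([a, b]).tocsr()  # shape (2, 4), still sparse
print(combined.shape)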
Almost all models accept sparse matrices as input; a few do not.
We want to determine a question's tag from its text.
texts = pd.read_csv('windows_vs_linux.10k.tsv', header=None, sep='\t')
texts.columns = ['text', 'is_windows']
print(texts.shape)
texts.head(4)
(10000, 2)
|   | text | is_windows |
|---|---|---|
| 0 | so i find myself porting a game that was orig... | 0 |
| 1 | i ve been using tortoisesvn in a windows envi... | 1 |
| 2 | we are using wmv videos on an internal site a... | 1 |
| 3 | on one linux server running apache and php 5 ... | 0 |
As features we use the indicator of whether a word occurs in the document.
vectorizer = CountVectorizer(binary=True)
bow = vectorizer.fit_transform(texts.text)
print(bow.shape)
print(type(bow))
(10000, 40971) <class 'scipy.sparse.csr.csr_matrix'>
params = {'C': np.logspace(-5, 5, 11)}
clf = LogisticRegression()
cv = GridSearchCV(clf, params, n_jobs=-1, scoring='roc_auc', cv=5)
cv.fit(bow, texts.is_windows);
pd.DataFrame(cv.cv_results_)[['mean_test_score', 'params']].sort_values('mean_test_score', ascending=False)
|   | mean_test_score | params |
|---|---|---|
| 4 | 0.965813 | {u'C': 0.1} |
| 5 | 0.962415 | {u'C': 1.0} |
| 3 | 0.961759 | {u'C': 0.01} |
| 6 | 0.955353 | {u'C': 10.0} |
| 7 | 0.948635 | {u'C': 100.0} |
| 2 | 0.945693 | {u'C': 0.001} |
| 8 | 0.945429 | {u'C': 1000.0} |
| 9 | 0.943599 | {u'C': 10000.0} |
| 10 | 0.942903 | {u'C': 100000.0} |
| 1 | 0.790566 | {u'C': 0.0001} |
| 0 | 0.632507 | {u'C': 1e-05} |
top = pd.DataFrame([get_top_windows(cv.best_estimator_, 6),
get_top_linux(cv.best_estimator_, 6)]).T
top.columns = ['Windows', 'Linux']
top
|   | Windows | Linux |
|---|---|---|
| 0 | windows | ubuntu |
| 1 | win32 | root |
| 2 | vista | mono |
| 3 | exe | linux |
| 4 | dll | kernel |
| 5 | batch | bash |
Imagine we have a very large number of features and want to reduce it, keeping only the most useful ones.
Here we take all possible $N$-grams with $N \in \{1, \dots, 4\}$.
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 4))
bow = vectorizer.fit_transform(texts.text)
print(bow.shape)
print(type(bow))
(10000, 2117115) <class 'scipy.sparse.csr.csr_matrix'>
We can select $50000$ features with SelectKBest and see that cross-validation now reports better quality than before.
Is there a problem here?
k_best = SelectKBest(k=50000)
bow_k_best = k_best.fit_transform(bow, texts.is_windows)
clf = LogisticRegression()
np.mean(cross_val_score(clf, bow_k_best, texts.is_windows, scoring='roc_auc', cv=5))
0.97955838190531419
Whenever feature selection is performed, it must be checked on a held-out set; otherwise the quality estimate will be heavily inflated.
x_train, x_test, y_train, y_test = train_test_split(bow, texts.is_windows)
print(x_train.shape)
print(x_test.shape)
(7500, 2117115) (2500, 2117115)
k_best = SelectKBest(k=50000)
x_train_k_best = k_best.fit_transform(x_train, y_train)
x_test_k_best = k_best.transform(x_test)
clf = LogisticRegression()
clf.fit(x_train_k_best, y_train)
roc_auc_score(y_test, clf.predict_proba(x_test_k_best)[:, 1])
0.96876106147585705
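An alternative way to avoid the leak (a sketch, not part of the original notebook) is to put SelectKBest inside a Pipeline, so that selection is re-fit on the training part of every fold:
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ('select', SelectKBest(k=50000)),   # fitted only on each training fold
    ('clf', LogisticRegression()),
])
np.mean(cross_val_score(pipe, bow, texts.is_windows, scoring='roc_auc', cv=5))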
y_hat = cross_val_predict(cv.best_estimator_, bow, texts.is_windows, method='predict_proba')[:, 1]
y_hat[:5]
array([ 0.12431717, 0.99933152, 0.98910196, 0.02073042, 0.84102736])
Applying any order-preserving transformation to the predictions gives the same $AUC$ value.
print(roc_auc_score(texts.is_windows, y_hat))
print(roc_auc_score(texts.is_windows, y_hat * 2 + 1))
print(roc_auc_score(texts.is_windows, y_hat ** 2))
0.963904819501 0.963904819501 0.963904819501
fpr, tpr, _ = roc_curve(texts.is_windows, y_hat)
plot(fpr, tpr, lw=2);
plot([0, 1], [0, 1], linestyle='--');
xlim([0.0, 1.0])
ylim([0.0, 1.05])
xlabel('False Positive Rate')
ylabel('True Positive Rate');
texts = pd.read_csv('multi_tag.10k.tsv', header=None, sep='\t')
texts.columns = ['text', 'tags']
print(texts.shape)
texts.head(4)
(10000, 2)
|   | text | tags |
|---|---|---|
| 0 | i want to use a track bar to change a form s ... | c# winforms type-conversion decimal opacity |
| 1 | i have an absolutely positioned div containin... | html css css3 internet-explorer-7 |
| 2 | given a datetime representing a person s birt... | c# .net datetime |
| 3 | given a specific datetime value how do i disp... | c# datetime datediff relative-time-span |
Some methods handle multiple labels out of the box:
KNeighborsClassifier
RandomForestClassifier
SVC
The rest can be generalized by training several models (a minimal sketch follows this list):
OneVsRestClassifier
OneVsOneClassifier
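A minimal sketch of the wrapper idea; the tuned pipeline on the actual features appears below.
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# one binary LogisticRegression is trained per tag; fit expects an indicator
# target matrix of shape (n_samples, n_tags)
ovr = OneVsRestClassifier(LogisticRegression())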
$tp$, $fp$, and $fn$ are computed over all the tags of a single object.
y, y_hat = np.array([[1, 1, 0, 0]]), np.array([[1, 0, 1, 0]])
tp, fp, fn = 1., 1., 1.
p, r = tp / (tp + fp), tp / (tp + fn)
f1 = 2 * p * r / (p + r)
print(f1)
print(f1_score(y, y_hat, average='samples'))
0.5 0.5
Let's repeat everything as in the two-class case.
vectorizer = CountVectorizer(binary=True)
bow = vectorizer.fit_transform(texts.text)
print(bow.shape)
print(type(bow))
(10000, 35247) <class 'scipy.sparse.csr.csr_matrix'>
tags = texts.tags.apply(lambda x: x.split())
all_tags = reduce(lambda s, x: s + x, tags, [])
values, count = np.unique(all_tags, return_counts=True)
top_tags = sorted(zip(count, values), reverse=True)[:20]
top_tags[:5]
[(1198, 'c#'), (1090, '.net'), (696, 'java'), (610, 'asp.net'), (472, 'sql-server')]
Convert the tag lists into a matrix of indicators showing whether a question has each tag.
binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(texts.tags.apply(lambda x: filter_tags(x.split())))
print(y.shape)
print(type(y))
(10000, 21) <type 'numpy.ndarray'>
We again use LogisticRegression, but now wrapped in OneVsRestClassifier.
params = {'estimator__C': np.logspace(-5, 5, 11)}
clf = OneVsRestClassifier(LogisticRegression())
cv = GridSearchCV(clf, params, n_jobs=-1, scoring=make_scorer(f1_score, average='samples'), cv=5)
cv.fit(bow, y);
pd.DataFrame(cv.cv_results_)[['mean_test_score', 'params']].sort_values('mean_test_score', ascending=False)
|   | mean_test_score | params |
|---|---|---|
| 10 | 0.404732 | {u'estimator__C': 100000.0} |
| 9 | 0.404212 | {u'estimator__C': 10000.0} |
| 8 | 0.404140 | {u'estimator__C': 1000.0} |
| 7 | 0.403066 | {u'estimator__C': 100.0} |
| 6 | 0.402630 | {u'estimator__C': 10.0} |
| 5 | 0.390346 | {u'estimator__C': 1.0} |
| 4 | 0.319707 | {u'estimator__C': 0.1} |
| 3 | 0.092920 | {u'estimator__C': 0.01} |
| 2 | 0.000300 | {u'estimator__C': 0.001} |
| 0 | 0.000000 | {u'estimator__C': 1e-05} |
| 1 | 0.000000 | {u'estimator__C': 0.0001} |
Now replace CountVectorizer with TfidfVectorizer.
vectorizer = TfidfVectorizer()
tf_idf = vectorizer.fit_transform(texts.text)
print(tf_idf.shape)
print(type(tf_idf))
(10000, 35247) <class 'scipy.sparse.csr.csr_matrix'>
Search for the best parameters.
params = {'estimator__C': np.logspace(-5, 5, 11)}
clf = OneVsRestClassifier(LogisticRegression())
cv = GridSearchCV(clf, params, n_jobs=-1, scoring=make_scorer(f1_score, average='samples'), cv=5)
cv.fit(tf_idf, y);
pd.DataFrame(cv.cv_results_)[['mean_test_score', 'params']].sort_values('mean_test_score', ascending=False)
|   | mean_test_score | params |
|---|---|---|
| 10 | 0.382173 | {u'estimator__C': 100000.0} |
| 9 | 0.380033 | {u'estimator__C': 10000.0} |
| 8 | 0.376427 | {u'estimator__C': 1000.0} |
| 7 | 0.364680 | {u'estimator__C': 100.0} |
| 6 | 0.334247 | {u'estimator__C': 10.0} |
| 5 | 0.182323 | {u'estimator__C': 1.0} |
| 4 | 0.000700 | {u'estimator__C': 0.1} |
| 0 | 0.000000 | {u'estimator__C': 1e-05} |
| 1 | 0.000000 | {u'estimator__C': 0.0001} |
| 2 | 0.000000 | {u'estimator__C': 0.001} |
| 3 | 0.000000 | {u'estimator__C': 0.01} |
predict returns 1 when the predicted class probability is greater than $0.5$, so instead we take predict_proba and choose the threshold ourselves.
clf = OneVsRestClassifier(LogisticRegression(C=100000))
y_hat_bow = cross_val_predict(clf, bow, y, method='predict_proba')
y_hat_tf_idf = cross_val_predict(clf, tf_idf, y, method='predict_proba')
A function that assigns tags depending on the threshold:
def get_score(alpha, y, y_hat):
return f1_score(y, (y_hat > alpha).astype('int'), average='samples')
alphas = np.linspace(0.0, 0.01, 100)
scores = [get_score(alpha, y, y_hat_bow) for alpha in alphas]
plot(alphas, scores);
scatter(alphas[np.argmax(scores)], np.max(scores));
print(np.max(scores))
print(alphas[np.argmax(scores)])
0.454356904762 0.00535353535354
alphas = np.linspace(0.0, 0.01, 100)
scores = [get_score(alpha, y, y_hat_tf_idf) for alpha in alphas]
plot(alphas, scores);
scatter(alphas[np.argmax(scores)], np.max(scores));
print(np.max(scores))
print(alphas[np.argmax(scores)])
0.493972857143 0.00191919191919
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 3))
bow = vectorizer.fit_transform(texts.text)
print(bow.shape)
print(type(bow))
(10000, 1132373) <class 'scipy.sparse.csr.csr_matrix'>
Instead of storing every distinct N-gram separately, we obtain the column index by hashing its contents.
This lets us choose an arbitrary number of columns.
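The idea behind this hashing trick, as a toy sketch; this is not HashingVectorizer's exact implementation (scikit-learn uses a signed MurmurHash3), and the actual call follows below.
n_features = 50000

def column_index(ngram):
    # Python's built-in hash is salted per process, so in practice a stable
    # hash function is used; this only illustrates the mapping to a column
    return hash(ngram) % n_features

print(column_index('windows'), column_index('linux kernel'))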
vectorizer = HashingVectorizer(binary=True, ngram_range=(1, 3), n_features=50000)
bow = vectorizer.fit_transform(texts.text)
print(bow.shape)
print(type(bow))
(10000, 50000) <class 'scipy.sparse.csr.csr_matrix'>
clf = OneVsRestClassifier(LogisticRegression(C=100000))
y_hat_bow = cross_val_predict(clf, bow, y, method='predict_proba')
alphas = np.linspace(0.0, 0.01, 100)
scores = [get_score(alpha, y, y_hat_bow) for alpha in alphas]
plot(alphas, scores);
scatter(alphas[np.argmax(scores)], np.max(scores));
print(np.max(scores))
print(alphas[np.argmax(scores)])
0.509255 0.00191919191919
We can obtain dense matrices in any way we like.
This reduces training time and lets us use methods that are better suited to dense matrices.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tf_idf = vectorizer.fit_transform(texts.text)
print(tf_idf.shape)
print(type(tf_idf))
(10000, 397950) <class 'scipy.sparse.csr.csr_matrix'>
svd = TruncatedSVD(n_components=200, n_iter=5)
tf_idf_svd = svd.fit_transform(tf_idf)
When we have several models, we can produce a blended prediction.
If the models are not strongly correlated, blending often improves the quality of the resulting model.
vectorizer = HashingVectorizer(binary=True, ngram_range=(1, 3), n_features=50000)
bow = vectorizer.fit_transform(texts.text)
print(bow.shape)
print(type(bow))
(10000, 50000) <class 'scipy.sparse.csr.csr_matrix'>
vectorizer = TfidfVectorizer()
tf_idf = vectorizer.fit_transform(texts.text)
print(tf_idf.shape)
print(type(tf_idf))
(10000, 35247) <class 'scipy.sparse.csr.csr_matrix'>
Using cross-validation, we predict the training set with each model.
This gives unbiased predictions for objects whose labels we already know.
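The variable folds is not defined in this excerpt; a reasonable assumption is a single fixed splitter shared by both models, so that their out-of-fold predictions are aligned:
from sklearn.model_selection import KFold

folds = KFold(n_splits=5, shuffle=True, random_state=42)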
clf_lr = OneVsRestClassifier(LogisticRegression(C=100000))
y_hat_lr = cross_val_predict(clf_lr, bow, y, method='predict_proba', cv=folds)
clf_lr = OneVsRestClassifier(LogisticRegression(C=100000))
y_hat_lr_tf_idf = cross_val_predict(clf_lr, tf_idf, y, method='predict_proba', cv=folds)
Let's compute the quality of each model separately and of their blend.
alphas = np.linspace(0.0, 0.02, 100)
lr_scores = [get_score(alpha, y, y_hat_lr) for alpha in alphas]
nb_scores = [get_score(alpha, y, y_hat_lr_tf_idf) for alpha in alphas]
lr_nb_scores = [get_score(alpha, y, 0.5 * y_hat_lr_tf_idf + 0.5 * y_hat_lr) for alpha in alphas]
print(np.max(lr_scores))
print(np.max(nb_scores))
print(np.max(lr_nb_scores))
0.50923452381 0.493558095238 0.527097619048
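The equal 0.5/0.5 mix above is just one choice; the mixing weight itself can be tuned together with the threshold. A sketch (not from the original notebook, using the variables defined above):
def blend(a, b, w):
    # weighted average of two probability matrices
    return w * a + (1.0 - w) * b

best = max(
    (get_score(alpha, y, blend(y_hat_lr, y_hat_lr_tf_idf, w)), w, alpha)
    for w in np.linspace(0, 1, 11)
    for alpha in np.linspace(0.0, 0.02, 100)
)
print(best)  # (best F1, best weight, best threshold)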
plot(alphas, lr_scores);
plot(alphas, nb_scores);
plot(alphas, lr_nb_scores);
scatter(alphas[np.argmax(lr_scores)], np.max(lr_scores));
scatter(alphas[np.argmax(nb_scores)], np.max(nb_scores));
scatter(alphas[np.argmax(lr_nb_scores)], np.max(lr_nb_scores));
Instead of blending the results by hand, we can feed them as input to other algorithms.
Let's prepare a variable stacked that holds the predictions of the previous algorithms.
stacked = np.hstack([y_hat_lr, y_hat_lr_tf_idf])
clf_stacked = OneVsRestClassifier(RandomForestClassifier(n_estimators=100))
y_hat_stacked = cross_val_predict(clf_stacked, stacked, y, method='predict_proba', cv=folds)
After tuning the threshold we get $F1=0.547874$, higher than all previous results.
alphas = np.linspace(0, 1, 100)
scores = [get_score(alpha, y, y_hat_stacked) for alpha in alphas]
plot(alphas, scores);
scatter(alphas[np.argmax(scores)], np.max(scores));
print(np.max(scores))
print(alphas[np.argmax(scores)])
0.547874126984 0.232323232323
Stacking or blending appears in almost every competition in one form or another, so it is important to understand how they work and how to use them.
!vw -h
Num weight bits = 18 learning rate = 0.5 initial_t = 0 power_t = 0.5 using no cache Reading datafile = num sources = 1 VW options: --random_seed arg seed random number generator --ring_size arg size of example ring Update options: -l [ --learning_rate ] arg Set learning rate --power_t arg t power value --decay_learning_rate arg Set Decay factor for learning_rate between passes --initial_t arg initial t value --feature_mask arg Use existing regressor to determine which parameters may be updated. If no initial_regressor given, also used for initial weights. Weight options: -i [ --initial_regressor ] arg Initial regressor(s) --initial_weight arg Set all weights to an initial value of arg. --random_weights arg make initial weights random --input_feature_regularizer arg Per feature regularization input file Parallelization options: --span_server arg Location of server for setting up spanning tree --threads Enable multi-threading --unique_id arg (=0) unique id used for cluster parallel jobs --total arg (=1) total number of nodes used in cluster parallel job --node arg (=0) node number in cluster parallel job Diagnostic options: --version Version information -a [ --audit ] print weights of features -P [ --progress ] arg Progress update frequency. int: additive, float: multiplicative --quiet Don't output disgnostics and progress updates -h [ --help ] Look here: http://hunch.net/~vw/ and click on Tutorial. Feature options: --hash arg how to hash the features. Available options: strings, all --ignore arg ignore namespaces beginning with character <arg> --keep arg keep namespaces beginning with character <arg> --redefine arg redefine namespaces beginning with characters of string S as namespace N. <arg> shall be in form 'N:=S' where := is operator. Empty N or S are treated as default namespace. Use ':' as a wildcard in S. -b [ --bit_precision ] arg number of bits in the feature table --noconstant Don't add a constant feature -C [ --constant ] arg Set initial value of constant --ngram arg Generate N grams. To generate N grams for a single namespace 'foo', arg should be fN. --skips arg Generate skips in N grams. This in conjunction with the ngram tag can be used to generate generalized n-skip-k-gram. To generate n-skips for a single namespace 'foo', arg should be fN. --feature_limit arg limit to N features. To apply to a single namespace 'foo', arg should be fN --affix arg generate prefixes/suffixes of features; argument '+2a,-3b,+1' means generate 2-char prefixes for namespace a, 3-char suffixes for b and 1 char prefixes for default namespace --spelling arg compute spelling features for a give namespace (use '_' for default namespace) --dictionary arg read a dictionary for additional features (arg either 'x:file' or just 'file') --dictionary_path arg look in this directory for dictionaries; defaults to current directory or env{PATH} --interactions arg Create feature interactions of any level between namespaces. --permutations Use permutations instead of combinations for feature interactions of same namespace. --leave_duplicate_interactions Don't remove interactions with duplicate combinations of namespaces. For ex. this is a duplicate: '-q ab -q ba' and a lot more in '-q ::'. 
-q [ --quadratic ] arg Create and use quadratic features --q: arg : corresponds to a wildcard for all printable characters --cubic arg Create and use cubic features Example options: -t [ --testonly ] Ignore label information and just test --holdout_off no holdout data in multiple passes --holdout_period arg holdout period for test only, default 10 --holdout_after arg holdout after n training examples, default off (disables holdout_period) --early_terminate arg Specify the number of passes tolerated when holdout loss doesn't decrease before early termination, default is 3 --passes arg Number of Training Passes --initial_pass_length arg initial number of examples per pass --examples arg number of examples to parse --min_prediction arg Smallest prediction to output --max_prediction arg Largest prediction to output --sort_features turn this on to disregard order in which features have been defined. This will lead to smaller cache sizes --loss_function arg (=squared) Specify the loss function to be used, uses squared by default. Currently available ones are squared, classic, hinge, logistic, quantile and poisson. --quantile_tau arg (=0.5) Parameter \tau associated with Quantile loss. Defaults to 0.5 --l1 arg l_1 lambda --l2 arg l_2 lambda --named_labels arg use names for labels (multiclass, etc.) rather than integers, argument specified all possible labels, comma-sep, eg "--named_labels Noun,Verb,Adj,Punc" Output model: -f [ --final_regressor ] arg Final regressor --readable_model arg Output human-readable final regressor with numeric features --invert_hash arg Output human-readable final regressor with feature names. Computationally expensive. --save_resume save extra state so learning can be resumed later with new data --save_per_pass Save the model after every pass over data --output_feature_regularizer_binary arg Per feature regularization output file --output_feature_regularizer_text arg Per feature regularization output file, in text --id arg User supplied ID embedded into the final regressor Output options: -p [ --predictions ] arg File to output predictions to -r [ --raw_predictions ] arg File to output unnormalized predictions to Reduction options, use [option] --help for more info: --audit_regressor arg stores feature names and their regressor values. Same dataset must be used for both regressor training and this mode. --bootstrap arg k-way bootstrap by online importance resampling --search arg Use learning to search, argument=maximum action id or 0 for LDF --replay_c arg use experience replay at a specified level [b=classification/regression, m=multiclass, c=cost sensitive] with specified buffer size --explore_eval Evaluate explore_eval adf policies --cbify arg Convert multiclass on <k> classes into a contextual bandit problem --cb_explore_adf Online explore-exploit for a contextual bandit problem with multiline action dependent features --cb_explore arg Online explore-exploit for a <k> action contextual bandit problem --multiworld_test arg Evaluate features as a policies --cb_adf Do Contextual Bandit learning with multiline action dependent features. --cb arg Use contextual bandit learning with <k> costs --csoaa_ldf arg Use one-against-all multiclass learning with label dependent features. Specify singleline or multiline. --wap_ldf arg Use weighted all-pairs multiclass learning with label dependent features. Specify singleline or multiline. 
--interact arg Put weights on feature products from namespaces <n1> and <n2> --csoaa arg One-against-all multiclass with <k> costs --multilabel_oaa arg One-against-all multilabel with <k> labels --recall_tree arg Use online tree for multiclass --log_multi arg Use online tree for multiclass --ect arg Error correcting tournament with <k> labels --boosting arg Online boosting with <N> weak learners --oaa arg One-against-all multiclass with <k> labels --top arg top k recommendation --replay_m arg use experience replay at a specified level [b=classification/regression, m=multiclass, c=cost sensitive] with specified buffer size --binary report loss as binary classification on -1,1 --link arg (=identity) Specify the link function: identity, logistic, glf1 or poisson --stage_poly use stagewise polynomial feature learning --lrqfa arg use low rank quadratic features with field aware weights --lrq arg use low rank quadratic features --autolink arg create link function with polynomial d --marginal arg substitute marginal label estimates for ids --new_mf arg rank for reduction-based matrix factorization --nn arg Sigmoidal feedforward network with <k> hidden units confidence options: --confidence_after_training Confidence after training --confidence Get confidence for binary predictions --active_cover enable active learning with cover --active enable active learning --replay_b arg use experience replay at a specified level [b=classification/regression, m=multiclass, c=cost sensitive] with specified buffer size --OjaNewton Online Newton with Oja's Sketch --bfgs use bfgs optimization --conjugate_gradient use conjugate gradient based optimization --lda arg Run lda with <int> topics --noop do no learning --print print examples --rank arg rank for matrix factorization. --sendto arg send examples to <host> --svrg Streaming Stochastic Variance Reduced Gradient --ftrl FTRL: Follow the Proximal Regularized Leader --pistol FTRL: Parameter-free Stochastic Learning --ksvm kernel svm Gradient Descent options: --sgd use regular stochastic gradient descent update. --adaptive use adaptive, individual learning rates. --invariant use safe/importance aware updates. --normalized use per feature normalized updates --sparse_l2 arg (=0) use per feature normalized updates Input options: -d [ --data ] arg Example Set --daemon persistent daemon mode on port 26542 --port arg port to listen on; use 0 to pick unused port --num_children arg number of children for persistent daemon mode --pid_file arg Write pid file in persistent daemon mode --port_file arg Write port used in persistent daemon mode -c [ --cache ] Use a cache. The default is <data>.cache --cache_file arg The location(s) of cache_file. --json Enable JSON parsing. -k [ --kill_cache ] do not reuse existing cache: create a new one always --compressed use gzip format whenever possible. If a cache file is being created, this option creates a compressed cache file. A mixture of raw-text & compressed inputs are supported with autodetection. --no_stdin do not default to reading from stdin
Label [weight] |Namespace Feature ... |Namespace ...
Label: the class label for classification, or a real number for regression.
weight: the object's weight; equal for all objects by default.
Namespace: features are grouped into namespaces, which can be used selectively or to build quadratic features between namespaces.
Feature: string[:value] or int[:value]; strings are hashed, integers are used directly as indices into the feature vector; value defaults to $1$.
A function $h$ is introduced that maps a feature to the index at which its value is written into the object's feature vector:
$$h : F \rightarrow \{0, \dots, 2^b - 1\}$$
The -b (--bit_precision) option sets the size of the hash function's range.
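A toy illustration of $h$ (this is not vw's actual hash function, which is roughly a MurmurHash seeded by the namespace; the sketch only mimics mapping a string into $\{0, \dots, 2^b - 1\}$):
b = 18                 # vw default, see "Num weight bits = 18" in the output above
table_size = 2 ** b

def h(feature, namespace=''):
    # maps a (namespace, feature) pair into a bucket index
    return hash(namespace + '^' + feature) % table_size

print(h('windows'), h('ubuntu'))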
vw can use either SGD or L-BFGS for optimization:
SGD is the default and supports online learning; several passes over the data are almost always needed.
L-BFGS is enabled with --bfgs and works only on data of modest size.
The number of SGD passes is set with the --passes option.
We iterate over the training set many times; on each object the weights are updated with a step size that decays over time (a sketch of the schedule follows the option list below).
Here $t$ is the index of the training object and $k$ is the number of the pass over the whole dataset.
The step size is controlled by the options:
-l (learning rate)
--decay_learning_rate
--initial_t
--power_t
--passes
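For reference, these options combine into vw's default learning-rate schedule, which (reconstructed from the vw documentation, so treat it as a sketch rather than the exact formula) looks roughly like
$$\eta_{t}^{(k)} = \lambda \, d^{\,k} \left(\frac{t_0}{t_0 + w_t}\right)^{p},$$
where $\lambda$ is -l, $d$ is --decay_learning_rate, $t_0$ is --initial_t, $p$ is --power_t, and $w_t$ is the sum of importance weights of the examples seen so far.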
average loss: the loss computed by progressive validation.
$e_i$ is the loss on object $x_i$ for a model trained on objects $\{x_1, \dots, x_{i-1}\}$; the reported value is the mean of these $e_i$.
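A sketch of progressive validation with an online learner (an illustration only: X and y are placeholders for a streaming feature matrix and 0/1 labels, and SGDClassifier stands in for vw's online updates). Each example is scored before the model trains on it, and the running mean of those losses is what vw prints as average loss.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss

clf = SGDClassifier(loss='log_loss')             # 'log' on older scikit-learn versions
classes = np.array([0, 1])
clf.partial_fit(X[:1], y[:1], classes=classes)   # bootstrap on the first example

losses = []
for i in range(1, X.shape[0]):
    x_i = X[i:i + 1]
    p = clf.predict_proba(x_i)                              # e_i: score *before* updating
    losses.append(log_loss(y[i:i + 1], p, labels=classes))
    clf.partial_fit(x_i, y[i:i + 1])                        # then learn from the example

print(np.mean(losses))                                      # progressive validation loss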
-1 | A:1 B:10
1 | A:-1 B:12
-1 | so i find myself porting a game that was originally written
1 | i ve been using tortoisesvn in a windows environment for quite some time
windows vs linux
First, convert the data into the vw format.
texts = pd.read_csv('windows_vs_linux.10k.tsv', sep='\t', header=None)
texts[1].replace({0: '-1 ', 1: '1 '}, inplace=True)
train_texts, test_texts = train_test_split(texts)
train_texts[[1, 0]].to_csv('win_vs_lin.train.vw', sep='|', header=None, index=False)
test_texts[[1, 0]].to_csv('win_vs_lin.test.vw', sep='|', header=None, index=False)
!head -n 5 win_vs_lin.train.vw | cut -c 1-50
1 | i have a bat file shown below echo off for f d
1 | i need a way to determine whether the computer
-1 | my c application uses 3rd libraries which do
-1 | currently i m trying to install php 5 3 0 on
1 | i how to get the windowproc for a form in c fo
!vw -d win_vs_lin.train.vw --loss_function logistic -P 10000 -f model.vw --passes 100 -c
final_regressor = model.vw Num weight bits = 18 learning rate = 0.5 initial_t = 0 power_t = 0.5 decay_learning_rate = 1 creating cache_file = win_vs_lin.train.vw.cache Reading datafile = win_vs_lin.train.vw num sources = 1 average since example example current current current loss last counter weight label predict features 0.342860 0.342860 10000 10000.0 1.0000 3.3821 95 h 0.301154 0.259449 20000 20000.0 1.0000 4.0657 69 h 0.289920 0.267450 30000 30000.0 -1.0000 -4.0597 170 h 0.280396 0.251826 40000 40000.0 -1.0000 -2.4205 101 h 0.276837 0.262603 50000 50000.0 -1.0000 -7.5515 245 h 0.272949 0.253507 60000 60000.0 -1.0000 -5.6810 138 h finished run number of examples per pass = 6750 passes used = 10 weighted example sum = 67500.000000 weighted label sum = 14760.000000 average loss = 0.256594 h best constant = 0.444511 best constant's loss = 0.669045 total feature number = 8086910
!vw -i model.vw -t -p output.csv win_vs_lin.test.vw --loss_function logistic
only testing predictions = output.csv Num weight bits = 18 learning rate = 0.5 initial_t = 0 power_t = 0.5 using no cache Reading datafile = win_vs_lin.test.vw num sources = 1 average since example example current current current loss last counter weight label predict features 0.693859 0.693859 1 1.0 -1.0000 0.0014 46 0.794020 0.894181 2 2.0 -1.0000 0.3683 252 0.852172 0.910324 4 4.0 -1.0000 1.2895 301 0.437393 0.022613 8 8.0 1.0000 3.0448 56 0.280665 0.123937 16 16.0 1.0000 0.7930 51 0.225321 0.169977 32 32.0 1.0000 2.8181 74 0.175063 0.124804 64 64.0 1.0000 7.9573 50 0.182200 0.189338 128 128.0 1.0000 21.6971 223 0.230316 0.278431 256 256.0 -1.0000 -1.0294 169 0.217809 0.205302 512 512.0 1.0000 8.7075 91 0.235335 0.252861 1024 1024.0 1.0000 3.5087 72 0.233448 0.231560 2048 2048.0 1.0000 3.1936 40 finished run number of examples per pass = 2500 passes used = 1 weighted example sum = 2500.000000 weighted label sum = 652.000000 average loss = 0.238410 best constant = 0.533933 best constant's loss = 0.658742 total feature number = 292019
y_hat = pd.read_csv('output.csv', header=None)
roc_auc_score(test_texts[1].replace({'-1 ': 0, '1 ': 1}), y_hat[0])
0.96415249302305139
Multilabel one-against-all mode is enabled with the flag --multilabel_oaa n, where $n$ is the number of classes.
texts = pd.read_csv('multi_tag.10k.tsv', header=None, sep='\t')
texts.columns = ['text', 'tags']
classes = np.arange(21)
texts['tags'] = map(lambda row: ','.join(map(str, classes[row.astype('bool')])) + ' ', y)
texts[['tags', 'text']].to_csv('multi_tag.vw' , sep='|', header=None, index=False)
!head -n 5 multi_tag.vw | cut -c 1-80
3 | i want to use a track bar to change a form s opacity this is my code decimal
6 | i have an absolutely positioned div containing several children one of which
0,3 | given a datetime representing a person s birthday how do i calculate their
3 | given a specific datetime value how do i display relative time like 2 hours
6,8 | is there any standard way for a web server to be able to determine a user
!vw -d multi_tag.vw --loss_function logistic -f model.vw --multilabel_oaa 21 --passes 10 -c
final_regressor = model.vw Num weight bits = 18 learning rate = 0.5 initial_t = 0 power_t = 0.5 decay_learning_rate = 1 using cache_file = multi_tag.vw.cache ignoring text input in favor of cache input num sources = 1 average since example example current current current loss last counter weight label predict features 1.000000 1.000000 1 1.0 3 66 1.000000 1.000000 2 2.0 6 3 99 0.750000 0.500000 4 4.0 3 22 0.750000 0.750000 8 8.0 5 3 9 0.687500 0.625000 16 16.0 3 56 0.625000 0.562500 32 32.0 12 146 0.671875 0.718750 64 64.0 14 333 0.679688 0.687500 128 128.0 11 13 44 0.742188 0.804688 256 256.0 11 56 0.750000 0.757812 512 512.0 11 11 79 0.745117 0.740234 1024 1024.0 11 11 92 0.753906 0.762695 2048 2048.0 11 11 261 0.735840 0.717773 4096 4096.0 unknown 35 0.726807 0.717773 8192 8192.0 19 11 71 0.729670 0.729670 16384 16384.0 unknown 185 h 0.712637 0.695604 32768 32768.0 unknown 157 h 0.694410 0.676188 65536 65536.0 12 35 h finished run number of examples per pass = 9000 passes used = 10 weighted example sum = 90000.000000 weighted label sum = 0.000000 average loss = 0.674000 h total feature number = 9989130
For simplicity, we compute predictions on the training set.
!vw -i model.vw -p output.csv multi_tag.vw --loss_function logistic
predictions = output.csv Num weight bits = 18 learning rate = 0.5 initial_t = 0 power_t = 0.5 using no cache Reading datafile = multi_tag.vw num sources = 1 average since example example current current current loss last counter weight label predict features 0.000000 0.000000 1 1.0 3 3 66 0.000000 0.000000 2 2.0 6 6 99 0.750000 1.500000 4 4.0 3 22 0.750000 0.750000 8 8.0 5 9 1.000000 1.250000 16 16.0 6 6 157 1.000000 1.000000 32 32.0 16 3 15 16 225 0.953125 0.906250 64 64.0 11 11 181 0.890625 0.828125 128 128.0 9 9 98 0.800781 0.710938 256 256.0 11 11 38 0.755859 0.710938 512 512.0 6 6 311 0.684570 0.613281 1024 1024.0 14 188 0.671875 0.659180 2048 2048.0 11 43 0.639404 0.606934 4096 4096.0 11 166 0.633423 0.627441 8192 8192.0 4 4 63 finished run number of examples per pass = 10000 passes used = 1 weighted example sum = 10000.000000 weighted label sum = 0.000000 average loss = 0.631000 total feature number = 1108776