

There are many types of sparse matrices, each providing different guarantees for the operations on it.
- scipy.sparse.bsr_matrix
- scipy.sparse.coo_matrix
- scipy.sparse.csc_matrix
- scipy.sparse.csr_matrix
- scipy.sparse.dia_matrix
- scipy.sparse.dok_matrix
- scipy.sparse.lil_matrix

More on sparse matrices
Sparse matrices have their own hstack and vstack, which live in scipy.sparse.
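For example (a minimal sketch, not from the original notebook — np.hstack does not do the right thing with sparse inputs, while scipy.sparse.hstack keeps the result sparse):

import numpy as np
from scipy import sparse

a = sparse.csr_matrix(np.eye(3))
b = sparse.csr_matrix(np.ones((3, 2)))
stacked = sparse.hstack([a, b])   # stays sparse (COO by default)
print(stacked.shape, type(stacked))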
Almost all models can work with sparse matrices; a few cannot.
We want to determine a question's tag from its text.
texts = pd.read_csv('windows_vs_linux.10k.tsv', header=None, sep='\t')
texts.columns = ['text', 'is_windows']
print(texts.shape)
texts.head(4)
(10000, 2)
|   | text | is_windows |
|---|---|---|
| 0 | so i find myself porting a game that was orig... | 0 |
| 1 | i ve been using tortoisesvn in a windows envi... | 1 |
| 2 | we are using wmv videos on an internal site a... | 1 |
| 3 | on one linux server running apache and php 5 ... | 0 |
As features we will use binary indicators of whether a word occurs in the document.
vectorizer = CountVectorizer(binary=True)
bow = vectorizer.fit_transform(texts.text)
print(bow.shape)
print(type(bow))
(10000, 40971)
<class 'scipy.sparse.csr.csr_matrix'>
params = {'C': np.logspace(-5, 5, 11)}
clf = LogisticRegression()
cv = GridSearchCV(clf, params, n_jobs=-1, scoring='roc_auc', cv=5)
cv.fit(bow, texts.is_windows);
pd.DataFrame(cv.cv_results_)[['mean_test_score', 'params']].sort_values('mean_test_score', ascending=False)
|   | mean_test_score | params |
|---|---|---|
| 4 | 0.965813 | {u'C': 0.1} |
| 5 | 0.962415 | {u'C': 1.0} |
| 3 | 0.961759 | {u'C': 0.01} |
| 6 | 0.955353 | {u'C': 10.0} |
| 7 | 0.948635 | {u'C': 100.0} |
| 2 | 0.945693 | {u'C': 0.001} |
| 8 | 0.945429 | {u'C': 1000.0} |
| 9 | 0.943599 | {u'C': 10000.0} |
| 10 | 0.942903 | {u'C': 100000.0} |
| 1 | 0.790566 | {u'C': 0.0001} |
| 0 | 0.632507 | {u'C': 1e-05} |
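The helpers get_top_windows and get_top_linux used below are defined elsewhere in the notebook; a plausible reconstruction (an assumption, not the author's code) is that they return the feature names with the largest positive and the most negative logistic-regression weights:

# Hypothetical reconstruction of the helpers used in the next cell.
feature_names = np.array(vectorizer.get_feature_names())

def get_top_windows(clf, n):
    # largest positive weights push the prediction towards is_windows == 1
    return feature_names[np.argsort(clf.coef_[0])[-n:][::-1]]

def get_top_linux(clf, n):
    # most negative weights push the prediction towards is_windows == 0
    return feature_names[np.argsort(clf.coef_[0])[:n]]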
top = pd.DataFrame([get_top_windows(cv.best_estimator_, 6),
                    get_top_linux(cv.best_estimator_, 6)]).T
top.columns = ['Windows', 'Linux']
top
|   | Windows | Linux |
|---|---|---|
| 0 | windows | ubuntu |
| 1 | win32 | root |
| 2 | vista | mono |
| 3 | exe | linux |
| 4 | dll | kernel |
| 5 | batch | bash |
Now imagine that we have a very large number of features and want to reduce it, keeping only the most useful ones.
Here we take all possible $N$-grams with $N \in \{1, \dots, 4\}$.
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 4))
bow = vectorizer.fit_transform(texts.text)
print(bow.shape)
print(type(bow))
(10000, 2117115)
<class 'scipy.sparse.csr.csr_matrix'>
We can select $50000$ features with SelectKBest and see that the cross-validation score is better than before.
Is there a problem with this?
k_best = SelectKBest(k=50000)
bow_k_best = k_best.fit_transform(bow, texts.is_windows)
clf = LogisticRegression()
np.mean(cross_val_score(clf, bow_k_best, texts.is_windows, scoring='roc_auc', cv=5))
0.97955838190531419
When doing feature selection, you always need to check the quality on a held-out set; otherwise the estimate will be strongly inflated, because the selector has already seen the labels of the objects used for validation.
x_train, x_test, y_train, y_test = train_test_split(bow, texts.is_windows)
print(x_train.shape)
print(x_test.shape)
(7500, 2117115)
(2500, 2117115)
k_best = SelectKBest(k=50000)
x_train_k_best = k_best.fit_transform(x_train, y_train)
x_test_k_best = k_best.transform(x_test)
clf = LogisticRegression()
clf.fit(x_train_k_best, y_train)
roc_auc_score(y_test, clf.predict_proba(x_test_k_best)[:, 1])
0.96876106147585705
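The leak can also be avoided structurally (a sketch, not from the original notebook): put SelectKBest inside a Pipeline, so that feature selection is re-fitted on every training fold during cross-validation and never sees the validation labels.

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ('select', SelectKBest(k=50000)),   # fitted only on the training part of each fold
    ('clf', LogisticRegression()),
])
np.mean(cross_val_score(pipe, bow, texts.is_windows, scoring='roc_auc', cv=5))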
y_hat = cross_val_predict(cv.best_estimator_, bow, texts.is_windows, method='predict_proba')[:, 1]
y_hat[:5]
array([ 0.12431717, 0.99933152, 0.98910196, 0.02073042, 0.84102736])
Any transformation of the scores that preserves their order gives the same $AUC$.
print(roc_auc_score(texts.is_windows, y_hat))
print(roc_auc_score(texts.is_windows, y_hat * 2 + 1))
print(roc_auc_score(texts.is_windows, y_hat ** 2))
0.963904819501
0.963904819501
0.963904819501
# plot, xlim, xlabel, ... are bare pyplot names (the notebook presumably uses %pylab inline)
fpr, tpr, _ = roc_curve(texts.is_windows, y_hat)
plot(fpr, tpr, lw=2);
plot([0, 1], [0, 1], linestyle='--');
xlim([0.0, 1.0])
ylim([0.0, 1.05])
xlabel('False Positive Rate')
ylabel('True Positive Rate');
texts = pd.read_csv('multi_tag.10k.tsv', header=None, sep='\t')
texts.columns = ['text', 'tags']
print(texts.shape)
texts.head(4)
(10000, 2)
|   | text | tags |
|---|---|---|
| 0 | i want to use a track bar to change a form s ... | c# winforms type-conversion decimal opacity |
| 1 | i have an absolutely positioned div containin... | html css css3 internet-explorer-7 |
| 2 | given a datetime representing a person s birt... | c# .net datetime |
| 3 | given a specific datetime value how do i disp... | c# datetime datediff relative-time-span |
Some methods handle multiple labels out of the box:

- KNeighborsClassifier
- RandomForestClassifier
- SVC

The rest can be generalized by training several models:

- OneVsRestClassifier
- OneVsOneClassifier

$tp$, $fp$ and $fn$ are computed over all the tags of a single object:
y, y_hat = np.array([[1, 1, 0, 0]]), np.array([[1, 0, 1, 0]])
tp, fp, fn = 1., 1., 1.
p, r = tp / (tp + fp), tp / (tp + fn)
f1 = 2 * p * r / (p + r)
print(f1)
print(f1_score(y, y_hat, average='samples'))
0.5
0.5
Let's repeat everything as in the two-class case.
vectorizer = CountVectorizer(binary=True)
bow = vectorizer.fit_transform(texts.text)
print(bow.shape)
print(type(bow))
(10000, 35247)
<class 'scipy.sparse.csr.csr_matrix'>
tags = texts.tags.apply(lambda x: x.split())
all_tags = reduce(lambda s, x: s + x, tags, [])
values, count = np.unique(all_tags, return_counts=True)
top_tags = sorted(zip(count, values), reverse=True)[:20]
top_tags[:5]
[(1198, 'c#'), (1090, '.net'), (696, 'java'), (610, 'asp.net'), (472, 'sql-server')]
Transform the lists of tags into a matrix of indicators showing whether a question has a given tag.
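The helper filter_tags is not shown in this excerpt; a plausible sketch (an assumption — the resulting matrix has 21 columns, so the real helper probably keeps one more tag or adds an extra bucket) is that it keeps only the most frequent tags collected above:

# Hypothetical reconstruction of filter_tags: keep only tags from a whitelist
# of the most frequent ones.
top_tag_names = set(tag for _, tag in top_tags)

def filter_tags(tag_list):
    return [tag for tag in tag_list if tag in top_tag_names]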
binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(texts.tags.apply(lambda x: filter_tags(x.split())))
print(y.shape)
print(type(y))
(10000, 21)
<type 'numpy.ndarray'>
LogisticRegression again, but now wrapped in OneVsRestClassifier.
params = {'estimator__C': np.logspace(-5, 5, 11)}
clf = OneVsRestClassifier(LogisticRegression())
cv = GridSearchCV(clf, params, n_jobs=-1, scoring=make_scorer(f1_score, average='samples'), cv=5)
cv.fit(bow, y);
pd.DataFrame(cv.cv_results_)[['mean_test_score', 'params']].sort_values('mean_test_score', ascending=False)
|   | mean_test_score | params |
|---|---|---|
| 10 | 0.404732 | {u'estimator__C': 100000.0} |
| 9 | 0.404212 | {u'estimator__C': 10000.0} |
| 8 | 0.404140 | {u'estimator__C': 1000.0} |
| 7 | 0.403066 | {u'estimator__C': 100.0} |
| 6 | 0.402630 | {u'estimator__C': 10.0} |
| 5 | 0.390346 | {u'estimator__C': 1.0} |
| 4 | 0.319707 | {u'estimator__C': 0.1} |
| 3 | 0.092920 | {u'estimator__C': 0.01} |
| 2 | 0.000300 | {u'estimator__C': 0.001} |
| 0 | 0.000000 | {u'estimator__C': 1e-05} |
| 1 | 0.000000 | {u'estimator__C': 0.0001} |
The same thing, but with TfidfVectorizer instead of CountVectorizer.
vectorizer = TfidfVectorizer()
tf_idf = vectorizer.fit_transform(texts.text)
print(tf_idf.shape)
print(type(tf_idf))
(10000, 35247)
<class 'scipy.sparse.csr.csr_matrix'>
Searching for the best parameters.
params = {'estimator__C': np.logspace(-5, 5, 11)}
clf = OneVsRestClassifier(LogisticRegression())
cv = GridSearchCV(clf, params, n_jobs=-1, scoring=make_scorer(f1_score, average='samples'), cv=5)
cv.fit(tf_idf, y);
pd.DataFrame(cv.cv_results_)[['mean_test_score', 'params']].sort_values('mean_test_score', ascending=False)
|   | mean_test_score | params |
|---|---|---|
| 10 | 0.382173 | {u'estimator__C': 100000.0} |
| 9 | 0.380033 | {u'estimator__C': 10000.0} |
| 8 | 0.376427 | {u'estimator__C': 1000.0} |
| 7 | 0.364680 | {u'estimator__C': 100.0} |
| 6 | 0.334247 | {u'estimator__C': 10.0} |
| 5 | 0.182323 | {u'estimator__C': 1.0} |
| 4 | 0.000700 | {u'estimator__C': 0.1} |
| 0 | 0.000000 | {u'estimator__C': 1e-05} |
| 1 | 0.000000 | {u'estimator__C': 0.0001} |
| 2 | 0.000000 | {u'estimator__C': 0.001} |
| 3 | 0.000000 | {u'estimator__C': 0.01} |
predict returns 1 when the predicted class probability is greater than $0.5$.
clf = OneVsRestClassifier(LogisticRegression(C=100000))
y_hat_bow = cross_val_predict(clf, bow, y, method='predict_proba')
y_hat_tf_idf = cross_val_predict(clf, tf_idf, y, method='predict_proba')
A function that assigns tags depending on the threshold and returns the sample-averaged F1.
def get_score(alpha, y, y_hat):
    return f1_score(y, (y_hat > alpha).astype('int'), average='samples')
alphas = np.linspace(0.0, 0.01, 100)
scores = [get_score(alpha, y, y_hat_bow) for alpha in alphas]
plot(alphas, scores);
scatter(alphas[np.argmax(scores)], np.max(scores));
print(np.max(scores))
print(alphas[np.argmax(scores)])
0.454356904762
0.00535353535354
alphas = np.linspace(0.0, 0.01, 100)
scores = [get_score(alpha, y, y_hat_tf_idf) for alpha in alphas]
plot(alphas, scores);
scatter(alphas[np.argmax(scores)], np.max(scores));
print(np.max(scores))
print(alphas[np.argmax(scores)])
0.493972857143
0.00191919191919
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 3))
bow = vectorizer.fit_transform(texts.text)
print(bow.shape)
print(type(bow))
(10000, 1132373)
<class 'scipy.sparse.csr.csr_matrix'>
Instead of storing each N-gram variant separately, we obtain the column index by hashing its content.
This lets us fix an arbitrary number of columns in advance.
vectorizer = HashingVectorizer(binary=True, ngram_range=(1, 3), n_features=50000)
bow = vectorizer.fit_transform(texts.text)
print(bow.shape)
print(type(bow))
(10000, 50000)
<class 'scipy.sparse.csr.csr_matrix'>
clf = OneVsRestClassifier(LogisticRegression(C=100000))
y_hat_bow = cross_val_predict(clf, bow, y, method='predict_proba')
alphas = np.linspace(0.0, 0.01, 100)
scores = [get_score(alpha, y, y_hat_bow) for alpha in alphas]
plot(alphas, scores);
scatter(alphas[np.argmax(scores)], np.max(scores));
print(np.max(scores))
print(alphas[np.argmax(scores)])
0.509255
0.00191919191919
You can obtain dense matrices in any way you like.
This reduces training time and allows using methods that are better suited to dense matrices.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tf_idf = vectorizer.fit_transform(texts.text)
print(tf_idf.shape)
print(type(tf_idf))
(10000, 397950)
<class 'scipy.sparse.csr.csr_matrix'>
svd = TruncatedSVD(n_components=200, n_iter=5)
tf_idf_svd = svd.fit_transform(tf_idf)
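A hypothetical continuation (not in the original notebook): with a dense 200-dimensional representation we could, for example, fit a random forest and score it with the same sample-averaged F1.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score, make_scorer

clf_dense = RandomForestClassifier(n_estimators=100, n_jobs=-1)
scores = cross_val_score(clf_dense, tf_idf_svd, y,
                         scoring=make_scorer(f1_score, average='samples'), cv=5)
print(scores.mean())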
When we have several models, we can blend their predictions.
If the models are not strongly correlated, blending often improves the quality of the resulting model.
vectorizer = HashingVectorizer(binary=True, ngram_range=(1, 3), n_features=50000)
bow = vectorizer.fit_transform(texts.text)
print(bow.shape)
print(type(bow))
(10000, 50000)
<class 'scipy.sparse.csr.csr_matrix'>
vectorizer = TfidfVectorizer()
tf_idf = vectorizer.fit_transform(texts.text)
print(tf_idf.shape)
print(type(tf_idf))
(10000, 35247)
<class 'scipy.sparse.csr.csr_matrix'>
Using cross-validation, we predict the training set with each model.
As a result we get unbiased predictions for objects whose labels we already know.
# folds is a CV splitter defined earlier in the notebook (not shown here), e.g. a KFold instance
clf_lr = OneVsRestClassifier(LogisticRegression(C=100000))
y_hat_lr = cross_val_predict(clf_lr, bow, y, method='predict_proba', cv=folds)
clf_lr = OneVsRestClassifier(LogisticRegression(C=100000))
y_hat_lr_tf_idf = cross_val_predict(clf_lr, tf_idf, y, method='predict_proba', cv=folds)
Let's compute the quality of each model separately and of their blend.
alphas = np.linspace(0.0, 0.02, 100)
lr_scores = [get_score(alpha, y, y_hat_lr) for alpha in alphas]
# despite the name, nb_scores holds the scores of the tf-idf logistic regression
nb_scores = [get_score(alpha, y, y_hat_lr_tf_idf) for alpha in alphas]
lr_nb_scores = [get_score(alpha, y, 0.5 * y_hat_lr_tf_idf + 0.5 * y_hat_lr) for alpha in alphas]
print(np.max(lr_scores))
print(np.max(nb_scores))
print(np.max(lr_nb_scores))
0.50923452381
0.493558095238
0.527097619048
plot(alphas, lr_scores);
plot(alphas, nb_scores);
plot(alphas, lr_nb_scores);
scatter(alphas[np.argmax(lr_scores)], np.max(lr_scores));
scatter(alphas[np.argmax(nb_scores)], np.max(nb_scores));
scatter(alphas[np.argmax(lr_nb_scores)], np.max(lr_nb_scores));
Instead of blending the outputs manually, we can feed them as inputs to another algorithm.
Prepare a variable stacked that contains the predictions of the previous algorithms.
stacked = np.hstack([y_hat_lr, y_hat_lr_tf_idf])
clf_stacked = OneVsRestClassifier(RandomForestClassifier(n_estimators=100))
y_hat_stacked = cross_val_predict(clf_stacked, stacked, y, method='predict_proba', cv=folds)
After tuning the threshold we get $F1 = 0.547874$, which is better than all previous results.
alphas = np.linspace(0, 1, 100)
scores = [get_score(alpha, y, y_hat_stacked) for alpha in alphas]
plot(alphas, scores);
scatter(alphas[np.argmax(scores)], np.max(scores));
print(np.max(scores))
print(alphas[np.argmax(scores)])
0.547874126984
0.232323232323
Stacking or blending is used in one form or another in almost every competition, so it is very important to understand how they work and how to use them.
!vw -h
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile =
num sources = 1
VW options:
--random_seed arg seed random number generator
--ring_size arg size of example ring
Update options:
-l [ --learning_rate ] arg Set learning rate
--power_t arg t power value
--decay_learning_rate arg Set Decay factor for learning_rate
between passes
--initial_t arg initial t value
--feature_mask arg Use existing regressor to determine
which parameters may be updated. If no
initial_regressor given, also used for
initial weights.
Weight options:
-i [ --initial_regressor ] arg Initial regressor(s)
--initial_weight arg Set all weights to an initial value of
arg.
--random_weights arg make initial weights random
--input_feature_regularizer arg Per feature regularization input file
Parallelization options:
--span_server arg Location of server for setting up
spanning tree
--threads Enable multi-threading
--unique_id arg (=0) unique id used for cluster parallel
jobs
--total arg (=1) total number of nodes used in cluster
parallel job
--node arg (=0) node number in cluster parallel job
Diagnostic options:
--version Version information
-a [ --audit ] print weights of features
-P [ --progress ] arg Progress update frequency. int:
additive, float: multiplicative
--quiet Don't output disgnostics and progress
updates
-h [ --help ] Look here: http://hunch.net/~vw/ and
click on Tutorial.
Feature options:
--hash arg how to hash the features. Available
options: strings, all
--ignore arg ignore namespaces beginning with
character <arg>
--keep arg keep namespaces beginning with
character <arg>
--redefine arg redefine namespaces beginning with
characters of string S as namespace N.
<arg> shall be in form 'N:=S' where :=
is operator. Empty N or S are treated
as default namespace. Use ':' as a
wildcard in S.
-b [ --bit_precision ] arg number of bits in the feature table
--noconstant Don't add a constant feature
-C [ --constant ] arg Set initial value of constant
--ngram arg Generate N grams. To generate N grams
for a single namespace 'foo', arg
should be fN.
--skips arg Generate skips in N grams. This in
conjunction with the ngram tag can be
used to generate generalized
n-skip-k-gram. To generate n-skips for
a single namespace 'foo', arg should be
fN.
--feature_limit arg limit to N features. To apply to a
single namespace 'foo', arg should be
fN
--affix arg generate prefixes/suffixes of features;
argument '+2a,-3b,+1' means generate
2-char prefixes for namespace a, 3-char
suffixes for b and 1 char prefixes for
default namespace
--spelling arg compute spelling features for a give
namespace (use '_' for default
namespace)
--dictionary arg read a dictionary for additional
features (arg either 'x:file' or just
'file')
--dictionary_path arg look in this directory for
dictionaries; defaults to current
directory or env{PATH}
--interactions arg Create feature interactions of any
level between namespaces.
--permutations Use permutations instead of
combinations for feature interactions
of same namespace.
--leave_duplicate_interactions Don't remove interactions with
duplicate combinations of namespaces.
For ex. this is a duplicate: '-q ab -q
ba' and a lot more in '-q ::'.
-q [ --quadratic ] arg Create and use quadratic features
--q: arg : corresponds to a wildcard for all
printable characters
--cubic arg Create and use cubic features
Example options:
-t [ --testonly ] Ignore label information and just test
--holdout_off no holdout data in multiple passes
--holdout_period arg holdout period for test only, default
10
--holdout_after arg holdout after n training examples,
default off (disables holdout_period)
--early_terminate arg Specify the number of passes tolerated
when holdout loss doesn't decrease
before early termination, default is 3
--passes arg Number of Training Passes
--initial_pass_length arg initial number of examples per pass
--examples arg number of examples to parse
--min_prediction arg Smallest prediction to output
--max_prediction arg Largest prediction to output
--sort_features turn this on to disregard order in
which features have been defined. This
will lead to smaller cache sizes
--loss_function arg (=squared) Specify the loss function to be used,
uses squared by default. Currently
available ones are squared, classic,
hinge, logistic, quantile and poisson.
--quantile_tau arg (=0.5) Parameter \tau associated with Quantile
loss. Defaults to 0.5
--l1 arg l_1 lambda
--l2 arg l_2 lambda
--named_labels arg use names for labels (multiclass, etc.)
rather than integers, argument
specified all possible labels,
comma-sep, eg "--named_labels
Noun,Verb,Adj,Punc"
Output model:
-f [ --final_regressor ] arg Final regressor
--readable_model arg Output human-readable final regressor
with numeric features
--invert_hash arg Output human-readable final regressor
with feature names. Computationally
expensive.
--save_resume save extra state so learning can be
resumed later with new data
--save_per_pass Save the model after every pass over
data
--output_feature_regularizer_binary arg
Per feature regularization output file
--output_feature_regularizer_text arg Per feature regularization output file,
in text
--id arg User supplied ID embedded into the
final regressor
Output options:
-p [ --predictions ] arg File to output predictions to
-r [ --raw_predictions ] arg File to output unnormalized predictions
to
Reduction options, use [option] --help for more info:
--audit_regressor arg stores feature names and their
regressor values. Same dataset must be
used for both regressor training and
this mode.
--bootstrap arg k-way bootstrap by online importance
resampling
--search arg Use learning to search,
argument=maximum action id or 0 for LDF
--replay_c arg use experience replay at a specified
level [b=classification/regression,
m=multiclass, c=cost sensitive] with
specified buffer size
--explore_eval Evaluate explore_eval adf policies
--cbify arg Convert multiclass on <k> classes into
a contextual bandit problem
--cb_explore_adf Online explore-exploit for a contextual
bandit problem with multiline action
dependent features
--cb_explore arg Online explore-exploit for a <k> action
contextual bandit problem
--multiworld_test arg Evaluate features as a policies
--cb_adf Do Contextual Bandit learning with
multiline action dependent features.
--cb arg Use contextual bandit learning with <k>
costs
--csoaa_ldf arg Use one-against-all multiclass learning
with label dependent features. Specify
singleline or multiline.
--wap_ldf arg Use weighted all-pairs multiclass
learning with label dependent features.
Specify singleline or multiline.
--interact arg Put weights on feature products from
namespaces <n1> and <n2>
--csoaa arg One-against-all multiclass with <k>
costs
--multilabel_oaa arg One-against-all multilabel with <k>
labels
--recall_tree arg Use online tree for multiclass
--log_multi arg Use online tree for multiclass
--ect arg Error correcting tournament with <k>
labels
--boosting arg Online boosting with <N> weak learners
--oaa arg One-against-all multiclass with <k>
labels
--top arg top k recommendation
--replay_m arg use experience replay at a specified
level [b=classification/regression,
m=multiclass, c=cost sensitive] with
specified buffer size
--binary report loss as binary classification on
-1,1
--link arg (=identity) Specify the link function: identity,
logistic, glf1 or poisson
--stage_poly use stagewise polynomial feature
learning
--lrqfa arg use low rank quadratic features with
field aware weights
--lrq arg use low rank quadratic features
--autolink arg create link function with polynomial d
--marginal arg substitute marginal label estimates for
ids
--new_mf arg rank for reduction-based matrix
factorization
--nn arg Sigmoidal feedforward network with <k>
hidden units
confidence options:
--confidence_after_training Confidence after training
--confidence Get confidence for binary predictions
--active_cover enable active learning with cover
--active enable active learning
--replay_b arg use experience replay at a specified
level [b=classification/regression,
m=multiclass, c=cost sensitive] with
specified buffer size
--OjaNewton Online Newton with Oja's Sketch
--bfgs use bfgs optimization
--conjugate_gradient use conjugate gradient based
optimization
--lda arg Run lda with <int> topics
--noop do no learning
--print print examples
--rank arg rank for matrix factorization.
--sendto arg send examples to <host>
--svrg Streaming Stochastic Variance Reduced
Gradient
--ftrl FTRL: Follow the Proximal Regularized
Leader
--pistol FTRL: Parameter-free Stochastic
Learning
--ksvm kernel svm
Gradient Descent options:
--sgd use regular stochastic gradient descent
update.
--adaptive use adaptive, individual learning
rates.
--invariant use safe/importance aware updates.
--normalized use per feature normalized updates
--sparse_l2 arg (=0) use per feature normalized updates
Input options:
-d [ --data ] arg Example Set
--daemon persistent daemon mode on port 26542
--port arg port to listen on; use 0 to pick unused
port
--num_children arg number of children for persistent
daemon mode
--pid_file arg Write pid file in persistent daemon
mode
--port_file arg Write port used in persistent daemon
mode
-c [ --cache ] Use a cache. The default is
<data>.cache
--cache_file arg The location(s) of cache_file.
--json Enable JSON parsing.
-k [ --kill_cache ] do not reuse existing cache: create a
new one always
--compressed use gzip format whenever possible. If a
cache file is being created, this
option creates a compressed cache file.
A mixture of raw-text & compressed
inputs are supported with
autodetection.
--no_stdin do not default to reading from stdin
Label [weight] |Namespace Feature ... |Namespace ...

- Label — the class label for classification or a real number for regression
- weight — the example weight; by default all examples have the same weight
- Namespace — features are grouped into namespaces; namespaces can be used to include features selectively or to build quadratic features between them
- Feature — string[:value] or int[:value]; strings are hashed, integers are used directly as indices into the feature vector; value defaults to $1$

A function $h$ is introduced which gives the index at which a feature's value is written into the example's feature vector.
$$h : F \rightarrow \{0, \dots, 2^b - 1\}$$

The size of the hash function's range is set with -b (--bit_precision).
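Conceptually the hashing trick looks like this (a toy sketch in Python; vw's actual hash function is different, so the indices are purely illustrative):

# Every feature name is mapped into one of 2**b slots of the feature vector.
b = 18
def feature_index(name, b=b):
    return hash(name) % (2 ** b)

print(feature_index('windows'), feature_index('ubuntu'))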
It can use either SGD or L-BFGS.
- SGD is the default; it allows online learning and almost always needs several passes over the data.
- L-BFGS is enabled with --bfgs and works only on data of modest size.
- The number of SGD passes is set with --passes.

We go over all the elements of the training set many times, making a correction to the weights on each object.
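The exact update depends on the chosen options; a commonly cited form of vw's SGD step and learning-rate schedule (reconstructed here as an approximation, not taken from these notes) is

$$w \leftarrow w - \eta_t \, \nabla_w \ell(w; x_t, y_t), \qquad \eta_t = \lambda \, d^{\,k} \left(\frac{t_0}{t_0 + t}\right)^{p},$$

where $\lambda$ corresponds to -l, $d$ to --decay_learning_rate, $t_0$ to --initial_t and $p$ to --power_t.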
Here $t$ is the ordinal number of the training example and $k$ is the number of the pass over the whole dataset.
The schedule is controlled by -l (learning rate), --decay_learning_rate, --initial_t, --power_t and --passes.

average loss — the loss computed by progressive validation
$e_i$ — the loss on object $x_i$ for a model trained on the objects $\{x_1, \dots, x_{i-1}\}$
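With the usual definition of progressive validation this amounts to (a reconstruction, since the formula itself is missing from these notes):

$$\text{average loss} = \frac{1}{n} \sum_{i=1}^{n} e_i$$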
-1 | A:1 B:10
1 | A:-1 B:12
-1 | so i find myself porting a game that was originally written
1 | i ve been using tortoisesvn in a windows environment for quite some time
Windows vs Linux

First, let's convert the data into the vw input format.
texts = pd.read_csv('windows_vs_linux.10k.tsv', sep='\t', header=None)
# map the 0/1 labels to the -1/1 labels that vw expects; the trailing space keeps
# the "label | text" layout when writing with sep='|'
texts[1].replace({0: '-1 ', 1: '1 '}, inplace=True)
train_texts, test_texts = train_test_split(texts)
train_texts[[1, 0]].to_csv('win_vs_lin.train.vw', sep='|', header=None, index=False)
test_texts[[1, 0]].to_csv('win_vs_lin.test.vw', sep='|', header=None, index=False)
!head -n 5 win_vs_lin.train.vw | cut -c 1-50
1 | i have a bat file shown below echo off for f d
1 | i need a way to determine whether the computer
-1 | my c application uses 3rd libraries which do
-1 | currently i m trying to install php 5 3 0 on
1 | i how to get the windowproc for a form in c fo
!vw -d win_vs_lin.train.vw --loss_function logistic -P 10000 -f model.vw --passes 100 -c
final_regressor = model.vw
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = win_vs_lin.train.vw.cache
Reading datafile = win_vs_lin.train.vw
num sources = 1
average    since       example     example  current  current  current
loss       last        counter      weight    label  predict features
0.342860   0.342860      10000     10000.0   1.0000   3.3821       95 h
0.301154   0.259449      20000     20000.0   1.0000   4.0657       69 h
0.289920   0.267450      30000     30000.0  -1.0000  -4.0597      170 h
0.280396   0.251826      40000     40000.0  -1.0000  -2.4205      101 h
0.276837   0.262603      50000     50000.0  -1.0000  -7.5515      245 h
0.272949   0.253507      60000     60000.0  -1.0000  -5.6810      138 h

finished run
number of examples per pass = 6750
passes used = 10
weighted example sum = 67500.000000
weighted label sum = 14760.000000
average loss = 0.256594 h
best constant = 0.444511
best constant's loss = 0.669045
total feature number = 8086910
!vw -i model.vw -t -p output.csv win_vs_lin.test.vw --loss_function logistic
only testing
predictions = output.csv
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = win_vs_lin.test.vw
num sources = 1
average    since       example     example  current  current  current
loss       last        counter      weight    label  predict features
0.693859   0.693859          1         1.0  -1.0000   0.0014       46
0.794020   0.894181          2         2.0  -1.0000   0.3683      252
0.852172   0.910324          4         4.0  -1.0000   1.2895      301
0.437393   0.022613          8         8.0   1.0000   3.0448       56
0.280665   0.123937         16        16.0   1.0000   0.7930       51
0.225321   0.169977         32        32.0   1.0000   2.8181       74
0.175063   0.124804         64        64.0   1.0000   7.9573       50
0.182200   0.189338        128       128.0   1.0000  21.6971      223
0.230316   0.278431        256       256.0  -1.0000  -1.0294      169
0.217809   0.205302        512       512.0   1.0000   8.7075       91
0.235335   0.252861       1024      1024.0   1.0000   3.5087       72
0.233448   0.231560       2048      2048.0   1.0000   3.1936       40

finished run
number of examples per pass = 2500
passes used = 1
weighted example sum = 2500.000000
weighted label sum = 652.000000
average loss = 0.238410
best constant = 0.533933
best constant's loss = 0.658742
total feature number = 292019
y_hat = pd.read_csv('output.csv', header=None)
roc_auc_score(test_texts[1].replace({'-1 ': 0, '1 ': 1}), y_hat[0])
0.96415249302305139
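Since AUC depends only on the ordering of the scores, the raw vw margins can be used directly; if calibrated probabilities were needed, one could apply a sigmoid to the margins (a sketch; vw's --link logistic option serves a similar purpose):

from scipy.special import expit

probs = expit(y_hat[0])          # monotone transform, so the AUC is unchanged
roc_auc_score(test_texts[1].replace({'-1 ': 0, '1 ': 1}), probs)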
Multilabel one-against-all is enabled with the flag --multilabel_oaa n, where $n$ is the number of classes.
texts = pd.read_csv('multi_tag.10k.tsv', header=None, sep='\t')
texts.columns = ['text', 'tags']
classes = np.arange(21)
# turn each row of the indicator matrix y into a comma-separated list of class
# indices, e.g. '0,3 ' (Python 2: map returns a list here)
texts['tags'] = map(lambda row: ','.join(map(str, classes[row.astype('bool')])) + ' ', y)
texts[['tags', 'text']].to_csv('multi_tag.vw', sep='|', header=None, index=False)
!head -n 5 multi_tag.vw | cut -c 1-80
3 | i want to use a track bar to change a form s opacity this is my code decimal
6 | i have an absolutely positioned div containing several children one of which
0,3 | given a datetime representing a person s birthday how do i calculate their
3 | given a specific datetime value how do i display relative time like 2 hours
6,8 | is there any standard way for a web server to be able to determine a user
!vw -d multi_tag.vw --loss_function logistic -f model.vw --multilabel_oaa 21 --passes 10 -c
final_regressor = model.vw
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = multi_tag.vw.cache
ignoring text input in favor of cache input
num sources = 1
average    since       example     example  current  current  current
loss       last        counter      weight    label  predict features
1.000000   1.000000          1         1.0        3                66
1.000000   1.000000          2         2.0        6        3       99
0.750000   0.500000          4         4.0        3                22
0.750000   0.750000          8         8.0        5        3        9
0.687500   0.625000         16        16.0        3                56
0.625000   0.562500         32        32.0       12               146
0.671875   0.718750         64        64.0       14               333
0.679688   0.687500        128       128.0       11       13       44
0.742188   0.804688        256       256.0       11                56
0.750000   0.757812        512       512.0       11       11       79
0.745117   0.740234       1024      1024.0       11       11       92
0.753906   0.762695       2048      2048.0       11       11      261
0.735840   0.717773       4096      4096.0  unknown                35
0.726807   0.717773       8192      8192.0       19       11       71
0.729670   0.729670      16384     16384.0  unknown               185 h
0.712637   0.695604      32768     32768.0  unknown               157 h
0.694410   0.676188      65536     65536.0       12                35 h

finished run
number of examples per pass = 9000
passes used = 10
weighted example sum = 90000.000000
weighted label sum = 0.000000
average loss = 0.674000 h
total feature number = 9989130
For simplicity, let's get the predictions on the training set.
!vw -i model.vw -p output.csv multi_tag.vw --loss_function logistic
predictions = output.csv
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = multi_tag.vw
num sources = 1
average    since       example     example  current  current  current
loss       last        counter      weight    label  predict features
0.000000   0.000000          1         1.0        3        3       66
0.000000   0.000000          2         2.0        6        6       99
0.750000   1.500000          4         4.0        3                22
0.750000   0.750000          8         8.0        5                 9
1.000000   1.250000         16        16.0        6        6      157
1.000000   1.000000         32        32.0       16  3 15 16      225
0.953125   0.906250         64        64.0       11       11      181
0.890625   0.828125        128       128.0        9        9       98
0.800781   0.710938        256       256.0       11       11       38
0.755859   0.710938        512       512.0        6        6      311
0.684570   0.613281       1024      1024.0       14               188
0.671875   0.659180       2048      2048.0       11                43
0.639404   0.606934       4096      4096.0       11               166
0.633423   0.627441       8192      8192.0        4        4       63

finished run
number of examples per pass = 10000
passes used = 1
weighted example sum = 10000.000000
weighted label sum = 0.000000
average loss = 0.631000
total feature number = 1108776