How to get the most informative features for scikit-learn classifiers?

bestdevel 2020. 11. 20. 09:18

Classifiers in machine learning packages such as liblinear and nltk offer a show_most_informative_features() method, which is really helpful for debugging features:

viagra = None          ok : spam     =      4.5 : 1.0
hello = True           ok : spam     =      4.5 : 1.0
hello = None           spam : ok     =      3.3 : 1.0
viagra = True          spam : ok     =      3.3 : 1.0
casino = True          spam : ok     =      2.0 : 1.0
casino = None          ok : spam     =      1.5 : 1.0

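My question is whether something similar is implemented for the classifiers in scikit-learn. I searched the documentation but could not find anything like it.

If there is no such function yet, does anybody know a workaround to get those values?

Thanks a lot!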

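The classifiers themselves do not record feature names; they only see numeric arrays. However, if you extracted your features using a Vectorizer/CountVectorizer/TfidfVectorizer/DictVectorizer, and you are using a linear model (e.g. LinearSVC or Naive Bayes), then you can apply the same trick that the document classification example uses. Example (untested, may contain a bug or two):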

import numpy as np

def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    # get_feature_names() was removed in scikit-learn 1.2; use the _out variant
    feature_names = vectorizer.get_feature_names_out()
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " ".join(feature_names[j] for j in top10)))

This is for multiclass classification; for the binary case, I think you should use clf.coef_[0] only. You may have to sort the class_labels.

With the help of larsmans' code I came up with this code for the binary case:

def show_most_informative_features(vectorizer, clf, n=20):
    feature_names = vectorizer.get_feature_names_out()
    # Sort (coefficient, name) pairs from most negative to most positive
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    # Pair the n most negative features with the n most positive ones
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))

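As an update, RandomForestClassifier now supports the .feature_importances_ attribute. This attribute tells you how much each feature contributes to the model's splits (scikit-learn computes it as an impurity-based importance). The values are normalized, so they sum to 1 for any forest that makes at least one split.

I find this attribute very useful when doing feature engineering.

Thanks to the scikit-learn team and contributors for implementing this!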

Edit: this works for both RandomForest and GradientBoosting. So RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier and GradientBoostingRegressor all support it.
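A minimal sketch of reading feature_importances_ from both ensemble types. The iris dataset bundled with scikit-learn is used here only as a convenient stand-in, and the n_estimators and random_state values are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

data = load_iris()
X, y = data.data, data.target

ranked = {}
models = (
    RandomForestClassifier(n_estimators=50, random_state=0),
    GradientBoostingClassifier(n_estimators=50, random_state=0),
)
for model in models:
    model.fit(X, y)
    importances = model.feature_importances_
    # Indices of the features, most important first.
    order = np.argsort(importances)[::-1]
    ranked[type(model).__name__] = [
        (data.feature_names[i], float(importances[i])) for i in order
    ]

for name, pairs in ranked.items():
    print(name)
    for feature, score in pairs:
        print("  %-20s %.3f" % (feature, score))
```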

We recently released a library (https://github.com/TeamHG-Memex/eli5) that makes this possible: it handles various classifiers from scikit-learn, covers binary and multiclass cases, can highlight text according to feature values, integrates with IPython, and so on.

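I had to find feature importances for my NaiveBayes classifier, and although I used the functions above, I was not able to get feature importance per class. I went through the scikit-learn documentation and tweaked the functions above a bit to make them work for my problem. Hope it helps you too!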

def important_features(vectorizer, classifier, n=20):
    class_labels = classifier.classes_
    feature_names = vectorizer.get_feature_names_out()

    # Rank features by their raw count within each class
    topn_class1 = sorted(zip(classifier.feature_count_[0], feature_names), reverse=True)[:n]
    topn_class2 = sorted(zip(classifier.feature_count_[1], feature_names), reverse=True)[:n]

    print("Important words in negative reviews")

    for coef, feat in topn_class1:
        print(class_labels[0], coef, feat)

    print("-----------------------------------------")
    print("Important words in positive reviews")

    for coef, feat in topn_class2:
        print(class_labels[1], coef, feat)

Note that for this to work, your classifier (in my case NaiveBayes) must have the feature_count_ attribute.

You can also do something like this to plot the important features in order of importance:

import numpy as np
import matplotlib.pyplot as plt

# clf is a fitted random forest; train[features] is the training DataFrame
importances = clf.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf.estimators_], axis=0)
indices = np.argsort(importances)[::-1]
n_features = train[features].shape[1]

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(n_features), importances[indices],
        color="r", yerr=std[indices], align="center")
plt.xticks(range(n_features), indices)
plt.xlim([-1, n_features])
plt.show()

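RandomForestClassifier does not have a coef_ attribute yet (at the time of writing it was expected in the 0.17 release). However, see the RandomForestClassifierWithCoef class in "Recursive feature elimination on Random Forest using scikit-learn". It may give you some ideas for working around this limitation.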

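Not exactly what you are looking for, but here is a quick way to get the largest coefficients (assuming the columns of your Pandas dataframe are your feature names).

You trained the model like this: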

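from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

lr = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(df, Y, test_size=0.25)
lr.fit(X_train, y_train)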

Get the 10 most negative coefficient values like this (or change to reverse=True for the most positive ones):

# X_train.columns holds the feature names the model was fitted on
sorted(zip(X_train.columns, lr.coef_), key=lambda x: x[1],
       reverse=False)[:10]
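The snippet above sorts by signed value, so large positive and large negative coefficients end up at opposite ends. To rank by magnitude regardless of sign, one option is to sort on the absolute value. This is a sketch on synthetic data with made-up column names and weights, chosen so the ranking is easy to check.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Synthetic features; the column names are placeholders.
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
# Noise-free target with known weights 3, -5 and 0.5.
Y = 3.0 * df["a"] - 5.0 * df["b"] + 0.5 * df["c"]

lr = LinearRegression().fit(df, Y)

coefs = pd.Series(lr.coef_, index=df.columns)
# Order by absolute value, largest first, but keep the sign in the result.
by_magnitude = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(by_magnitude.head(10))
```

Keep in mind that coefficient magnitudes are only comparable across features when the features are on similar scales.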

Reference URL: https://stackoverflow.com/questions/11116697/how-to-get-most-informative-features-for-scikit-learn-classifiers
