sklearn.datasets.fetch_20newsgroups_vectorized

sklearn.datasets.fetch_20newsgroups_vectorized(subset='train', remove=(), data_home=None, download_if_missing=True, return_X_y=False)

Load the 20 newsgroups dataset and vectorize it into token counts (classification).

Download it if necessary.

Parameters:

`subset : 'train', 'test' or 'all', optional`
Select the dataset to load: 'train' for the training set, 'test' for the test set, 'all' for both, with shuffled ordering.

`remove : tuple`
May contain any subset of ('headers', 'footers', 'quotes'). Each of these is a kind of text that will be detected and removed from the newsgroup posts, preventing classifiers from overfitting on metadata. 'headers' removes newsgroup headers, 'footers' removes blocks at the ends of posts that look like signatures, and 'quotes' removes lines that appear to be quoting another post.

`data_home : optional, default: None`
Specify a download and cache folder for the datasets. If None, all scikit-learn data is stored in '~/scikit_learn_data' subfolders.

`download_if_missing : optional, True by default`
If False, raise an IOError if the data is not locally available instead of trying to download the data from the source site.

`return_X_y : boolean, default=False`
If True, returns `(data.data, data.target)` instead of a Bunch object. New in version 0.20.
Returns:

`bunch : Bunch object`
bunch.data: sparse matrix, shape [n_samples, n_features]
bunch.target: array, shape [n_samples]
bunch.target_names: list, length [n_classes]
bunch.DESCR: a description of the dataset.

`(data, target) : tuple if return_X_y is True`
New in version 0.20.
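
The parameters and return values described above can be exercised with a short sketch. The helper names `is_cached` and `load_xy` are hypothetical and not part of scikit-learn; they only wrap the documented call. The sketch assumes scikit-learn is installed and treats the documented IOError as `OSError`, which is the same exception class in Python 3.

```python
from sklearn.datasets import fetch_20newsgroups_vectorized


def is_cached(data_home=None):
    """Hypothetical helper: report whether the dataset is available locally.

    Note that this loads the full dataset when it is present, so it is
    not a cheap check.
    """
    try:
        # download_if_missing=False makes the call fail instead of
        # fetching the data from the source site.
        fetch_20newsgroups_vectorized(
            data_home=data_home, download_if_missing=False
        )
    except OSError:  # IOError is an alias of OSError in Python 3
        return False
    return True


def load_xy(subset="train", remove=(), data_home=None):
    """Hypothetical helper: return (X, y, target_names) for one subset."""
    bunch = fetch_20newsgroups_vectorized(
        subset=subset, remove=remove, data_home=data_home
    )
    # bunch.data is a sparse matrix of shape (n_samples, n_features);
    # bunch.target is an array of shape (n_samples,).
    return bunch.data, bunch.target, list(bunch.target_names)
```

Calling `load_xy("train", remove=("headers", "footers", "quotes"))` downloads the data on first use and reads it from the cache afterwards. On scikit-learn 0.20 or later, passing `return_X_y=True` directly to `fetch_20newsgroups_vectorized` gives the `(data, target)` pair without going through the Bunch.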