model.clusters returns dict of empty lists #42

gaardhus · 2021-11-14T17:40:44Z

I'm trying to fit the model on a corpus of 5352 documents, following the tutorial notebook. After running model.fit() I can plot my results as a graph, see topic distributions per document, and get valid output from model.clustering_query. However, when I run model.clusters I get a dict with one key per document group, but every value is an empty list:

>>> model.clusters(l=1, n=5)

{0: [],
 1: [],
 2: [],
 3: [],
 4: [],
 5: [],
 6: [],
 7: [],
 8: [],
 9: [],
 10: [],
 11: [],
 12: [],
 13: [],
 14: [],
 15: [],
 16: []}

Any known reasons as to why this may happen, or do I need to provide more info? I've installed graph-tool on my Windows system through Docker.

Edit:
After looking further into the source code of model.clusters, I see that the problem is that one of the arrays (p_td_d) contains NaN values; recoding the NaNs to 0 solved my problem. The problem therefore seems to originate in the model.get_groups() method, though I haven't had time to debug that yet.

def clusters(self,l=0,n=10):
    '''
    Get n 'most common' documents from each document cluster.
    most common refers to largest contribution in group membership vector.
    For the non-overlapping case, each document belongs to one and only one group with prob 1.

    '''
    # dict_groups = self.groups[l]
    dict_groups = self.get_groups(l=l)
    Bd = dict_groups['Bd']
    p_td_d = dict_groups['p_td_d']
    p_td_d = np.nan_to_num(p_td_d, nan=0.0) # <----- This solved my issue

    docs = self.documents
    ## loop over all word-groups
    dict_group_docs = {}
    for td in range(Bd):
        p_d_ = p_td_d[td,:]
        ind_d_ = np.argsort(p_d_)[::-1]
        list_docs_td = []
        for i in ind_d_[:n]:
            if p_d_[i] > 0:
                list_docs_td+=[(docs[i],p_d_[i])]
            else:
                break
        dict_group_docs[td] = list_docs_td
    return dict_group_docs
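
To illustrate why a single NaN empties a whole list, here is a minimal sketch that replays the selection loop from clusters() on one made-up row of p_td_d. The array values and the helper name top_docs are hypothetical and not part of sbmtm. np.argsort places NaN last, so after the [::-1] reversal the NaN entry comes first, the `> 0` test fails on it, and the loop breaks before any document is added.

```python
import numpy as np

# One row of a hypothetical p_td_d matrix: document 2 has a NaN membership.
p_d_ = np.array([0.2, 0.7, np.nan, 0.1])

# np.argsort places NaN last, so reversing the order puts it first.
ind_d_ = np.argsort(p_d_)[::-1]


def top_docs(p, n):
    """Replay the selection loop from clusters(), returning document indices."""
    selected = []
    for i in np.argsort(p)[::-1][:n]:
        if p[i] > 0:  # NaN > 0 is False, so a leading NaN triggers the break
            selected.append(int(i))
        else:
            break
    return selected


before = top_docs(p_d_, n=3)                         # nothing is selected
after = top_docs(np.nan_to_num(p_d_, nan=0.0), n=3)  # NaN recoded to 0
print(before, after)
```

Because a document with NaN membership has NaN in every row of p_td_d, the same early break would occur for every group, which matches the all-empty dict above.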

The problem appears to be related to these warnings:

/home/user/sbmtm.py:547: RuntimeWarning: invalid value encountered in true_divide
  p_td_d = (n_db/np.sum(n_db,axis=1)[:,np.newaxis]).T
/home/user/sbmtm.py:553: RuntimeWarning: invalid value encountered in true_divide
  p_tw_d = (n_dbw/np.sum(n_dbw,axis=1)[:,np.newaxis]).T
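
As a hedged sketch of what these warnings probably mean: if a row of n_db sums to zero, the normalization computes 0/0 and yields NaN. The count matrix below is made up, and the guarded division is only one possible workaround, not the library's own fix.

```python
import numpy as np

# Hypothetical document-by-group counts; document 1 has no counts at all.
n_db = np.array([[3.0, 1.0],
                 [0.0, 0.0],
                 [2.0, 2.0]])

row_sums = n_db.sum(axis=1, keepdims=True)

# Plain division reproduces the situation behind the warning:
# 0/0 yields NaN for the all-zero row (the warning is silenced here).
with np.errstate(invalid="ignore"):
    p_naive = (n_db / row_sums).T

# Guarded division: rows with a zero total stay at 0 instead of becoming NaN.
p_safe = np.divide(n_db, row_sums, out=np.zeros_like(n_db), where=row_sums > 0).T

print(p_naive)
print(p_safe)
```

Whether an all-zero membership vector is the right result for such a document is a modelling decision; it may be cleaner to drop the document before fitting.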


fvalle1 · 2022-07-21T10:27:28Z

This may happen if a node has no links. Have you verified that all nodes (both words and documents) have at least one edge?
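
One rough way to check the document side before building the graph, assuming the corpus is held as a list of token lists with a parallel list of titles (the helper name find_empty_documents and the sample data are hypothetical):

```python
def find_empty_documents(texts, titles=None):
    """Return (index, title) pairs for documents that contain no tokens."""
    if titles is None:
        titles = [str(i) for i in range(len(texts))]
    return [(i, titles[i]) for i, tokens in enumerate(texts) if len(tokens) == 0]


# Made-up corpus: the second document lost all its tokens during preprocessing.
texts = [["topic", "model"], [], ["graph", "tool", "graph"]]
titles = ["doc_a", "doc_b", "doc_c"]

empty = find_empty_documents(texts, titles)

# Keep only documents that would get at least one edge in the graph.
kept = [(title, tokens) for title, tokens in zip(titles, texts) if tokens]
kept_titles = [title for title, _ in kept]
kept_texts = [tokens for _, tokens in kept]

print(empty, kept_titles)
```

This only covers documents that are empty on input. If words are filtered out while the graph is built (for example by a minimum-count threshold), a document could still end up without edges, so the check would need to run on the filtered tokens.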

Juan-Mateos mentioned this issue on Jan 11, 2022 in nestauk/industrial_taxonomy#27, "glass clustering flow + util scripts" (merged).
