model.clusters returns dict of empty lists #42

Open
gaardhus opened this issue Nov 14, 2021 · 1 comment

gaardhus commented Nov 14, 2021

I'm trying to apply the model to a corpus of 5352 documents, following the tutorial notebook. After running model.fit() I can plot my results as a graph, see topic distributions per document, and get valid output from model.clustering_query. However, when I run model.clusters I get a dict with N topics as keys, but every value is an empty list:

>>> model.clusters(l=1, n=5)

{0: [],
 1: [],
 2: [],
 3: [],
 4: [],
 5: [],
 6: [],
 7: [],
 8: [],
 9: [],
 10: [],
 11: [],
 12: [],
 13: [],
 14: [],
 15: [],
 16: []}

Any known reasons as to why this may happen, or do I need to provide more info? I've installed graph-tool on my Windows system through Docker.

Edit:
After looking further into the source code for model.clusters, I see that the problem is that one of the objects contains NaN values; recoding the NaNs to 0s solved my problem. The problem seems to originate in the model.get_groups() method, though I haven't had time to debug that yet.

def clusters(self,l=0,n=10):
    '''
    Get n 'most common' documents from each document cluster.
    most common refers to largest contribution in group membership vector.
    For the non-overlapping case, each document belongs to one and only one group with prob 1.

    '''
    # dict_groups = self.groups[l]
    dict_groups = self.get_groups(l=l)
    Bd = dict_groups['Bd']
    p_td_d = dict_groups['p_td_d']
    p_td_d = np.nan_to_num(p_td_d, nan=0.0) # <----- This solved my issue

    docs = self.documents
    ## loop over all document groups
    dict_group_docs = {}
    for td in range(Bd):
        p_d_ = p_td_d[td,:]
        ind_d_ = np.argsort(p_d_)[::-1]
        list_docs_td = []
        for i in ind_d_[:n]:
            if p_d_[i] > 0:
                list_docs_td+=[(docs[i],p_d_[i])]
            else:
                break
        dict_group_docs[td] = list_docs_td
    return dict_group_docs
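
A minimal sketch of why a single NaN column seems to empty every cluster, assuming the selection loop behaves as in the code above. The toy matrix and the helper name top_docs are made up for illustration and are not part of sbmtm. np.argsort places NaN last, so after the [::-1] reversal the NaN entry comes first; NaN > 0 is False, so the loop breaks before any document is collected.

```python
import numpy as np

# Toy membership matrix (hypothetical data): 2 document groups (rows)
# x 3 documents (columns). Document 2 is NaN in every row, as happens
# when its row sum in the normalization was zero.
p_td_d = np.array([[1.0, 0.0, np.nan],
                   [0.0, 1.0, np.nan]])

def top_docs(p_d_, n):
    """Same selection loop as in clusters(), returning document indices."""
    picked = []
    for i in np.argsort(p_d_)[::-1][:n]:
        if p_d_[i] > 0:
            picked.append(int(i))
        else:
            break  # a NaN in first position ends the loop immediately
    return picked

with_nan = [top_docs(row, n=2) for row in p_td_d]
without_nan = [top_docs(row, n=2) for row in np.nan_to_num(p_td_d, nan=0.0)]

print(with_nan)     # every group comes back empty
print(without_nan)  # each group recovers its member document
```

If this reading is right, one document without any group counts would be enough to produce the all-empty dict shown earlier.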

The error pertains to this warning:

/home/user/sbmtm.py:547: RuntimeWarning: invalid value encountered in true_divide
  p_td_d = (n_db/np.sum(n_db,axis=1)[:,np.newaxis]).T
/home/user/sbmtm.py:553: RuntimeWarning: invalid value encountered in true_divide
  p_tw_d = (n_dbw/np.sum(n_dbw,axis=1)[:,np.newaxis]).T
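
The warning suggests that at least one row of n_db sums to zero, so the division is 0/0. As a sketch of one possible alternative to patching the NaNs afterwards, the normalization could skip zero-sum rows. The function name safe_row_normalize and the toy counts are assumptions for illustration, not sbmtm code.

```python
import numpy as np

def safe_row_normalize(counts):
    """Divide each row by its sum; rows that sum to zero stay all-zero."""
    counts = np.asarray(counts, dtype=float)
    row_sums = counts.sum(axis=1, keepdims=True)
    out = np.zeros_like(counts)
    # Only divide where the row sum is positive; other entries keep the 0 from `out`.
    np.divide(counts, row_sums, out=out, where=row_sums > 0)
    return out

# Hypothetical counts: 3 documents (rows) x 2 groups (columns).
# The middle document has no counts at all.
n_db = np.array([[2, 2],
                 [0, 0],
                 [1, 3]])

p_td_d = safe_row_normalize(n_db).T
print(p_td_d)
```

This avoids the RuntimeWarning, but it only hides the symptom; the zero-sum row itself still points at a node without edges.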

fvalle1 commented Jul 21, 2022

This may happen if a node has no links. Have you verified that all nodes (both words and documents) have at least one edge?
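
One rough way to check the document side before building the graph, assuming the corpus is a list of token lists with a parallel list of titles, as in the tutorial notebook. The helper find_isolated and the sample data are hypothetical; a document with no tokens would end up as a node with no edges.

```python
def find_isolated(texts, titles):
    """Return titles of documents whose token list is empty."""
    return [title for title, tokens in zip(titles, texts) if len(tokens) == 0]

# Hypothetical corpus: the second document lost all tokens in preprocessing.
texts = [["graph", "topic"], [], ["model", "topic", "cluster"]]
titles = ["doc0", "doc1", "doc2"]

isolated = find_isolated(texts, titles)
print(isolated)

# Drop the empty documents before passing the corpus to the model.
kept = [(t, toks) for t, toks in zip(titles, texts) if toks]
```

Any word filtering applied when the graph is built could still leave further documents without edges, so this check is a first pass rather than a guarantee.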
