
What are the relevant data of the error logs? #4

Open
jualvespereira opened this issue Apr 25, 2019 · 5 comments

Comments

@jualvespereira
Collaborator

Some relevant information that should be considered for clustering:

  • 'make.: *.+'
  • '. 1:.+'
  • '. error:.+'
  • '. undefined reference.+'
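Applied with re.search, these patterns can act as a line filter over a build log; a minimal sketch in Python (the sample log lines below are invented for illustration):

```python
import re

# Patterns proposed above for extracting the relevant error lines
PATTERNS = [
    r'make.: *.+',
    r'. 1:.+',
    r'. error:.+',
    r'. undefined reference.+',
]

def relevant_lines(log):
    """Keep only the log lines matching at least one pattern."""
    return [line for line in log.splitlines()
            if any(re.search(p, line) for p in PATTERNS)]

# Hypothetical compilation log for illustration
log = ("CC      kernel/fork.o\n"
       "kernel/fork.c:312:5: error: 'foo' undeclared\n"
       "ld: fork.o: undefined reference to `bar'")
print(relevant_lines(log))
```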

@FAMILIAR-project
Contributor

There are two radical approaches for clustering:

  • fully automated with the usual challenges (dealing with too much noise, with almost similar yet different content, with clusters that may "subsume" other clusters, etc.)
  • manual with pattern matching: yes, but you need to know what to search

Of course, the solution is neither completely black nor completely white...
A hybrid approach is to find "generic" patterns (like @jualvespereira proposes).
Another is to use knowledge we gather throughout the review of failures.

For instance, I have come across some patterns and I implemented some ad-hoc regexes, something like:

    for err in err_logs_configuration(cid).splitlines():
        if "read_overflow2" in err:
            print(err)

maybe we can have pre-defined regexes for labelling failures... and fully automated techniques for the rest.

Final remark: we may have more than one cluster attached to a failure -- see this failure #1 (comment)
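A minimal sketch of such pre-defined labelling, where one failure can receive several labels (the label names and most regexes here are invented for illustration; only "read_overflow2" comes from the snippet above):

```python
import re

# Hypothetical label -> regex table; extend as new failure patterns are reviewed
LABELS = {
    "read_overflow": r"read_overflow2",
    "undefined_ref": r"undefined reference",
    "compiler_error": r"\berror:",
}

def label_failure(err_log):
    """Return every label whose regex matches: a failure may belong to several clusters."""
    return {name for name, rx in LABELS.items() if re.search(rx, err_log)}

# Hypothetical error log that triggers two labels at once
log = "fork.c:1: error: detected read_overflow2 in copy_from_user()"
print(sorted(label_failure(log)))
```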

@jualvespereira
Collaborator Author

I extracted the four pieces of information above and then clustered using brute force.
I should optimize the script, since I have too many clusters, which makes such an algorithm unfeasible. Some ways that come to mind to make it feasible:

  • Detect noisy information by investigating the clusters automatically generated from a sample of cids for each config option in the decision tree, and then ignore that information.
  • Ignore the error order.
  • Sort the errors first before grouping.
  • Compute clusters of a small sample of randomly chosen configs.
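Ignoring the error order (or, equivalently, sorting the errors before grouping) can be sketched by keying clusters on the *set* of errors per configuration; the cids and error strings below are invented for illustration:

```python
from collections import defaultdict

def cluster_ignoring_order(errors_by_cid):
    """Group configurations whose error sets are equal, regardless of order."""
    clusters = defaultdict(list)
    for cid, errors in errors_by_cid.items():
        # frozenset discards both order and duplicates
        clusters[frozenset(errors)].append(cid)
    return clusters

# Hypothetical cid -> extracted errors mapping
errors_by_cid = {
    41: ["error: A", "undefined reference: B"],
    42: ["undefined reference: B", "error: A"],  # same errors, other order
    43: ["error: C"],
}
clusters = cluster_ignoring_order(errors_by_cid)
print(len(clusters))  # 41 and 42 collapse into a single cluster
```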

@FAMILIAR-project
Contributor

Interesting ideas, go ahead!

@jualvespereira
Collaborator Author

I used the data frame created in issue #5 to cluster the errors and got 32 clusters.
I removed the search for 'make.: *.+' errors (which may not be very significant), and 166 cids were left unclassified (i.e., I couldn't narrow down their relevant error information using just '. 1:.+', '. error:.+', '. undefined reference.+'). I'll try TF-IDF and k-means to discover the top terms for clustering.

@jualvespereira
Collaborator Author

jualvespereira commented May 12, 2019

I'm able to cover all error logs after using k-means to discover the top terms for clustering.
For the clustering, I used 4 terms ('.* error:.+', 'undefined reference.+', '.* 1:.+', '.*aicasm.+') and considered the error that comes first in each error message.
We have a total of 16 clusters, and I was able to classify 12 of them by looking at the issues (TuxML/ProjetIrma), a qualitative analysis of the bug, and the decision tree.
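Taking the error that comes first in each message can be sketched by comparing the match positions of the four terms (the sample log is invented for illustration):

```python
import re

# The four terms used for clustering, as listed above
TERMS = [r'.* error:.+', r'undefined reference.+', r'.* 1:.+', r'.*aicasm.+']

def first_error(log):
    """Return the term whose first match appears earliest in the log."""
    hits = []
    for term in TERMS:
        m = re.search(term, log)
        if m:
            hits.append((m.start(), term))
    return min(hits)[1] if hits else None

# Hypothetical log: the 'error:' line precedes the 'undefined reference'
log = "drivers/foo.c:10: error: bar undeclared\nld: undefined reference to baz"
print(first_error(log))
```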
You can find attached a file with further details. For each error log, we have:

  • configuration options responsible for the error
  • number of directly related errors
  • number of indirectly related errors
  • which errors dominate this one
  • cause of the error

logErr_detail.xlsx
