tokenizer nil pointer dereference error with specific input text #13

Open
alkuma opened this issue Sep 17, 2024 · 2 comments


alkuma commented Sep 17, 2024

I am getting a nil pointer dereference with specific input texts. I created a test at https://github.com/alkuma/tokenizerissue to demonstrate the issue.

Two strings are embedded: the first one goes through, but the second one fails.

Here is the output of the program:

/usr/local/go/bin/go tool test2json -t /home/alok/.cache/JetBrains/GoLand2024.2/tmp/GoLand/___1TestEmbedding_in_tokenizerissue.test -test.v=test2json -test.paniconexit0 -test.run ^\QTestEmbedding\E$
2024/09/17 15:19:17 INFO: CachedDir="/home/alok/.cache/tokenizer"
=== RUN   TestEmbedding
AS YOU LIKE IT
DRAMATIS PERSONAE
DUKE SENIOR     living in banishment.
DUKE FREDERICK  his brother, an usurper of his dominions.
AMIENS  | |  lords attending on the banished duke. JAQUES       |
LE BEAU a courtier attending upon Frederick.
CHARLES wrestler to Frederick.
OLIVER          | | JAQUES (JAQUES DE BOYS:)    |  sons of Sir Rowland de Boys. | ORLANDO               |
ADAM    | |  servants to Oliver. DENNIS |
TOUCHSTONE      a clown.
SIR OLIVER MARTEXT      a vicar.
CORIN   |
|  shepherds.
SILVIUS |
WILLIAM a country fellow in love with Audrey.
A person representing HYMEN. (HYMEN:)
ROSALIND        daughter to the banished duke.
CELIA   daughter to Frederick.
673
AS YOU LIKE IT
DRAMATIS PERSONAE
DUKE SENIOR     living in banishment.
DUKE FREDERICK  his brother, an usurper of his dominions.
AMIENS  | |  lords attending on the banished duke. JAQUES       |
LE BEAU a courtier attending upon Frederick.
CHARLES wrestler to Frederick.
OLIVER          | | JAQUES (JAQUES DE BOYS:)    |  sons of Sir Rowland de Boys. | ORLANDO               |
ADAM    | |  servants to Oliver. DENNIS |
TOUCHSTONE      a clown.
SIR OLIVER MARTEXT      a vicar.
CORIN   |
|  shepherds.
SILVIUS |
WILLIAM a country fellow in love with Audrey.
A person representing HYMEN. (HYMEN:)
ROSALIND        daughter to the banished duke.
CELIA   daughter to Frederick.

673
[[-0.03914698 -0.016538322 0.023178136 0.011642908 0.0564912 0.007894647 0.025433742 0.035716955 -0.019091763 -0.020351494 -0.01514089 0.012821367 -0.054411728 0.032000016 -0.018131282 0.08451082 0.049022898 0.017406626 0.017041788 -0.04623328 0.035323188 0.032879945 -0.00037763309 0.06970637 0.043458235 0.016123137 0.018936096 0.02063325 -0.05700099 -0.025971899 0.035846584 -0.011900318 -0.019333359 -0.02900332 -0.008514504 -0.04624489 -0.014379387 -0.020486495 0.011699142 0.006409424 -0.07082859 -0.013937539 0.005432031 -0.005396163 -0.06794449 -0.010348638 -0.036275588 0.017526548 -0.0058838953 -0.024287518 -0.048917953 0.039226435 -0.009778961 0.012310327 -0.023821943 0.022571076 -0.032029998 -0.04378382 0.028337928 -0.05277187 0.0017444869 0.006742158 0.0372529 0.0010056419 -0.00060814136 0.006685288 1.7267339e-05 0.06339974 -0.059310623 -0.021789877 -0.015668927 -0.0010967483 0.0025227757 -0.008613313 0.027283266 -0.063864276 -0.004863343 0.027290193 0.056868467 0.023454076 -0.016763298 -0.0054399003 0.006188793 0.048575606 -0.026876703 -0.03610761 -0.00043335767 0.0003789369 -0.081276685 0.039349385 -0.008940972 -0.049587313 0.03761981 0.042609006 -0.0079110265 -0.0035721026 0.051356286 -0.0044129933 0.0015475124 0.021560663 -0.031360585 -0.038248923 0.016443029 -0.00034044942 -0.09568959 -0.01081435 -0.009083789 0.03689808 0.0075237486 0.027368983 -0.005172214 -0.009384759 0.009782874 0.021921514 -0.045917857 0.06161481 0.043410208 -0.013445491 -0.0077494937 0.023404986 0.06215935 0.020469893 -0.0072953156 0.10336666 0.024090728 0.020731067 -0.011793335 0.05349762 -0.013003309 -0.06273616 0.005801809 0.05778524 -0.01770478 -0.010948908 -0.001877506 -0.042103637 0.014173885 -0.0043255757 0.035545927 0.010555955 0.0022608016 -0.01118194 0.013429064 0.020862108 0.043351796 -0.030716933 -0.017455425 -0.041980993 0.012616989 0.048666902 -0.012770984 0.031477388 0.030343024 -0.036949676 0.012917157 0.058238946 0.029963208 0.026458332 -0.021398345 0.02069551 
0.03828373 0.027362816 0.01144157 0.01773172 0.022156883 -0.010513427 0.0060478807 0.01718434 -0.04091684 0.021870496 -0.072977014 -0.016520886 0.06187403 -0.041432407 -0.018332282 0.039845582 0.09559854 0.041445937 0.04684613 -0.0069881086 -0.07639642 0.040348083 0.02335101 -0.0046012397 0.026520465 -0.007890131 0.052469924 0.010194702 0.014858035 0.03759209 -0.06054353 -0.064428255 -0.02382746 0.0030103163 0.045369398 -0.019959413 -0.004659652 0.03708061 -0.0038971528 0.06279973 -0.012340761 0.020642685 0.04060641 -0.006453256 -0.061737575 0.018255593 -0.001301643 0.024874456 0.032203175 -0.011828143 -0.03851288 -0.012062547 0.0664155 -0.049623474 0.02326029 -0.015502906 0.052531365 -0.037508376 0.017440705 -0.0735822 0.025554365 -0.012990718 -0.041517846 0.023062894 -0.024853319 0.100043885 0.056865305 -0.05963884 -0.04027259 0.024926204 -0.01888787 -0.025096748 -0.0013074251 -0.01325122 -0.010748644 0.011728527 0.004855801 -0.046975892 0.03411985 -0.056537498 0.0056181317 0.053715814 0.011858979 0.079618104 0.017376544 0.01665108 0.034709867 0.0006663871 -0.056170613 -0.02711519 -0.0014543701 -0.03524299 0.0075247423 -0.022341667 0.008779559 -0.05686332 -0.032249086 0.049802165 0.03996286 0.05161114 -0.042233754 0.014376176 -0.016475571 -0.018463984 -0.013941526 -0.036131732 -0.037772164 -0.012133741 0.033861097 -0.005092063 0.02904495 -0.002741557 -0.0012500169 0.004346772 0.005123095 8.7455184e-05 0.072036505 0.00032354923 0.024320055 -0.039208207 -0.01390895 0.074305646 -0.047137924 -0.03887461 0.001901088 -0.10452757 0.03541344 -0.051450636 -0.039202023 0.0037135687 0.038421314 0.037239667 0.030913997 0.00741533 0.03195537 0.00699422 0.0046604634 0.035519995 -0.015194695 -0.0059102173 -0.0125123635 -0.0060820356 0.013914759 -0.0015158656 0.02122563 -0.02741586 0.0085247895 -0.031034654 -0.26160786 0.0021040207 0.034374084 -0.040845644 0.049236394 -0.019883346 0.040674936 -0.03596126 -0.05063188 0.035107706 -0.029123846 -0.0457412 0.010796176 0.042915713 
0.039227314 0.015756665 -0.018854285 -0.045812745 -0.009172312 0.037980657 -0.021215655 -0.054714758 -0.030052204 0.024671923 0.025940834 0.059799552 -0.050287012 -0.0030677565 -0.086370826 -0.03388499 0.0021782645 0.0038881723 0.033259008 -0.015950117 -0.0035676898 -0.041105423 0.04366649 -0.0068972823 -0.021965034 0.0011830716 -0.02629944 -0.044561606 -0.023651939 0.009472122 0.05867902 -0.016693924 -0.06414829 -0.0066306265 -0.03840866 0.06758065 -0.02788127 -0.011591923 -0.005063631 0.002926456 -0.0056525413 0.0028303913 -0.0055221547 0.01315445 -0.06290255 -0.04398332 -0.012094841 -0.04480738 -0.041383084 -0.023820942 -0.008420631 -0.057843395 -0.04899028 -0.01342248 0.09446904 0.038170658 -0.040533535 -0.007015521 0.01192337 -0.08365915 0.0017968907 -0.0025380466 -0.009427974 -0.009932006 0.00026163805 0.02594476 -0.030752674 -0.026657093 0.021098124 0.008863274 -0.006488929 -0.03697985 0.023044562 -0.02419331 -0.036591124 -0.024120301 0.06960363 -0.010372081 -0.025158368 -0.013693026 0.01300504 0.02227767 -0.0015247545 -0.015730513 0.0238872 0.01825556 0.0370508 -0.074274346 0.033484608 -0.0060399654 0.0067823497 -0.0060035777 -0.05207626 0.029591309 0.03991352 0.017776724 0.056803543 -0.0036727912 0.034457386 -0.046009373 0.00023433224 -0.071260884 0.02851205 0.07166555 0.0063079665 0.038949873 -0.05573132 0.041894786 -0.036953613 -0.020935554 -0.0922639 -0.012961587 0.009381917 -0.011597907 -0.019261444 -0.00639877 -0.004511787 -0.0033551345 0.027393656 -0.024261534 0.017353045 0.00080475234 -0.05555794 -0.052705985 -0.0014381633 0.0018840844 0.021800129 0.010827761 -0.0063026166 0.03353285 0.044599503 0.0077848737 -0.0029283045 -0.00039049625 0.018483976 0.035987716 0.005219433 0.003641071 0.030400632 -0.059704714 -0.021531524 -0.032892182 0.013581656 -0.006007797 0.008786557 -0.02286594 -0.02111237 -0.04407928 -0.025530605 -0.0068782447 0.0074550346 0.062660806 -0.010601268 -0.010685531 -0.015256402 0.019312108 0.025710458 0.014006963 -0.045301154 
0.01740028 -0.009736621 0.0066993353 -0.022136973 0.013612366 0.05849686 0.029680526 0.001417695 -0.03254062 -0.0018819447 0.0041718297 0.06276969 0.035705727 0.005127659 -0.06511382 -0.0036923448 -0.0047796667 -0.0006886609 0.028202135 -0.03349943 0.013126994 -0.057374008 -0.07305299 0.02789206 -0.0026524563 0.024118802 0.010876676 0.016884591 -0.006562245 -0.04699496 0.028407542 0.043413766 -0.072359815 0.061121542 0.0023021614 -0.009506745 0.017742652 -0.011882974 -0.051569894 -0.0032277745 0.013072393 0.0252644 -0.06367772 -0.012006346 -0.039752934 0.016992357 -0.01946568 0.017556485 -0.039766937 -0.015146741 0.0043553817 -0.03300536 0.041409392 -0.029696869 -0.034427825 0.03265753 -0.033445444 0.029599441 -0.015332254 0.0038055116 0.04395136 -0.019857742 -0.0037471876 -0.019987168 -0.027075827 0.0051693665 0.057406757 0.033968635 0.018858982 -0.032702416 -0.02568262 -0.015521807 0.02559059 0.011727608 -0.017817227 0.0022101407 0.04306708 0.0001521992 -0.002650939 -0.021742256 -0.012054737 0.068472214 -0.047306042 -0.014674873 0.017066197 -0.051577978 0.030212536 0.002544334 0.02917181 -0.019093212 0.02930066 0.05152553 0.009152614 0.029787736 0.0011963875 0.052472897 -0.0361598 0.00010058674 -0.06904818 0.016232267 -0.0039677448 0.011245551 0.013937295 -0.015575298 -0.046503574 0.06782438 -0.08391851 -0.026548455 0.04568559 -0.030084113 0.010012481 0.020641306 -0.069049835 0.0027308327 0.021092122 -0.03908603 0.0064549767 0.014999664 0.052215375 0.0031571654 0.02453982 0.015449896 -0.009599123 0.054865893 0.038270622 0.008379506 0.05169393 -0.0635431 0.05361065 0.027451267 -0.02504078 -0.0318296 0.021326253 -0.008771796 -0.07166529 0.0046098814 0.008210814 -0.012494197 -0.07983677 0.0322951 0.016638167 -0.027372014 -0.04498509 -0.0115331495 -0.026469693 -0.03370635 0.000676141 0.011307931 -0.011655599 0.06414379 0.018598035 0.025064886 0.063107245 -0.017471809 0.037015863 -0.0041355346 0.09167845 0.06278827 0.049575448 -0.032504965 0.094415836 -0.0070365896 
-0.06828078 0.03029201 0.03385621 -0.023417555 -0.019534213 0.008425382 0.058012586 0.0021701755 0.050336093 -0.013609865 -0.011643509 -0.0058129276 -0.0142343035 0.04619372 0.015765378 0.028137436 0.038674865 0.018905077 -0.06938297 0.039243255 0.020575562 -0.027785309 0.0044124466 -0.041977398 0.033078786 0.0023755538 0.0013827555 0.080165684 0.021713875 -0.008895852 0.010854239 0.030240793 0.010076886 -0.0068736626 -0.010659401 0.0091342125 -0.016192537 -0.03269065 0.0015859033 0.014045188 -0.005773467 0.025777139 -0.03233787 0.0020606334 0.022983052 0.036939822 -0.043826174 -0.04531051 -0.052388918 -0.048537176 -0.05221436 -0.023132278 -0.008065607 -0.041005827 -0.048821874 -0.018616289 -0.036834672 -0.0131818615 0.00032311416 -0.0608724 -0.0473172 0.017388172 0.03620469 0.016872536 0.009612658 0.06283182 0.0266591 -0.0407606 -0.018680993 0.009808718 0.045869667 0.0017224478 0.020221831 -0.106909215 0.032913286 0.045634817 -0.011272518 -0.07594389 0.03301969 -0.014931814 -0.03439635 0.051964276 0.014607602 -0.0019748472 -0.031476032 -0.014223328 0.0025003545 0.010445406 0.049866706 -0.060485397 0.08876377 0.033138666 0.01942703 -0.052508734 0.015518047 0.0050181053 0.023438185 -0.06435748 -0.007261127 -0.009940068 -0.08559045 -0.02445086 0.01683098 -0.041163374 -0.044273637 0.017937073 -0.023909848 0.0026623239 0.019933624 -0.022201682 -0.029950371 -0.032257035 -0.0068081166 -0.043268044 0.032621004 0.02144448 -0.0013739939 0.019817922 -0.052019957 -0.0036603028 -0.009124586 -0.009007775 0.01633006 0.0038869274 0.010353903]]
AS YOU LIKE IT
DRAMATIS PERSONAE
DUKE SENIOR     living in banishment.
DUKE FREDERICK  his brother, an usurper of his dominions.
AMIENS  | |  lords attending on the banished duke. JAQUES       |
LE BEAU a courtier attending upon Frederick.
CHARLES wrestler to Frederick.
OLIVER          | | JAQUES (JAQUES DE BOYS:)    |  sons of Sir Rowland de Boys. | ORLANDO               |
ADAM    | |  servants to Oliver. DENNIS |
TOUCHSTONE      a clown.
SIR OLIVER MARTEXT      a vicar.
CORIN   |
|  shepherds.
SILVIUS |
WILLIAM a country fellow in love with Audrey.
A person representing HYMEN. (HYMEN:)
ROSALIND        daughter to the banished duke.
CELIA   daughter to Frederick.
PHEBE   a shepherdess.
AUDREY  a country wench.
Lords, pages, and attendants, &c. (Forester:) (A Lord:) (First Lord:) (Second Lord:) (First Page:) (Second Page:)
835
AS YOU LIKE IT
DRAMATIS PERSONAE
DUKE SENIOR     living in banishment.
DUKE FREDERICK  his brother, an usurper of his dominions.
AMIENS  | |  lords attending on the banished duke. JAQUES       |
LE BEAU a courtier attending upon Frederick.
CHARLES wrestler to Frederick.
OLIVER          | | JAQUES (JAQUES DE BOYS:)    |  sons of Sir Rowland de Boys. | ORLANDO               |
ADAM    | |  servants to Oliver. DENNIS |
TOUCHSTONE      a clown.
SIR OLIVER MARTEXT      a vicar.
CORIN   |
|  shepherds.
SILVIUS |
WILLIAM a country fellow in love with Audrey.
A person representing HYMEN. (HYMEN:)
ROSALIND        daughter to the banished duke.
CELIA   daughter to Frederick.
PHEBE   a shepherdess.
AUDREY  a country wench.
Lords, pages, and attendants, &c. (Forester:) (A Lord:) (First Lord:) (Second Lord:) (First Page:) (Second Page:)

835
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x730074]

goroutine 35 [running]:
github.com/sugarme/tokenizer.(*Encoding).GetIds(...)
	/path/to/go/pkg/mod/github.com/sugarme/[email protected]/encoding.go:215
github.com/sugarme/tokenizer.TruncateEncodings(0xc0009d2d00, 0x0, 0xc0009d2c30?)
	/path/to/go/pkg/mod/github.com/sugarme/[email protected]/util.go:108 +0x54
github.com/sugarme/tokenizer.(*Tokenizer).PostProcess(0xc000465680, 0xc0009d2d00?, 0x0?, 0x1)
	/path/to/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:602 +0xe5
github.com/sugarme/tokenizer.(*Tokenizer).Encode(0xc000465680, {0x7a1e20, 0xc0003121e0}, 0x1)
	/path/to/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:464 +0x2e5
github.com/sugarme/tokenizer.(*Tokenizer).EncodeBatch.func1(0x0)
	/path/to/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:647 +0x90
created by github.com/sugarme/tokenizer.(*Tokenizer).EncodeBatch in goroutine 34
	/path/to/go/pkg/mod/github.com/sugarme/[email protected]/tokenizer.go:644 +0xf5


Process finished with the exit code 1

The first chunk has 634 characters and is embedded successfully. The next chunk has 835 characters (i.e., the first 634 characters plus an additional 201) and fails with the tokenizer nil pointer dereference error.

Has anybody faced this before? Is it a known issue and, if so, is there a way to work around it?

Please let me know if any additional information is required.

To execute the test, follow these steps:

  1. Clone the https://github.com/alkuma/tokenizerissue repository with git clone.
  2. Set ONNX_PATH to the correct value for your machine.
  3. Run the test TestEmbedding in the file embed_test.go; it should reproduce the error.

alkuma commented Sep 17, 2024

A similar issue was reported on the tokenizer side and closed via a code change (PR), so I forked both tokenizer and fastembed-go, published their latest master / main branches, and used the forks as dependencies. With that, the error is gone.

Perhaps all that is needed is to publish new releases of both?

@Anush008
Owner

@alkuma, I'd recommend keeping your project dependent on your fork. It gives you the flexibility to add any changes you need.
As far as I can see, neither fastembed-go nor https://github.com/sugarme/tokenizer is under active maintenance.
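
One way to depend on a fork, sketched here as an assumption rather than as the setup actually used in this thread, is a replace directive in the go.mod of the project that imports fastembed-go. The local path ../tokenizer is hypothetical: it stands for a clone of a tokenizer fork that contains the fix and keeps the original module path in its own go.mod.

```
// go.mod fragment of the consuming project (not of fastembed-go itself).
// Assumption: ../tokenizer is a local clone of a fork of
// github.com/sugarme/tokenizer that includes the fix.
replace github.com/sugarme/tokenizer => ../tokenizer
```

Replace directives are honored only in the main module's go.mod, and they apply to the whole build, so the tokenizer pulled in transitively through fastembed-go is replaced as well. Whether fastembed-go builds unchanged against the newer tokenizer code is not confirmed here.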
