User error or flaw in scheduler?

Comments

36 comments

  • Avatar
    Gordon Scott

    Hm,

     

    I just noticed something else that might be a clue.

    In my example timeouts, the changed goes from 0 to 0x7FFFFFFF, or from 0x1235 to 0x80001234, which suggests a decrement then the sign bit flipped.

    0
    Comment actions Permalink
  • Avatar
    Gordon Scott

    STM32F205Z6      arm-unknown-elf

    0
    Comment actions Permalink
  • Avatar
    Gordon Scott

    This isn't just because I was using 'now' as the wait time. I set it to now+1 and the problem still occurs, at least in it's 0x8000xxxx form.

    0
    Comment actions Permalink
  • Avatar
    Gordon Scott

    Well, I now think the 'funny' values are made in ctl.c at line 98:

      t->timeout = ctl_current_time+ctl_timeslice_period;
    ... which suggests to me that this just marks the timeout as expired.

    Which leaves me back near the start of my problem.  The tasks stops being run, so I don't again call ctl_timeout_wait(), so the task doesn't get run.

    Now I'm back wondering if the task is being blocked, but there are lower priority tasks running, though I think those are all waiting on events, not the timer. Looking for evidence. It's  bit of a pig, because this typically takes some hours to happen :-(

    0
    Comment actions Permalink
  • Avatar
    Gordon Scott

    Hm, OK, or this time after just 5 minutes. Lower priority tasks on ctl_timeout_wait() are still being run.

    0
    Comment actions Permalink
  • Avatar
    Gordon Scott

    Nope, definitely not blocked.

    If I stop execution and manually edit the timer to greater than value in ctl_current_time, the task starts running again.

    Setting it to the same value did not, which might offer a clue. I'll repeat that next time it stops.

    0
    Comment actions Permalink
  • Avatar
    Gordon Scott

    When this task stops, it has the task state set to CTL_STATE_TIMER_WAIT but the timer is at a value like 0x80012345, and timeout_occured == zero, which looks to me like a nonsense condition.  So I'm trying to see how it could get into that state.

     

    Looking through the ctl code I can see that ctl_mutex_lock() and ctl_events_wait(), both of which I definitely use, have somelines:

                type |= CTL_STATE_TIMER_WAIT;  
    followed by:

          ctl_task_executing->state = type;
    which look like they could be candidates to cause the condition. There are others, though I'm not aware of using them.

    The task stopped whilst I had breakpoints on mutex lines like the above and it never broke on them, probably eliminating them.

    Breakpoints on the event lines like those above break fairly frequently, though AFAICS only from another task, and that task is running fine.

    0
    Comment actions Permalink
  • Avatar
    Gordon Scott

    OK, I've captured an incident right at ctl_timeout_wait().

    I think this implies some IRQ still happening despite being between the global On/Off.

    Image attached.

    ctl_task_executing == Ticker_task

    task timeout != parameter timeout

    state == CTL_STATE_TIMER_WAIT.

    0
    Comment actions Permalink
  • Avatar
    Gordon Scott

    Hm,  Or  ctl_task_executing->timeout fails to get set by this code as the 0x8001150d is a timeout two ticks previous.

    0
    Comment actions Permalink
  • Avatar
    Gordon Scott

    Eliminating one possibility ..  ctl_task_executing is the same at the beginning (after disable) and at the break and is the Ticker_task.

    I can get this problem to appear fairly quickly now by working a USART port relatively hard, though it's a fairly slow port. I've failed to find anything in that code to suggest it might be implicated, though there's still time:-/

    0
    Comment actions Permalink
  • Avatar
    Royce Arnold

    Gordon,

    Have you tried this with your testing this with your watch update frequency set to "Never".  I have seen weird stuff happen if the debugger is continuously polling the watch list(s).

    I'm using a STM32F205ZG with CTL and I have not observed anything like you are reporting.  I have multiple high speed UART/USART (115200) running continuously and use ctl_timeout_wait() all over the place in multiple task/threads.  I've never seen the ctl_timeout_wait() be any source of a failure.

    What version of Crossworks are you using?

     

    RA

     

    0
    Comment actions Permalink
  • Avatar
    Gordon Scott

    Hi Royce,

    Watch update frequency is not "never" as I originally needed watch to identify when the problem occurred. I think I can now set it to never. I'll try it.

    Crossworks 2.2.0.2012022100.14129

    It'll be quite irritating if this symptom turns out to be related to Watch, because there is still a problem in the 'release' code with the task, or at least things managed by the task, stopping.

    Trying 'Never'.

    0
    Comment actions Permalink
  • Avatar
    Gordon Scott

    Nope .. still breaking in ctl_timeout_wait() with the nonsense state+timeout combination.

    It's likely fluke, but that's one of the quickest times I've seen it fail.

    0
    Comment actions Permalink
  • Avatar
    Gordon Scott

    I thought I'd try closing the 'threads' window as that always showed 'waiting' or something similar. No change.

    Interesting, though, that I've now had three fails in quite-quick succession. Maybe not fluke.

    FWIW, the USART code has a small IRQ that copies data into a small buffer, then sends an event to a task IRQ_Lower, which unloads the buffer and puts it into a circular buffer for the tasks to access. That means there are both interrupts and ctl calls going on quite frequently.  The problem seems substantially worse with higher traffic, suggesting quite strongly that the USART code is implicated somehow. But I've stuck traps in all the likely places and never see it. :-(

    0
    Comment actions Permalink
  • Avatar
    Royce Arnold

    Gordon,

    A couple of thoughts and some information,

    That's the same version of Crossworks that I'm running.

    My project has the GCC Target set to arm-unknown-eabi.  I do not remember why I changed it from -elf...  I think that it is the new default(?).  I can vaguely remember some issue that this resolved.

    I do initialize CTL with my own code rather than the crl_start_timer() function.  I operate CTL with a 1000Hz clock instead of the default 100Hz clock.

    I also setup the interrupts differently.  I use NVIC_PriorityGroup_4, with PendSV set to 15 and SysTick set to 8. 

    I've also abandoned the SysTick_Handler included with CTL and call ctl_increment_tick_from_isr() from my SysTick_Handler interrupt.

    Make sure you are not calling the ctl_timeout_wait() from within an interrupt.

    Make sure you are calling ctl_enter_isr() and ctl_exit_isr() from your interrupts that use CTL function calls.

    Make sure you can not exit the interrupts without calling the ctl_exit_isr().

     

    Royce

    0
    Comment actions Permalink
  • Avatar
    Gordon Scott

    I've had -eabi suggested wrt a different problem. I think I should give that a try.

    I hacked something to get 1kHz too. Can't off-hand remember what but know I can find it.

    I'll have to explore the IRQ options. I have been wondering if SysTick might be involved with the problem, but have yet to find anything.

    I'm calling with ctl_enter_isr() and ctl_exit_isr() at  the right places and also increment/decrement an irq counter around the call to the event sender, so that lot _should_ be OK :-/

    The trouble with 'should' is that it means nothing :-(

    0
    Comment actions Permalink
  • Avatar
    Royce Arnold

    OK, I just re-read your problem.

    I do not think  is a good idea:  ctl_timeout_wait(ctl_get_current_time()) to ensure the task executes every tick....  Unless you completely disable interrupts, there is a non-zero (but small) chance than an interrupts could happen during the time that function executes.  This could create a condition where the being checked has already occurred.  That probably leaves the could waiting got the next occurrence to occur in the far future (2^32 ticks).  You could try testing this when in failure mode by manually changing the value.

    I always add 1 tick period at a minimum to a ctl_get_current_time() call.   However, for what you are trying to do, that probably will not work.

     

    The reason that I dumped the standard CTL SysTick_Handler was to do something similar to what you are doing.  For example, I have a series of counters that are incremented/decremented on each execution of SysTick that I use for timing throughout the my program.  I think you can also do something with userTimerISR() which CTL calls every tick.

     

    You might be able to keep your task as designed, but modify the SysTick_Handler to trigger the event...   That should work.....

     

    Royce

     

     

     

     

     

    0
    Comment actions Permalink
  • Avatar
    Royce Arnold

    Looks like there is no way to edit the post.  So here is a grammar checked version.

    OK, I just re-read your problem.

    I do not think it is a good idea to call ctl_timeout_wait(ctl_get_current_time()) to ensure the task executes every tick....  Unless you completely disable interrupts, there is a non-zero (but small) chance than an interrupts could happen during the time that function executes.  This could create a condition where the value being read has already ocurred.  That probably leaves it waiting for the next occurrence far into the future (2^32 ticks).  You could try testing this by manually changing the value when it fails.

    I always add 1 tick period at a minimum to a ctl_get_current_time() call.   However, for what you are trying to do, that probably will not work.

     

    The reason that I dumped the standard CTL SysTick_Handler was to do something similar to what you are doing.  For example, I have a series of counters that are incremented/decremented on each execution of SysTick that I use for timing throughout the my program.  I think you can also do something with userTimerISR() which CTL calls every tick.

     

    You might be able to keep your task as designed, but modify the SysTick_Handler to trigger the event...   That should work.....

     

    Royce

    0
    Comment actions Permalink
  • Avatar
    Gordon Scott

    Hi Royce,

    No chance to edit. That's so often frustrating when you see a typo fter the post, or realise there's extra information that would have been useful. Ho hum.

    I changed to every second tick, though I'm not sure it really should make a difference. It didn't solve the problem. Originally I needed 1ms, but now I think I can safely use longer. In practice, the exec is looking for timer expired and it doesn't care by how much provided the sign is right, hence the 0x8xxxxx numbers. One of the problems, of course, with non-zero-but-small is that unless they're fixed, the most one can do is reduce the incidence of them happening. It'll still happen sometimes and will likely then be even harder to track down.

    The problem seems to occur right in the ctl_timout_wait() routine and the timeout value is not expired, it's the previous value marked as done. The scheduler marks it done by adding 0x7fffffff to it, so the result is a number of the form 0x80012345, which will not expire for a very long time indeed. The 0x7fffffff is the timeslice period, so if using timeslicing, the timeout would instead be scheduled for the reasonable future.

    I think there are a few ways I can keep the task essentially as it is, without finding the real problem; One is to change the timer-wait for an event wait, and fire an event every ms (probably that's what you're suggesting above), one is to use another task to monitor the status & timeout values and fix them when they go wrong.

    My concern is that I think there _is_ a real problem and I'd like to find it, fix it, and propogate the fix to the CTL code if that's appropriate ... I haven't yet proven that it's not _my_ bug :-).

    If it is my bug, a FAQ or caveat somewhere could help others avoid it.

    This does bear all the hallmarks of an interrupt getting in sometime/place where it shouldn't. I just don't yet know where :-/

     

    Gordon.

    0
    Comment actions Permalink
  • Avatar
    Royce Arnold

    Gordon,

    Looks like you've explored deeper into the CTL source than I.

    I am using timeslicing with period of 1 ms.  That might explain why I'm not seeing the same behavior you are.

    I'll be curious to see what you find.  If I can think of any other suggestions, I'll let you know.

    Royce

    0
    Comment actions Permalink
  • Avatar
    Gordon Scott

    For info .. arm-unknown-eabi makes no difference.

     

    It appears I do indeed have an interrupt being serviced, at least whilst I'm single-stepping, despite the single-stepping being sandwiched between

      ctl_global_interrupts_disable();
    and

      ctl_global_interrupts_enable();

    The former of those sets basepri to 0x80, which presumably should stop all(?) interrupts, however one of my interrupt handlers still runs and on entry, basepri is still at 0x80.

    Within my isr, the call to ctl_enter_isr() first sets basepri to 0x80, does some work, then sets basepri to 0 and ctl_exit_isr() restores it to 0x80 on exit.

    So my interrupt is running and presumably, whilst it's running and other interrupt can interrupt it, despite that 'global disable'.

    Anyone know if this is a quirk of debugging, or if I'm misunderstanding ctl_global_interrupts_disable(), or something?

    0
    Comment actions Permalink
  • Avatar
    Gordon Scott

    Hm,  I understood that 0 is the highest mask, implying the above is something like the inverse from what it should be.

    I'm not sure yet that I believe that, or maybe 0x80 is -127 and is the highest possible priority?   I'm confused.

    0
    Comment actions Permalink
  • Avatar
    Royce Arnold

    With respect to the processor core, internal interrupts are negative and external interrupts are positive.

     

    So, 0x80 is -127, so it should block anything greater than or equal to it.  The nonmaskable interrupt is -14.  I'm not sure who wins that fight.  There could we other nonmaskable interrupts (I have not checked).

     

    0
    Comment actions Permalink
  • Avatar
    Royce Arnold

    >>Within my isr, the call to ctl_enter_isr() first sets basepri to 0x80, does some work, then sets basepri to 0 >>and ctl_exit_isr() restores it to 0x80 on exit.

    That is disturbing.  That COULD allow for a PENDSV interrupt to occur during the interrupt being serviced unless "does some work" somehow disables the PENDSV execution in some manner.  This surely has to be handled in some manner, or it would cause all sorts of issues.

     

    0
    Comment actions Permalink
  • Avatar
    Gordon Scott

    -127sounds rather as I thought was sensible (NMI notwithstanding) and pretty reasonable.

    It occurs to me that NMI might not be maskable with PRIMASK, but can be 'outranked' by BASPRI. I may investigate that sometime.

    That still leaves me with a puzzle how I can get an interrupt from a peripheral (USART3) whilst BASPRI contains 0x80.

    Hmmm .. I was going to post a screenshot of that, just to prove I wasn't being silly, but it's not now doing it.

    Woah .. hold on here.  I'd changed the USART priorities in NVIC from 8/8 to 7/7. Now I've changed it back to 8/8 and it's doing it again. Image attached.

    If I'm going mad, at least I'm doing so repeatably :-)

    With 7/7, the ctl_timeout_wait() is still getting interrupted, as BASPRI still gets set to zero, so I guess changing the priority pushed the problem somewhere else.

     

    0
    Comment actions Permalink
  • Avatar
    Royce Arnold

    When you say you set the NVIC to 8/8 and 7/7, what do mean?   Are you talking about the preemption priority and the sub priority?  If so, that doe not seem right....

     

    One possible issue that could be debugger related,  some registers could have their interrupt cleared when the register is read. Not sure if there are any like that on the STM32, but I've seen it on others....  

    0
    Comment actions Permalink
  • Avatar
    Gordon Scott

    Having made the image of ctl_timeout_wait(), I thought I may as well post it.

    This shows BASEPRI reverted to zero whilst still between the disable/enable calls, one single-step after the break point.

    So I think it's been interrupted by something and can, now at least, be interrupted by anything.

    0
    Comment actions Permalink
  • Avatar
    Gordon Scott

    Yes:

        /* Enable the USARTx Interrupt */
        NVIC_InitStructure.NVIC_IRQChannel = USART3_IRQn;
        NVIC_InitStructure.NVIC_IRQChannelPreemptionPriority = 8;   // midway for now [0 is highest, 15 is lowest]
        NVIC_InitStructure.NVIC_IRQChannelSubPriority = 8;          // midway for now [0 is highest, 15 is lowest]
        NVIC_InitStructure.NVIC_IRQChannelCmd = ENABLE;
        NVIC_Init(&NVIC_InitStructure);

    I _think_ it's not debugger related as I break within the ctl_timer_wait() as in the snapshot on 20th Sept.

    I did that with the express intention of removing debugger influence, though of course there might be something else going on of which I'm not aware.

    0
    Comment actions Permalink
  • Avatar
    Royce Arnold

    I think that may be related to your issue....   The NVIC is one complicated beast.

     

    The Preemption and Sub Prioritizes allowed ranges change based on what NVIC_PriorityGroup is being used.  There is not combination that is valid for either 8/8 or 7/7.    Open up misc.c from the STDPERIH library and read the  "How to configure Interrupts using driver" comments. 

    I personally use the NVIC_PriorityGroup4 and avoid the subpriority all together....  

     

    0
    Comment actions Permalink
  • Avatar
    Gordon Scott

    Well, that's definitely made a difference.

    I had two NVIC mistakes like that and set them both to priority group 4, one to priority 8, one to 9, both sub-priority 0.

    Single-stepping seems OK. Trying free running with the my_mark breakpoint.

    It's being OK at the moment, but this sometimes can take a while.

    0
    Comment actions Permalink

Please sign in to leave a comment.