Letux Kernel

Issue 876: OMAP5: reboot hangs

Reported by Nikolaus Schaller, Apr 4, 2018

The reboot command restarts MLO and U-Boot, but the kernel does not 
start (again).

Maybe some DPLL is not shut down or has a problem when still running.

Comment 1 by Nikolaus Schaller, Apr 4, 2018

Labels: Pyra

Comment 2 by Nikolaus Schaller, Apr 4, 2018

Labels: Device:Pyra Pyra

Comment 3 by daveshah, Jun 18, 2020

Is this still a problem on the Pyra? Reboot works fine on my uEVM 
using latest Letux U-boot and letux-kernel 5.6.y.

Comment 4 by Nikolaus Schaller, Jun 18, 2020

AFAIK yes. There is some unknown difference to the uEVM which always 
reboots fine. It may (or may not) have something to do with the 
LCD/DSI setup which is not available on the uEVM.

Comment 5 by daveshah, Jul 23, 2020

I can reproduce this on my Pyra, both from a software reboot command 
and by doing a forced reset using Power+L2.

Comment 6 by daveshah, Jul 23, 2020

For reference, the full output for 5.6.y:

https://dev.pyra-handheld.com/snippets/765

Comment 7 by Nikolaus Schaller, Jul 23, 2020

Yes, I can confirm almost the same log after "reboot" for 
letux-5.8-rc5.

For completeness: the problem is very old (at least 2016) and I have 
found an excerpt of a boot log from back then:

[    1.504410] clock: dpll_abe_ck failed transition to 'locked'
[    2.803634] clock: dpll_abe_ck failed transition to 'locked'
[    4.102881] clock: dpll_abe_ck failed transition to 'locked'
[    5.402054] clock: dpll_abe_ck failed transition to 'locked'
[    5.404696] ------------[ cut here ]------------
[    5.404713] WARNING: CPU: 0 PID: 1 at drivers/clk/clk.c:679 clk_disable+0x34/0x40
[    5.404720] Modules linked in:
[    5.404736] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.6.0-rc3-letux+ #50
[    5.404743] Hardware name: Generic OMAP5 (Flattened Device Tree)
[    5.404765] [<c02277dc>] (unwind_backtrace) from [<c0223f7c>] (show_stack+0x10/0x14)
[    5.404781] [<c0223f7c>] (show_stack) from [<c0517f24>] (dump_stack+0x88/0xc0)
[    5.404798] [<c0517f24>] (dump_stack) from [<c02497a0>] (__warn+0xc0/0xec)
[    5.404813] [<c02497a0>] (__warn) from [<c02497e8>] (warn_slowpath_null+0x1c/0x20)
[    5.404826] [<c02497e8>] (warn_slowpath_null) from [<c06e605c>] (clk_disable+0x34/0x40)
[    5.404842] [<c06e605c>] (clk_disable) from [<c0235e4c>] (_disable_clocks+0x18/0x70)
[    5.404856] [<c0235e4c>] (_disable_clocks) from [<c0237050>] (_enable.part.15+0x20c/0x248)
[    5.404869] [<c0237050>] (_enable.part.15) from [<c0e0ec7c>] (_setup+0xc0/0x200)
[    5.404881] [<c0e0ec7c>] (_setup) from [<c02375dc>] (omap_hwmod_for_each+0x28/0x58)
[    5.404894] [<c02375dc>] (omap_hwmod_for_each) from [<c0e0efac>] (__omap_hwmod_setup_all+0x30/0x40)
[    5.404908] [<c0e0efac>] (__omap_hwmod_setup_all) from [<c02018c0>] (do_one_initcall+0x100/0x1b8)
[    5.404920] [<c02018c0>] (do_one_initcall) from [<c0e00d28>] (do_basic_setup+0x98/0xd4)
[    5.404931] [<c0e00d28>] (do_basic_setup) from [<c0e00de8>] (kernel_init_freeable+0x84/0x124)
[    5.404946] [<c0e00de8>] (kernel_init_freeable) from [<c0803ff0>] (kernel_init+0x8/0x110)
[    5.404960] [<c0803ff0>] (kernel_init) from [<c0220708>] (ret_from_fork+0x14/0x2c)
[    5.405021] ---[ end trace fafe8ae97cceeb76 ]---
[    5.405032] omap_hwmod: dmic: _wait_target_ready failed: -16
[    5.405046] omap_hwmod: dmic: cannot be enabled for reset (3)
[    6.731358] clock: dpll_abe_ck failed transition to 'locked'
[    8.030578] clock: dpll_abe_ck failed transition to 'locked'
[

This was still hwmod-based code (the log above is from 4.6.0-rc3), 
while newer kernels use sysc.
What both have in common is that something is wrong with the abe_ck.
But it is not clear why the uEVM successfully reboots. ABE & 
twl6040 etc. are connected differently.

I also dug out notes from some tests I ran in 2018. According to 
them it may be something related to timer8 (backlight PWM), which 
also uses and enables the abe clock (!).
The uEVM does not use timer8, so that may be a significant 
difference.

I.e. it might be a bug in enabling/disabling the abe clock for ABE 
and timer8 which breaks when doing a reboot.
Maybe there is some "locked" bit in the abe_clk control 
which is not unlocked after a reboot.

Unfortunately my tests did not come to a final conclusion or fix.

The initial starting point would be

root@letux:~# cat <<END >/etc/modprobe.d/pwm.conf 
> blacklist pwm_bl
> blacklist pwm_omap_dmtimer
> END
root@letux:~# reboot

and my notes said that it then reboots fine (without backlight of 
course).

A quick test with letux-5.8-rc5 shows the same symptom:
- reboot hangs with the abe "failed transition to 'locked'" error and
- blacklisting pwm makes it succeed, although there is a 10 second 
pause between "Starting Kernel..." and the start of the log, and the 
message "clock: dpll_abe_ck failed transition to 'locked'" still 
appears.

Some random ideas:
* is the abe_clk already locked on reboot, so that the driver just 
misses the "transition to 'locked'" it waits for?
* does this make some initialization code path fail, so that a later 
assumption that something is always initialized no longer holds and 
ends in the problems?
* maybe this can be checked by identifying the code that prints the 
message "failed transition to 'locked'" and adding code to 
print the locked state bits before and after significant operations
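
To make the last idea a bit more concrete, here is a minimal shell 
sketch for decoding raw register values once they have been captured 
(for example by an extra debug print near the code that emits the 
message, presumably _omap3_wait_dpll_status() in 
drivers/clk/ti/dpll3xxx.c, which should be verified). The bit positions 
are assumptions based on the usual OMAP4/5 DPLL register layout and 
must be checked against the OMAP5 TRM. The function names are made up 
for illustration.

```shell
# Decode raw DPLL register values captured before/after an operation.
# ASSUMPTION: bit 0 of the DPLL idle-status register is the "locked" flag
# and bits [2:0] of the clock-mode register are the DPLL_EN field.
# Both function names are hypothetical helpers, not existing tools.

dpll_is_locked() {
    # $1: raw idle-status register value (hex like 0x101, or decimal)
    if [ $(( $1 & 0x1 )) -ne 0 ]; then
        echo locked
    else
        echo not-locked
    fi
}

dpll_en_mode() {
    # $1: raw clock-mode register value; prints the DPLL_EN field as decimal
    echo $(( $1 & 0x7 ))
}
```

Comparing the output for values taken right after U-Boot hands over 
and again after the failed lock attempt would show whether the DPLL 
is already reported as locked when the driver starts waiting.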

Comment 8 by Nikolaus Schaller, Jul 23, 2020

A "not" is missing:

"ABE & twl6040 etc. are connected differently."

=>

"ABE & twl6040 etc. are NOT connected differently."

Comment 9 by daveshah, Jul 23, 2020

Moving a bit closer to a solution. Looking around at various OMAP5 
code, I found

https://gitlab.com/linux-omap4-dev/omapboot/-/blob/kexec_support/arch/omap5/clock.c#L335

My understanding is that both CM_CLKSEL_ABE_PLL_REF and 
CM_CLKSEL_WKUPAON should be set to 1 in our environment. But only 
CM_CLKSEL_ABE_PLL_REF was being set to 1, CM_CLKSEL_WKUPAON was 0.

I wrote a very hacky patch to set CM_CLKSEL_WKUPAON to 1 as part of 
clock init (some ioremap and iowrite32 fudging) and reboot now seems 
stable. I need to decouple this from a few other hacks in my tree 
but all going well I should have a proper patch for this soon.
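
A quick way to check for the condition described above is to compare 
both register values against the expected 1. The following is only a 
diagnostic sketch in shell, not the patch itself: the function name is 
hypothetical, the raw values must come from somewhere else (a register 
dump or a debug print), and it assumes that only bit 0 of each register 
is relevant.

```shell
# Report which of the two clock-select registers deviates from the
# expected value 1. This sketch does not touch hardware; the caller
# supplies the raw values.
# ASSUMPTION: only bit 0 of each register matters here.
check_abe_clksel() {
    # $1: raw CM_CLKSEL_ABE_PLL_REF value
    # $2: raw CM_CLKSEL_WKUPAON value
    _rc=0
    if [ $(( $1 & 0x1 )) -ne 1 ]; then
        echo "CM_CLKSEL_ABE_PLL_REF is not 1"
        _rc=1
    fi
    if [ $(( $2 & 0x1 )) -ne 1 ]; then
        echo "CM_CLKSEL_WKUPAON is not 1"
        _rc=1
    fi
    return $_rc
}
```

For the state described in this comment, check_abe_clksel 1 0 would 
flag only CM_CLKSEL_WKUPAON.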

Comment 10 by daveshah, Jul 23, 2020

This patch shows what is needed to be changed and does result in a 
stable reboot every time:

https://dev.pyra-handheld.com/snippets/770

Now I need to work out if there is some existing code that's 
supposed to set this bit but is failing, or whether it needs to be 
added somewhere.

Comment 11 by Nikolaus Schaller, Jul 23, 2020

Wow, cool! So quick to find a solution :)
Please can you write to linux-omap and Tony Lindgren for further 
discussion?

Comment 12 by Nikolaus Schaller, Jul 24, 2020

I have tested the hack and it is like magic: "reboot" works :)

This also makes it easier to test (bisect) kernel variants fully 
automatically, because I can now install a new kernel binary through 
the USB gadget driver to the SD card and reboot through ssh. This is a 
very powerful and helpful tool (I have used it many times on other 
boards) but it was never possible to apply it to the Pyra.
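
Such an automated test cycle could look roughly like the sketch below. 
The host name root@pyra, the destination path /boot/uImage and both 
function names are placeholders, not the actual setup; it only assumes 
that scp and ssh to the target work over the USB gadget network.

```shell
# Retry a probe command until it succeeds or the attempts run out.
# $1: max attempts, $2: seconds between attempts, rest: probe command
wait_until() {
    tries=$1
    delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        if "$@"; then
            return 0
        fi
        tries=$(( tries - 1 ))
        sleep "$delay"
    done
    return 1
}

# Copy a kernel image to the target, reboot it and wait for ssh to return.
# HYPOTHETICAL defaults: target "root@pyra", image path /boot/uImage.
test_kernel() {
    image=$1
    target=${2:-root@pyra}
    scp "$image" "$target:/boot/uImage" || return 1
    # The connection usually drops during reboot, so ignore the exit status.
    ssh "$target" reboot || true
    sleep 5    # give the target time to go down before probing
    wait_until 30 2 ssh -o ConnectTimeout=2 "$target" true
}
```

A wrapper script that builds the kernel and ends with test_kernel 
could then be handed to "git bisect run", which treats exit status 0 
as good and 1 as bad.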

Let's see what Tony (or other omap developers) will suggest. 
Probably they are currently focussing on v5.9-rc1.

Comment 13 by Nikolaus Schaller, Oct 27, 2020

Still works with v5.9.y.
Can be closed, although it still needs an upstream solution.
Status: Fixed

Created: 7 years 1 month ago by Nikolaus Schaller

Updated: 4 years 6 months ago

Status: Fixed

Labels:
Type:Defect
Priority:Critical
Device:Pyra